Audio ML Papers

Last 7 Days (March 09 - March 16, 2026)

Subcategories: All (45) | Speech Synthesis (7) | Music Synthesis (2) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (3) | ASR (9) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (18)
← Previous Week | Current Week

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 88)
Yaofeng Su, Yuming Li, Zeyue Xue ... · arXiv
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidire...
#2 TOP PAPER (Score: 86)
Aviad Dahan, Moran Yanuka, Noa Kraicer ... · arXiv
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference record...
#3 TOP PAPER (Score: 86)
Kele Xu, Yifan Wang, Ming Feng ... · arXiv
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradig...
Thursday, March 12, 2026
Yaofeng Su, Yuming Li, Zeyue Xue ... · arXiv
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidire...
Kele Xu, Yifan Wang, Ming Feng ... · arXiv
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradig...
Yongjoon Lee, Jung-Woo Choi · arXiv
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote gen...
Xiquan Li, Junxi Liu, Wenxi Chen ... · arXiv
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direc...
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim ... · ICLR 2026
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have recently gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding l...
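The final synthesis step such a vocoder performs can be illustrated with SciPy. This is a minimal sketch, not the paper's model: the network-predicted complex spectrogram is stood in for by an actual STFT of a test tone, so the example only shows the single inverse transform that replaces a learned upsampling stack.

```python
import numpy as np
from scipy.signal import stft, istft

# An iSTFT-based vocoder predicts a complex-valued spectrogram with a network
# and synthesizes the waveform with one inverse STFT. Here the "predicted"
# spectrogram is faked with a forward STFT of a 440 Hz tone (illustration only).
fs, n = 16000, 16000
t = np.arange(n) / fs
wave = 0.5 * np.sin(2 * np.pi * 440 * t)

_, _, spec = stft(wave, fs=fs, nperseg=512)   # stand-in for the model output
_, recon = istft(spec, fs=fs, nperseg=512)    # the vocoder's only synthesis op
recon = recon[:n]                             # trim boundary padding
```

With the default Hann window and 50% overlap the transform satisfies the COLA condition, so `recon` matches `wave` to numerical precision; a real vocoder's output quality depends instead on how well the network predicts `spec`.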
Joonyong Park, Jerry Li · arXiv
Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliabl...
Xiangyuan Xue, Jiajun Lu, Yan Gao ... · arXiv
Speech Emotion Captioning (SEC) leverages large audio-language models to generate rich, context-aware affective descriptions from speech. However, real-world deployment remains challenging due to the substantial computational demands on resource-constrained edge devices and the p...
Wednesday, March 11, 2026
Hillary Mutisya, John Mugane · arXiv
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised finetuning. With 20,000 labeled samples, we a...
Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez ... · arXiv
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-a...
Duojia Li, Shuhan Zhang, Zihan Qian ... · arXiv
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases l...
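The latency cost of multi-step sampling mentioned above comes from the number of function evaluations (NFEs) in ODE-based generation. A minimal sketch with a toy closed-form velocity field (standing in for a learned network; all names here are illustrative) shows why each Euler step costs one network call:

```python
import numpy as np

def velocity(x, t, target):
    # Stand-in for a learned flow-matching network v_theta(x, t); with this
    # toy field the probability-flow ODE transports x0 to `target` at t = 1.
    return (target - x) / max(1.0 - t, 1e-6)

def euler_sample(x0, target, nfe):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with `nfe` Euler steps."""
    x, dt = x0.copy(), 1.0 / nfe
    for i in range(nfe):                        # one network call per step
        x = x + velocity(x, i * dt, target) * dt
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
target = np.array([1.0, -2.0, 0.5, 3.0])
sample = euler_sample(noise, target, nfe=32)    # 32 NFEs of latency
```

Cutting `nfe` reduces latency linearly but, with a learned (imperfect) velocity field, degrades fidelity; distillation and few-step methods aim to escape that trade-off.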
Kaituo Xu, Yan Jia, Kai Huang ... · arXiv
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All m...
Jing Peng, Ziyi Chen, Haoyu Li ... · arXiv
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritiz...
Yuanbo Hou, Yanru Wu, Qiaoqiao Ren ... · arXiv
Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to se...
Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang ... · arXiv
The Mean Opinion Score (MOS) serves as the standard metric for speech quality assessment, yet biases in human annotations remain underexplored. We conduct the first systematic analysis of gender bias in MOS, revealing that male listeners consistently assign higher scores than fem...
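A MOS is simply the arithmetic mean of listeners' 1-5 ratings, so the kind of group-gap analysis described above reduces to comparing group means. The ratings below are made-up numbers, purely to show the computation:

```python
import numpy as np

# Hypothetical 1-5 ratings of the same audio, split by listener gender.
scores_male = np.array([4.0, 4.5, 4.0, 5.0])
scores_female = np.array([3.5, 4.0, 3.5, 4.0])

mos_male = scores_male.mean()        # MOS from male listeners
mos_female = scores_female.mean()    # MOS from female listeners
gap = mos_male - mos_female          # a positive gap matches the reported bias
```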
Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh ... · arXiv
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing counterm...
Yinfeng Xia, Jian Tang, Junfeng Hou ... · arXiv
Although the deep integration of the Automatic Speech Recognition (ASR) system with Large Language Models (LLMs) has significantly improved accuracy, the deployment of such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unifie...
Nolan Chan, Timmy Gang, Yongqian Wang ... · ICASSP 2026
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio generation (V2A) models, incorporating key adaptations to effectively align generated audio with human preferences. Our approach incorporates three core in...
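DPO itself has a compact closed form. This is a generic sketch of the loss for one preference pair, not the paper's V2A-specific adaptations: it pushes up the policy's log-likelihood margin of the preferred sample over the rejected one, relative to a frozen reference model.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (preferred w, rejected l) pair.

    logp_* are sequence log-likelihoods under the trained policy,
    ref_logp_* under the frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))   # -log sigmoid(margin)

# At zero margin the loss is log(2); it shrinks as the policy learns to
# prefer w over l more strongly than the reference does.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)
```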
Yujie Liao, Xuelong Geng, Hongfei Xue ... · arXiv
Recent advancements in Speech Large Language Models have significantly enhanced multi-dimensional speech understanding. However, the majority of high-performance frameworks are predominantly optimized for GPU centric ecosystems and proprietary backbones, creating a significant ga...
Artem Dvirniak, Evgeny Kushnir, Dmitrii Tarasov ... · arXiv
The modern generative audio models can be used by an adversary in an unlawful manner, specifically, to impersonate other people to gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods started to evolve. Unfortunately, current SDD met...
Hao Shi, Yusuke Fujita, Roman Koshkin ... · arXiv
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encode...
Tianyu Xu, Sieun Kim, Qianhui Zheng ... · arXiv
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-...
Tuesday, March 10, 2026
Aviad Dahan, Moran Yanuka, Noa Kraicer ... · arXiv
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference record...
Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang ... · arXiv
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges o...
Haoyuan Yang, Mu Yang, Jiamin Xie ... · arXiv
Recent advances in zero-shot voice conversion have exhibited potential in emotion control, yet the performance is suboptimal or inconsistent due to their limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion back...
Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo ... · arXiv
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-a...
Dehua Tao, Xuan Luo, Daxin Tan ... · arXiv
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for ...
Xiaobin Rong, Jun Gao, Zheng Wang ... · arXiv
Achieving high perceptual quality without hallucination remains a challenge in generative speech enhancement (SE). A representative approach, PASE, is robust to hallucination but has limited perceptual quality under adverse conditions. We propose StuPASE, built upon PASE to achie...
Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak · arXiv
Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors the foundational Audio Large Language Model for multidimensional reasoning, ...
Soumya Dutta · arXiv
Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human e...
Laya Iyer, Angelina Wang, Sanmi Koyejo · EACL 2026
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech re...
Rui Wang, Zhifei Zhang, Yu Gao ... · arXiv
Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimi...
Monday, March 09, 2026
Pol Buitrago, Pol Gàlvez, Oriol Pareras ... · arXiv
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resourc...
Avihu Dekel, Samuel Thomas, Takashi Fukada ... · arXiv
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully ...
Bence Mark Halpern, Thomas Tienkamp, Defne Abur ... · arXiv
Automatic speech intelligibility assessment is crucial for monitoring speech disorders and therapy efficacy. However, existing methods are difficult to compare: research is fragmented across private datasets with inconsistent protocols. We introduce PathBench, a unified benchmark...
Nikita Kuzmin, Tao Zhong, Jiajun Deng ... · arXiv
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states o...
Andong Li, Tong Lei, Zhihang Sun ... · arXiv
Although deep neural networks have facilitated significant progress of neural vocoders in recent years, they usually suffer from intrinsic challenges like opaque modeling, inflexible retraining under different input configurations, and parameter-performance trade-off. These inher...
Ayush Barik, Sofia Stoica, Nikhil Sarda ... · arXiv
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusi...
Henry Li Xinyuan, Zexin Cai, Lin Zhang ... · arXiv
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice...
Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan ... · arXiv
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Addit...
Zihao Fang, Yingda Shen, Zifan Guan ... · arXiv
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic re...
Shangeth Rajaa · arXiv
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, whic...
Lucas Rakotoarivony · arXiv
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In p...
Phillip Long, Zachary Novack, Chris Donahue ... · arXiv
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchm...
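The link between a language model and lossless compression is the standard information-theoretic one: under (near-optimal) arithmetic coding, a sample with model probability p costs about -log2 p bits. A minimal, generic sketch (not the paper's benchmark code):

```python
import numpy as np

def bits_per_sample(probs):
    """Average code length, in bits, given the model's probability of each
    observed sample under arithmetic coding (ignoring coder overhead)."""
    probs = np.asarray(probs, dtype=np.float64)
    return float(np.mean(-np.log2(probs)))

# A uniform model over 16-bit PCM values assigns p = 1/65536 to every sample,
# giving exactly 16 bits/sample, i.e. no compression over raw storage.
bps_uniform = bits_per_sample(np.full(1000, 1.0 / 65536))
```

A waveform LM compresses exactly to the extent that it beats this uniform baseline on the observed samples.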