Audio ML Papers

Last 7 Days (March 28 - April 04, 2026)

Subcategories: All (25) | Speech Synthesis (6) | Music Synthesis (0) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (2) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 84)
Runkun Chen, Yixiong Fang, Pengyu Chang ... · arXiv
Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to le...
#2 TOP PAPER (Score: 83)
Detai Xin, Shujie Hu, Chengzuo Yang ... · arXiv
We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongC...
#3 TOP PAPER (Score: 83)
Ashish Seth, Sonal Kumar, Ramaneswaran Selvakumar ... · arXiv
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), an attack suite with an accompanying evaluation set, AHA-Eval, comprising 6.5K QA pairs designed to t...
Thursday, April 02, 2026
Chengyou Wang, Hongfei Xue, Chunjiang He ... · arXiv
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches eith...
Xiaobin Rong, Yushi Wang, Zheng Wang ... · ICASSP 2026
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the wa...
Yi Ma, Shuai Wang, Tianchi Liu ... · IEEE Transactions on Audio, Speech and Language Processing
Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonet...
Hongjun Liu, Rujun Han, Leyu Zhou ... · arXiv
Recent ECG-language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical...
Chihiro Arata, Kiyoshi Kurihara · arXiv
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We p...
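To make the positional-capacity point above concrete: in a decoder-only codec LM, the text prefix and the generated audio tokens share a single position axis, so long utterances push the text ever farther from the decoding frontier. A toy sketch, where all lengths and token ids are invented for illustration:

```python
# Toy illustration (invented lengths/ids): text prefix and generated audio
# tokens share one positional axis in a decoder-only codec LM.
text_tokens = list(range(0, 40))                  # 40 text tokens
audio_tokens = list(range(1000, 1000 + 1500))     # 1500 generated codec frames
sequence = text_tokens + audio_tokens
positions = list(range(len(sequence)))            # single shared position axis
# From the newest audio token, every text token is >= 1500 positions away,
# while recent audio is only 1..k away, so text conditioning fades with length.
print(f"text spans positions 0-{len(text_tokens) - 1}; "
      f"newest audio token is at position {positions[-1]}")
```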
Fuxiang Tao, Dongwei Li, Shuning Tang ... · arXiv
Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initiall...
Wednesday, April 01, 2026
Vojtěch Staněk, Martin Perešíni, Lukáš Sekanina ... · WCCI CEC 2026
While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary mu...
Jeremy Zhengqi Huang, Emani Hicks, Sidharth ... · arXiv
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an int...
Xiquan Li, Xuenan Xu, Ziyang Ma ... · arXiv
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with...
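For context, the clip-level side of CLAP-style contrastive pretraining is typically a symmetric InfoNCE objective over paired audio and text embeddings; a minimal sketch follows, where the shapes, temperature, and random inputs are illustrative assumptions, not this paper's setup:

```python
# Illustrative sketch of a CLAP-style symmetric InfoNCE loss (not the paper's
# code). Assumes precomputed, paired audio and text embeddings.
import torch
import torch.nn.functional as F

def clip_level_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim) tensors of paired embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy: the i-th audio clip should match the i-th text.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

# Example with random stand-in embeddings:
loss = clip_level_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Frame-level extensions of the kind the paper targets apply an analogous matching objective at finer temporal granularity.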
Awais Khan, Muhammad Umar Farooq, Kutub Uddin ... · arXiv
Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and m...
Tuesday, March 31, 2026
Ui-Hyeop Shin, Hyung-Min Park · arXiv
Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-spl...
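For reference, the standard TF-domain pipeline such models build on applies per-source masks to the mixture's STFT; a generic sketch, with `mask_net` as a placeholder for any learned masking network (not the architecture proposed here):

```python
# Generic mask-based time-frequency separation: an illustrative sketch of the
# standard TF-domain pipeline, not the model proposed in the paper.
import torch

def separate_with_masks(mixture, mask_net, n_fft=512, hop=128):
    """mixture: (batch, samples). mask_net maps |STFT| -> per-source masks."""
    window = torch.hann_window(n_fft, device=mixture.device)
    spec = torch.stft(mixture, n_fft, hop, window=window, return_complex=True)
    masks = mask_net(spec.abs())  # (batch, n_sources, freq, frames) in [0, 1]
    est_specs = masks * spec.unsqueeze(1)  # mask the complex mixture spectrum
    b, s, f, t = est_specs.shape
    return torch.istft(est_specs.reshape(b * s, f, t), n_fft, hop,
                       window=window).reshape(b, s, -1)

# Example with a trivial "network" that outputs two all-0.5 masks:
mix = torch.randn(1, 16000)
dummy_net = lambda mag: torch.full((mag.size(0), 2, *mag.shape[1:]), 0.5)
sources = separate_with_masks(mix, dummy_net)  # (1, 2, ~16000)
```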
Ashish Seth, Sonal Kumar, Ramaneswaran Selvakumar ... · arXiv
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), an attack suite with an accompanying evaluation set, AHA-Eval, comprising 6.5K QA pairs designed to t...
Detai Xin, Shujie Hu, Chengzuo Yang ... · arXiv
We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongC...
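For intuition, a generic denoising-diffusion training step on text-conditioned audio latents is sketched below; the `denoiser` callable, the shapes, and the linear noise schedule are placeholder assumptions and do not reflect LongCat-AudioDiT's actual design:

```python
# Generic denoising-diffusion training step on text-conditioned audio latents:
# an illustrative sketch only, not LongCat-AudioDiT's architecture.
import torch

def diffusion_tts_step(denoiser, latents, text_emb, num_steps=1000):
    """latents: (batch, T, dim) clean audio latents; text_emb: conditioning."""
    t = torch.randint(0, num_steps, (latents.size(0),), device=latents.device)
    noise = torch.randn_like(latents)
    # Simple linear alpha-bar schedule (placeholder; real models tune this).
    alpha_bar = (1.0 - t.float() / num_steps).view(-1, 1, 1)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
    pred_noise = denoiser(noisy, t, text_emb)  # network predicts the noise
    return torch.nn.functional.mse_loss(pred_noise, noise)

# Example with a stand-in denoiser that ignores its conditioning:
dummy = lambda x, t, c: torch.zeros_like(x)
loss = diffusion_tts_step(dummy, torch.randn(2, 100, 64), torch.randn(2, 32))
```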
Sahil Kumar, Namrataben Patel, Honggang Wang ... · ICLR 2026
MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled cond...
Zeyu Jin, Songtao Zhou, Haoyu Wang ... · ICLR 2026
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like ...
Monday, March 30, 2026
Runkun Chen, Yixiong Fang, Pengyu Chang ... · arXiv
Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to le...
Marco Hidalgo-Araya, Raphaël Trésor, Bart Van Erp ... · arXiv
Speech enhancement in hearing aids remains a difficult task in nonstationary acoustic environments, mainly because current signal processing algorithms rely on fixed, manually tuned parameters that cannot adapt in situ to different users or listening contexts. This paper introduc...
Kexin Huang, Liwei Fan, Botian Jiang ... · arXiv
Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applicat...
Anuj Diwan, Eunsol Choi, David Harwath · arXiv
We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond...
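At inference, dual-encoder style models of this kind are typically used by ranking candidate captions against a speech embedding via cosine similarity; a minimal sketch with random stand-in embeddings (ParaSpeechCLAP's actual encoders and API are not shown in the abstract):

```python
# Caption ranking with a dual-encoder style model: illustrative sketch with
# random embeddings standing in for real encoder outputs.
import torch
import torch.nn.functional as F

def rank_captions(speech_emb, caption_embs, captions):
    """speech_emb: (dim,), caption_embs: (n, dim). Returns captions by score."""
    sims = F.cosine_similarity(speech_emb.unsqueeze(0), caption_embs, dim=-1)
    order = torch.argsort(sims, descending=True)
    return [(captions[i], sims[i].item()) for i in order]

captions = ["a high-pitched, excited voice", "a low, calm whisper"]
ranked = rank_captions(torch.randn(512), torch.randn(2, 512), captions)
```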
Sujith Pulikodan, Abhayjeet Singh, Agneedh Basu ... · arXiv
Project VAANI is an initiative to create an India-representative multi-modal dataset that comprehensively maps India's linguistic diversity, starting with 165 districts across the country in its first two phases. Speech data is collected through a carefully structured process tha...
Ashwini Dasare, Nirmesh Shah, Ashishkumar Gudmalwar ... · ICASSP 2026
Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We...
Jia-Kai Dong, Yu-Xiang Lin, Hung-Yi Lee · arXiv
We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). As audio encodes non-semantic information, it induces severe train-test distribution shifts and can lead to spurious MIA performance. Using a multi-modal blind...
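For reference, the simplest MIA baseline in the literature is a loss threshold: samples whose loss falls below a calibrated cutoff are predicted to be training members. A sketch with hypothetical loss values (this is the generic baseline, not the paper's multi-modal protocol):

```python
# Classic loss-threshold membership inference baseline: an illustrative
# sketch, not the evaluation protocol used in the paper.
import numpy as np

def loss_threshold_mia(losses, threshold):
    """Predict 'member' (1) when a sample's loss is below the threshold."""
    return (np.asarray(losses) < threshold).astype(int)

# Calibrate the threshold on held-out member/non-member losses, e.g. the
# midpoint of the two means (a crude but common heuristic):
member_losses = np.array([0.8, 1.1, 0.6])      # hypothetical values
nonmember_losses = np.array([2.3, 1.9, 2.7])   # hypothetical values
thr = 0.5 * (member_losses.mean() + nonmember_losses.mean())
preds = loss_threshold_mia(np.concatenate([member_losses, nonmember_losses]), thr)
```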
Sunday, March 29, 2026
Xinyuan Xie, Shunian Chen, Zhiheng Liu ... · arXiv
Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extra...
Yuan Zhao, Zhenqi Jia, Yongqiang Zhang · arXiv
Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over-rely on the qualit...
Saturday, March 28, 2026
Hao Shi, Yuan Gao, Xugang Lu ... · arXiv
Large Language Models (LLMs) are strong decoders for Serialized Output Training (SOT) in two-talker Automatic Speech Recognition (ASR), yet their performance degrades substantially in challenging conditions such as three-talker mixtures. A key limitation is that current systems i...