Audio ML Papers

Last 7 Days (June 12 - June 19, 2026)

Subcategories: All (7) | Speech Synthesis (1) | Music Synthesis (2) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (1) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (2)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Yan Han, Zhibin Wen, Yuan Wang ... · arXiv
The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sof...
#2 TOP PAPER (Score: 83)
Chengxi Deng, Xurong Xie, Shujie Hu ... · Interspeech 2026
This paper proposes a novel cross-utterance audio-textual prompts based speaker adaptation approach for elderly speech recognition. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current and a few preceding utteran...
#3 TOP PAPER (Score: 83)
Haixin Zhao, Nilesh Madhu · Interspeech 2026
This work investigates modelling strategies in continuous and discrete latent spaces in the vector quantisation (VQ)-based neural audio codec (NAC) speech enhancement (SE), along with the role of VQ regularisation. We propose cNAC-SE and dNAC-SE frameworks that predict continuous...
Monday, June 15, 2026
Yan Han, Zhibin Wen, Yuan Wang ... · arXiv (listed above as #1 Top Paper)
Chengxi Deng, Xurong Xie, Shujie Hu ... · Interspeech 2026 (listed above as #2 Top Paper)
Haixin Zhao, Nilesh Madhu · Interspeech 2026 (listed above as #3 Top Paper)
Haocheng Dong, Yuheng Lu, Cheng Gong ... · arXiv
With the growing focus on audio in multimedia applications, numerous advanced works on audio generation have emerged. Existing studies typically treat text-to-audio (TTA) and other related audio generation tasks, such as instruction-based audio editing, as independent challenges,...
Yonghyun Kim, Junwon Lee, Haiwen Xia ... · arXiv
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) vote...
Sunday, June 14, 2026
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar ... · IJCAI-ECAI 2026
Codecfakes (CFs) are a type of speech deepfakes generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, caus...
Jialong Mai, Jinxin Ji, Xiaofen Xing ... · arXiv
Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears ...