Audio ML Papers

Last 7 Days (June 11 - June 18, 2026)

Subcategories: All (21) | Speech Synthesis (6) | Music Synthesis (2) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (1) | ASR (4) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (8)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 88)
Junlong Tong, Wenqi Xu, Yingqi Fan ... · arXiv
Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a con...
#2 TOP PAPER (Score: 84)
Yonghyun Kim, Junwon Lee, Haiwen Xia ... · arXiv
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) vote...
#3 TOP PAPER (Score: 84)
Salman Hussain Ali, Umberto Cappellazzo, Mirco Ravanelli · Interspeech 2026
Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Mea...
Wednesday, June 17, 2026
Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh ... · Interspeech 2026
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot a...
Jaehyuk Jang, Kangwook Ko, Wonjun Lee ... · arXiv
Few-shot adaptation of pretrained Audio-Language Models (ALMs) often improves seen-class performance at the cost of unseen-class generalization, leading to the base-to-new trade-off. We attribute this failure to zero-shot drift in the text embedding space: few-shot tuning can di...
Yizhuo Yang, Junqiao Fan, Shenghai Yuan ... · arXiv
Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degr...
Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas ... · ICML 2026 Workshop on Machine Learning for Audio (Learning to Listen)
Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of th...
Hyebin Cho, Suho Yoo, Jaehyuk Jang ... · arXiv
While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio...
Tuesday, June 16, 2026
Alexander Polok, Samuele Cornell, Sathvik Udupa ... · Interspeech 2026
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization ...
Zheqi Dai, Guangyan Zhang, Zhen Ye ... · arXiv
Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional ...
Monday, June 15, 2026
Dong Yang, Yuki Saito, Wataru Nakata ... · arXiv
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level ...
Haotian Qi, Gabriel Skantze · arXiv
Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic ...
Alex Gichamba, Moise Busogi · Interspeech 2026
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate d...
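As a rough illustration of the linear scaling mentioned in the abstract above, the sketch below counts autoregressive decoding steps for a clip at a few frame rates. It assumes one token per frame by default (a single codebook), which is a simplification and not taken from the paper. The function name `decoding_steps`, the `tokens_per_frame` parameter, and the 50 Hz and 25 Hz comparison rates are hypothetical choices made for illustration only.

```python
def decoding_steps(duration_s: float, frame_rate_hz: float, tokens_per_frame: int = 1) -> int:
    """Number of tokens an autoregressive model would generate for a clip.

    Assumes cost is proportional to sequence length, i.e. to
    duration * frame rate * tokens per frame (hypothetical simplification).
    """
    return round(duration_s * frame_rate_hz * tokens_per_frame)


if __name__ == "__main__":
    # Compare a few frame rates for the same 10-second clip.
    for rate in (50.0, 25.0, 12.5):
        steps = decoding_steps(10.0, rate)
        print(f"{rate:>5} Hz: {steps} steps for a 10 s clip")
```

Under these assumptions, halving the frame rate halves the number of generation steps, which is why low-frame-rate codecs are attractive for autoregressive synthesis.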
Sunday, June 14, 2026
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar ... · IJCAI-ECAI 2026
Codecfakes (CFs) are a type of speech deepfake generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, caus...
Saturday, June 13, 2026
Liming Wang, Cody Karjadi, Rhoda Au ... · arXiv
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held ...
Zhenwei Mou, Weili Jiang, Liping Chen ... · Interspeech 2026
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking...
Zhenwei Mou, Liping Chen, Yajun Hu ... · Interspeech 2026
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient ...
Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen · arXiv
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phonem...
Friday, June 12, 2026
Hui Geng, Yi Su, Han Yin ... · arXiv
Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quali...
Thursday, June 11, 2026
Soumyajit Mitra, Prabhat Pandey, Abhinav Jain ... · Interspeech 2026
Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explic...
Sathvik Udupa, Shinji Watanabe, Petr Schwarz ... · Interspeech 2026
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based...