Audio ML Papers

Last 7 Days (June 12 - June 19, 2026)

Subcategories: All (24) | Speech Synthesis (9) | Music Synthesis (1) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (4) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (7)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 88)
Junlong Tong, Wenqi Xu, Yingqi Fan ... · arXiv
Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a con...
#2 TOP PAPER (Score: 84)
Yonghyun Kim, Junwon Lee, Haiwen Xia ... · arXiv
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) vote...
#3 TOP PAPER (Score: 84)
Salman Hussain Ali, Umberto Cappellazzo, Mirco Ravanelli · Interspeech 2026
Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Mea...
Thursday, June 18, 2026
SooHwan Eom, Hee Suk Yoon, Eunseop Yoon ... · Interspeech 2026
Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most need...
Yunsik Kim, Yoonyoung Chung · Interspeech 2026
Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter....
Masato Takagi, Masaya Kawamura, Reo Shimizu ... · Interspeech 2026
Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradatio...
Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui ... · Interspeech 2026
Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To ...
Yudong Li, Zihao Fang, Junwen Qiu ... · Interspeech 2026
Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing mode...
Wednesday, June 17, 2026
Michael Finkelson, Daniel Segal, Eitan Richardson ... · arXiv
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambie...
Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh ... · Interspeech 2026
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot a...
Shuoyi Zhou, Yixuan Zhou, Peiji Yang ... · Interspeech 2026
Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling glo...
Yizhuo Yang, Junqiao Fan, Shenghai Yuan ... · arXiv
Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degr...
Tuesday, June 16, 2026
Alexander Polok, Samuele Cornell, Sathvik Udupa ... · Interspeech 2026
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization ...
Zheqi Dai, Guangyan Zhang, Zhen Ye ... · arXiv
Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional ...
Lichen Bai, Tianhao Zhang, Shitong Shao ... · arXiv
As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models...
Monday, June 15, 2026
Dong Yang, Yuki Saito, Wataru Nakata ... · arXiv
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level ...
Haotian Qi, Gabriel Skantze · arXiv
Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic ...
Alex Gichamba, Moise Busogi · Interspeech 2026
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate d...
Sunday, June 14, 2026
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar ... · IJCAI-ECAI 2026
Codecfakes (CFs) are a type of speech deepfakes generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, caus...
Saturday, June 13, 2026
Liming Wang, Cody Karjadi, Rhoda Au ... · arXiv
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held ...
Zhenwei Mou, Weili Jiang, Liping Chen ... · Interspeech 2026
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking...
Zhenwei Mou, Liping Chen, Yajun Hu ... · Interspeech 2026
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient ...
Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen · arXiv
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phonem...
Friday, June 12, 2026
Hui Geng, Yi Su, Han Yin ... · arXiv
Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quali...