Audio ML Papers

Week of June 14 - June 21, 2026

Subcategories: All (28) | Speech Synthesis (3) | Music Synthesis (2) | Ambient Synthesis (0) | Quality Evaluation (2) | Enhancement (5) | ASR (2) | LLM Audio (1) | MIDI Generation (0) | Generative Conditioning (1) | Other (12)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 84)
Yonghyun Kim, Junwon Lee, Haiwen Xia ... · arXiv
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) vote...
#2 TOP PAPER (Score: 84)
Salman Hussain Ali, Umberto Cappellazzo, Mirco Ravanelli · Interspeech 2026
Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Mea...
#3 TOP PAPER (Score: 83)
Alex Gichamba, Moise Busogi · Interspeech 2026
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate d...
Saturday, June 20, 2026
Dongmei Wang, Xiaohang Sun, Yang Liu ... · Interspeech 2026
We propose AugCodec, a low-bitrate disentangled neural speech codec that leverages data augmentation to decompose speech into three distinct components: semantic, speaker, and prosody tokens. Specifically, we employ tailored augmentation strategies to transform speech into disti...
Byoungjun So, Jaejun Lee, Kyogu Lee · Interspeech 2026
Understanding speaker attributes is crucial for voice-related applications, yet conventional approaches rely on fixed categorical labels, lacking semantic richness and zero-shot generalizability. We propose a novel framework for open-set speaker attribute prediction leveraging La...
Jeongsoo Choi, Ji-Hoon Kim, Shujie Hu ... · arXiv (preprint)
Neural speech codecs efficiently compress speech and have become a foundation for speech generation, but they are typically learned as holistic representations that intertwine linguistic content, speaker identity, and prosody. While this design is effective for zero-shot voice cl...
Masao Someki, Alexander Polok, Carlos Carvalho ... · Interspeech 2026
Recent speech research involves increasingly large datasets, complex models, and diverse experimental workflows. However, existing frameworks require substantial engineering effort to support such experiments. We present ESPnet3, a speech and audio research framework built on a m...
Yaozhong Kang, Jiang Wang, Runwu Shi ... · Interspeech 2026
Neural networks outperform classical GCC-PHAT for Time-Difference-of-Arrival (TDOA) estimation in noise and reverberation, yet their internal strategy remains unexplored. To uncover it, we turn GCC-PHAT's mathematical steps into diagnostic targets, probing hidden layers of three ...
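For readers unfamiliar with the classical baseline named in this abstract, GCC-PHAT can be sketched in a few lines. This is a generic textbook formulation, not code from the paper: the function name `gcc_phat`, the 16 kHz sample rate, and the synthetic white-noise test signal are all assumptions made for illustration.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref` in seconds (illustrative sketch)."""
    n = sig.shape[0] + ref.shape[0]           # zero-pad to avoid circular wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)                    # cross-power spectrum
    R /= np.abs(R) + 1e-12                    # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(R, n=n)                 # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that index `max_shift` corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(cc) - max_shift         # lag (in samples) of the correlation peak
    return shift / fs

# Synthetic check: a white-noise reference and a copy delayed by a known number of samples.
fs = 16000                                    # assumed sample rate
delay = 5                                     # assumed true delay in samples
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(delay), ref[:-delay]))
tau = gcc_phat(sig, ref, fs)
print("estimated delay (samples):", round(tau * fs))
```

The steps in this sketch (cross-power spectrum, phase-only normalization, inverse transform, peak picking) are the standard stages of GCC-PHAT; which of them the paper actually uses as diagnostic targets is not stated in the truncated abstract.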
Friday, June 19, 2026
Ariadna Sanchez, Christoph Minixhofer, Korin Richmond ... · Interspeech 2026
Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and spea...
Xun Gong, Jinchuan Tian, Haoran Wang ... · Interspeech 2026
Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We r...
Hounsu Kim, Juhan Nam · Interspeech 2026
Speaker-decoupled speech codecs can reduce bitrate by separating global speaker attributes from local content and prosody, while supporting voice conversion. Existing speaker-decoupled codecs face a trade-off: methods that explicitly suppress speaker leakage often rely on multi-s...
Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou ... · Interspeech 2026
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal seg...
Jingwen Zhou, Mingzhe Wang · arXiv
Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. ...
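The distinction this abstract draws between an oracle-threshold EER and a threshold fixed in advance can be illustrated with a small sketch. It is not the authors' evaluation protocol: the helper names `far_frr` and `oracle_eer`, the label convention (1 = bona fide, 0 = spoof, accept when the score is at or above the threshold), and the synthetic Gaussian scores are all assumptions.

```python
import numpy as np

def far_frr(scores, labels, threshold):
    """False-acceptance and false-rejection rates at one threshold.

    Assumed convention: label 1 = bona fide, label 0 = spoof; accept if score >= threshold.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    accept = scores >= threshold
    far = np.mean(accept[labels == 0])        # fraction of spoofs accepted
    frr = np.mean(~accept[labels == 1])       # fraction of bona fide trials rejected
    return far, frr

def oracle_eer(scores, labels):
    """EER approximated at the observed score where |FAR - FRR| is smallest.

    The threshold is chosen using the labels of the same set, which is the
    'oracle' choice the abstract refers to.
    """
    thresholds = np.unique(scores)
    best = min(thresholds, key=lambda t: abs(np.subtract(*far_frr(scores, labels, t))))
    far, frr = far_frr(scores, labels, best)
    return (far + frr) / 2, best

# Hypothetical scores: a labeled development set and a shifted target set.
rng = np.random.default_rng(0)
labels = np.concatenate((np.ones(500, dtype=int), np.zeros(500, dtype=int)))
dev = np.concatenate((rng.normal(2.0, 1.0, 500), rng.normal(-2.0, 1.0, 500)))
target = dev + 1.5                            # assumed domain shift in the score scale

_, dev_threshold = oracle_eer(dev, labels)    # threshold fixed in advance on dev data
target_eer, _ = oracle_eer(target, labels)    # oracle threshold picked on the target labels
far, frr = far_frr(target, labels, dev_threshold)
print("target oracle EER:", target_eer)
print("target FAR/FRR at the fixed dev threshold:", far, frr)
```

In a setup like this, the oracle EER on the target set can look unchanged while the error rates at the pre-fixed threshold move apart, which is the gap between reported and deployed performance that the abstract describes.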
Thursday, June 18, 2026
SooHwan Eom, Hee Suk Yoon, Eunseop Yoon ... · Interspeech 2026
Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most need...
Yunsik Kim, Yoonyoung Chung · Interspeech 2026
Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter....
Masato Takagi, Masaya Kawamura, Reo Shimizu ... · Interspeech 2026
Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradatio...
Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui ... · Interspeech 2026
Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To ...
Yudong Li, Zihao Fang, Junwen Qiu ... · Interspeech 2026
Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing mode...
Wednesday, June 17, 2026
Michael Finkelson, Daniel Segal, Eitan Richardson ... · arXiv
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambie...
Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh ... · Interspeech 2026
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot a...
Shuoyi Zhou, Yixuan Zhou, Peiji Yang ... · Interspeech 2026
Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling glo...
Yizhuo Yang, Junqiao Fan, Shenghai Yuan ... · arXiv
Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degr...
Tuesday, June 16, 2026
Alexander Polok, Samuele Cornell, Sathvik Udupa ... · Interspeech 2026
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization ...
Zheqi Dai, Guangyan Zhang, Zhen Ye ... · arXiv
Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional ...
Lichen Bai, Tianhao Zhang, Shitong Shao ... · arXiv
As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models...
Monday, June 15, 2026
Dong Yang, Yuki Saito, Wataru Nakata ... · arXiv
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level ...
Haotian Qi, Gabriel Skantze · arXiv
Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic ...
Sunday, June 14, 2026
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar ... · IJCAI-ECAI 2026
Codecfakes (CFs) are a type of speech deepfakes generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, caus...