Audio ML Papers

Last 7 Days (June 17 - June 24, 2026)

Subcategories: All (30) | Speech Synthesis (2) | Music Synthesis (1) | Ambient Synthesis (0) | Quality Evaluation (2) | Enhancement (4) | ASR (1) | LLM Audio (1) | MIDI Generation (0) | Generative Conditioning (1) | Other (18)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 81)
Huadai Liu, Kaicheng Luo, Wen Wang ... · arXiv
Unifying speech, sound, and music generation in one model is hindered by tradeoffs between fidelity, end-to-end training, in-context conditioning, and variable-length synthesis that no current paradigm fully resolves. To address this challenge, we present AudioCALM, a universal a...
#2 TOP PAPER (Score: 79)
Michael Finkelson, Daniel Segal, Eitan Richardson ... · arXiv
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambie...
#3 TOP PAPER (Score: 76)
Ariadna Sanchez, Christoph Minixhofer, Korin Richmond ... · Interspeech 2026
Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and spea...
Tuesday, June 23, 2026
Wonchul Shin, Inyong Choi, Kyogu Lee · Interspeech 2026
Recent end-to-end models for EEG-guided target speech extraction report impressive results, underscoring potential for neuro-steered hearing technologies. However, our analysis reveals that high within-trial performance can be driven by trial-specific EEG structure that acts as s...
Jaeyong Lee, Masato Mimura, Takafumi Moriya · Interspeech 2026
Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice...
Abinay Reddy Naini, Jaeyeon Kim, Chao-Han Huck Yang ... · arXiv
Large audio-language models (LALMs) can reason about audio, yet it remains unclear whether they can perform comparative judgments between two speech signals along emotional, environmental, linguistic, prosodic, and interpersonal dimensions. We study this question in the context o...
Baisen Wang, Chenxi Bao, Qisong Han · arXiv
Interactive music and live performance rely on real-time human expression, but modern generative music AI remains largely absent from this domain due to its prohibitive inference latency and offline rendering paradigm. To provide pioneer musicians with a novel medium for intera...
Gabriel Clark, Sofian Mejjoute, Mohamed Osman ... · arXiv
We present ZONOS2 8B, our latest TTS model, which achieves state-of-the-art naturalness, prosody, and voice cloning fidelity. We improve upon Zonos-v0.1 across scale, data, and training recipe. We scale the model from 1.6B to 8B total parameters (900M active) with a novel mixture...
Monday, June 22, 2026
Haoxu Wang, Biao Tian, Weiqing Li ... · Interspeech 2026
Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajecto...
Chun-Wei Chen, Tzu-Quan Lin, Ke-Han Lu ... · Interspeech 2026
Speech Language Models achieve reasoning capabilities, but are often hindered by massive parameter counts and a tendency to prioritize linguistic priors over acoustic features. While contrastive decoding enhances grounding by contrasting audio-aware and text-only logits, it incre...
Huadai Liu, Wen Wang, Kaicheng Luo ... · ICML 2026
Continuous Variational Autoencoders (VAEs) serve as the fundamental continuous tokenizer for modern neural audio generation systems, enabling high-fidelity reconstruction while providing a compact, smooth latent space for downstream generative priors. However, continuous VAEs fac...
Sunday, June 21, 2026
Yichen Xu · arXiv
Generative music systems can now produce impressive audio from text prompts, but audio outputs are difficult to inspect, edit, and diagnose as musical structure. We introduce Libretto, an agent-facing framework for symbolic music generation and revision. Libretto uses an LLM-nati...
Shuubham Ojha, Carol Espy-Wilson · arXiv
Diffusion models show potential for speech enhancement but lack linguistic guidance. We condition a diffusion-based model on wav2vec 2.0 features from noisy input, injected at the U-Net bottleneck via Feature-wise Linear Modulation (FiLM). Phonetic representations from wav2vec 2....
Saturday, June 20, 2026
Dongmei Wang, Xiaohang Sun, Yang Liu ... · Interspeech 2026
We propose AugCodec, a low-bitrate disentangled neural speech codec that leverages data augmentation to decompose speech into three distinct components: semantic, speaker, and prosody tokens. Specifically, we employ tailored augmentation strategies to transform speech into disti...
Byoungjun So, Jaejun Lee, Kyogu Lee · Interspeech 2026
Understanding speaker attributes is crucial for voice-related applications, yet conventional approaches rely on fixed categorical labels, lacking semantic richness and zero-shot generalizability. We propose a novel framework for open-set speaker attribute prediction leveraging La...
Jeongsoo Choi, Ji-Hoon Kim, Shujie Hu ... · arXiv
Neural speech codecs efficiently compress speech and have become a foundation for speech generation, but they are typically learned as holistic representations that intertwine linguistic content, speaker identity, and prosody. While this design is effective for zero-shot voice cl...
Masao Someki, Alexander Polok, Carlos Carvalho ... · Interspeech 2026
Recent speech research involves increasingly large datasets, complex models, and diverse experimental workflows. However, existing frameworks require substantial engineering effort to support such experiments. We present ESPnet3, a speech and audio research framework built on a m...
Yaozhong Kang, Jiang Wang, Runwu Shi ... · Interspeech 2026
Neural networks outperform classical GCC-PHAT for Time-Difference-of-Arrival (TDOA) estimation in noise and reverberation, yet their internal strategy remains unexplored. To uncover it, we turn GCC-PHAT's mathematical steps into diagnostic targets, probing hidden layers of three ...
Friday, June 19, 2026
Xun Gong, Jinchuan Tian, Haoran Wang ... · Interspeech 2026
Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We r...
Hounsu Kim, Juhan Nam · Interspeech 2026
Speaker-decoupled speech codecs can reduce bitrate by separating global speaker attributes from local content and prosody, while supporting voice conversion. Existing speaker-decoupled codecs face a trade-off: methods that explicitly suppress speaker leakage often rely on multi-s...
Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou ... · Interspeech 2026
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal seg...
Jingwen Zhou, Mingzhe Wang · arXiv
Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. ...
Thursday, June 18, 2026
SooHwan Eom, Hee Suk Yoon, Eunseop Yoon ... · Interspeech 2026
Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most need...
Yunsik Kim, Yoonyoung Chung · Interspeech 2026
Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter....
Masato Takagi, Masaya Kawamura, Reo Shimizu ... · Interspeech 2026
Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradatio...
Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui ... · Interspeech 2026
Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To ...
Yudong Li, Zihao Fang, Junwen Qiu ... · Interspeech 2026
Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing mode...
Wednesday, June 17, 2026
Sakshi Joshi, Dhruv Subhash Rathi, Sanskar Singh ... · Interspeech 2026
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot a...
Shuoyi Zhou, Yixuan Zhou, Peiji Yang ... · Interspeech 2026
Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling glo...
Yizhuo Yang, Junqiao Fan, Shenghai Yuan ... · arXiv
Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degr...