Audio ML Papers

Last 7 Days (June 19 - June 26, 2026)

Subcategories: All (27) | Speech Synthesis (1) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (3) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (23)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 81)
Huadai Liu, Kaicheng Luo, Wen Wang ... · arXiv
Unifying speech, sound, and music generation in one model is hindered by tradeoffs between fidelity, end-to-end training, in-context conditioning, and variable-length synthesis that no current paradigm fully resolves. To address this challenge, we present AudioCALM, a universal a...
#2 TOP PAPER (Score: 80)
Lianbo Liu, Shiao Zhu, Kai Washizaki ... · arXiv
While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique linguistic challenges, such as widespread context-depende...
#3 TOP PAPER (Score: 78)
Sitong Cheng, Weizhen Bian, Songjun Cao ... · arXiv
Speech-to-speech translation (S2ST) should preserve not only lexical meaning, but also expressive attributes: emotion, scenario style (e.g., news reporting vs. dramatic dialogue), and nonverbal vocalizations (NVs). Moreover, collecting cross-lingual target speech that is both tra...
Thursday, June 25, 2026
Adhiraj Banerjee, Vipul Arora · Interspeech 2026
Learning discrete speech representations that preserve similarity across variable-length utterances is central to query-by-example spoken term detection (QbE-STD). While wav2tok introduced CTC-based sequence alignment to enforce token consistency, its tightly coupled clustering a...
Xinyu Liang, Fredrik Cumlin, Victor Ungureanu ... · Interspeech 2026
We introduce DNSMOS-C, a compact end-to-end speech quality assessment model that extends the DNSMOS Pro framework by integrating a MOS-guided triplet-based contrastive loss. Applied directly to the intermediate embeddings, this contrastive supervision encourages the latent space ...
Tianxin Xie, Chenxing Li, Dong Yu ... · Interspeech 2026
Recently, zero-shot text-to-speech (TTS) has enabled high-fidelity and expressive speech synthesis, but it often fails to imitate unseen speaking styles from uncommon scenarios (e.g., crosstalk, dialects). Moreover, fine-tuning pretrained models requires large, high-quality datas...
Wednesday, June 24, 2026
Pengfei Zhang, Hoang H Nguyen, Kazi Shaharair Sharif ... · arXiv
Recent Large Audio Language Models (LALMs) have achieved remarkable progress in audio perceptual tasks across individual acoustic layers, including speech, sound, and music. However, existing benchmarks predominantly evaluate these layers in isolation, overlooking the complex con...
Yicheng Gu, Junan Zhang, Jerry Li ... · arXiv
Self-supervised learning (SSL) has emerged as an essential paradigm for music information retrieval (MIR). While current SSL models achieve state-of-the-art performance across various MIR tasks, they typically treat audio as 1D sequences, either operating on time-domain waveforms...
Tuesday, June 23, 2026
Wonchul Shin, Inyong Choi, Kyogu Lee · Interspeech 2026
Recent end-to-end models for EEG-guided target speech extraction report impressive results, underscoring potential for neuro-steered hearing technologies. However, our analysis reveals that high within-trial performance can be driven by trial-specific EEG structure that acts as s...
Jisu Jeon, Seungyeon Jwa, Joosung Lee ... · Interspeech 2026
Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPair...
Jaeyong Lee, Masato Mimura, Takafumi Moriya · Interspeech 2026
Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice...
Abinay Reddy Naini, Jaeyeon Kim, Chao-Han Huck Yang ... · arXiv
Large audio-language models (LALMs) can reason about audio, yet it remains unclear whether they can perform comparative judgments between two speech signals along emotional, environmental, linguistic, prosodic, and interpersonal dimensions. We study this question in the context o...
Monday, June 22, 2026
Haoxu Wang, Biao Tian, Weiqing Li ... · Interspeech 2026
Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajecto...
Chun-Wei Chen, Tzu-Quan Lin, Ke-Han Lu ... · Interspeech 2026
Speech Language Models achieve reasoning capabilities, but are often hindered by massive parameter counts and a tendency to prioritize linguistic priors over acoustic features. While contrastive decoding enhances grounding by contrasting audio-aware and text-only logits, it incre...
Huadai Liu, Wen Wang, Kaicheng Luo ... · ICML 2026
Continuous Variational Autoencoders (VAEs) serve as the fundamental continuous tokenizer for modern neural audio generation systems, enabling high-fidelity reconstruction while providing a compact, smooth latent space for downstream generative priors. However, continuous VAEs fac...
Sunday, June 21, 2026
Yichen Xu · arXiv
Generative music systems can now produce impressive audio from text prompts, but audio outputs are difficult to inspect, edit, and diagnose as musical structure. We introduce Libretto, an agent-facing framework for symbolic music generation and revision. Libretto uses an LLM-nati...
Shuubham Ojha, Carol Espy-Wilson · arXiv
Diffusion models show potential for speech enhancement but lack linguistic guidance. We condition a diffusion-based model on wav2vec 2.0 features from noisy input, injected at the U-Net bottleneck via Feature-wise Linear Modulation (FiLM). Phonetic representations from wav2vec 2....
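The FiLM conditioning mentioned in the abstract above can be illustrated with a minimal sketch. It shows only the generic operation (a per-channel scale and shift predicted from a conditioning vector), not the paper's architecture. The function names, the single pooled conditioning vector, and the toy weights are all assumptions made for illustration; a real system would use learned layers and per-frame wav2vec 2.0 features.

```python
def linear(weights, bias, vec):
    """Plain affine map: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise Linear Modulation.

    features: list of channels, each a list of values over time.
    cond:     conditioning vector (hypothetical pooled embedding).
    Each channel c is transformed as gamma[c] * x + beta[c], where
    gamma and beta are predicted from the conditioning vector.
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy bottleneck activations: 2 channels, 2 time steps.
features = [[1.0, 2.0], [3.0, 4.0]]
cond = [1.0, 0.0]

# Hand-picked weights (assumed values) giving gamma = [2, 1], beta = [1, -1].
w_gamma, b_gamma = [[2.0, 0.0], [0.0, 1.0]], [0.0, 1.0]
w_beta, b_beta = [[1.0, 0.0], [0.0, 0.0]], [0.0, -1.0]

modulated = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
print(modulated)  # [[3.0, 5.0], [2.0, 3.0]]
```

Because the modulation is only a scale and a shift per channel, it adds few parameters, which is one common reason to prefer it over concatenating the conditioning features.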
Saturday, June 20, 2026
Dongmei Wang, Xiaohang Sun, Yang Liu ... · Interspeech 2026
We propose AugCodec, a low-bitrate disentangled neural speech codec that leverages data augmentation to decompose speech into three distinct components: semantic, speaker, and prosody tokens. Specifically, we employ tailored augmentation strategies to transform speech into disti...
Byoungjun So, Jaejun Lee, Kyogu Lee · Interspeech 2026
Understanding speaker attributes is crucial for voice-related applications, yet conventional approaches rely on fixed categorical labels, lacking semantic richness and zero-shot generalizability. We propose a novel framework for open-set speaker attribute prediction leveraging La...
Jeongsoo Choi, Ji-Hoon Kim, Shujie Hu ... · arXiv (preprint)
Neural speech codecs efficiently compress speech and have become a foundation for speech generation, but they are typically learned as holistic representations that intertwine linguistic content, speaker identity, and prosody. While this design is effective for zero-shot voice cl...
Masao Someki, Alexander Polok, Carlos Carvalho ... · Interspeech 2026
Recent speech research involves increasingly large datasets, complex models, and diverse experimental workflows. However, existing frameworks require substantial engineering effort to support such experiments. We present ESPnet3, a speech and audio research framework built on a m...
Yaozhong Kang, Jiang Wang, Runwu Shi ... · Interspeech 2026
Neural networks outperform classical GCC-PHAT for Time-Difference-of-Arrival (TDOA) estimation in noise and reverberation, yet their internal strategy remains unexplored. To uncover it, we turn GCC-PHAT's mathematical steps into diagnostic targets, probing hidden layers of three ...
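For readers unfamiliar with the classical baseline named in the abstract above, the mathematical steps of GCC-PHAT can be sketched as follows: transform both channels, form the cross-power spectrum, discard its magnitude (the PHAT weighting), transform back, and pick the peak lag. This is a textbook-style sketch, not the paper's code. The naive DFT, the function names, and the synthetic noise signal are assumptions chosen to keep the example dependency-free.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, eps=1e-12):
    """Delay of `sig` relative to `ref` in samples (positive: sig lags ref)."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation cannot wrap
    a = dft(list(sig) + [0.0] * (n - len(sig)))
    b = dft(list(ref) + [0.0] * (n - len(ref)))
    # Step 1: cross-power spectrum.
    cross = [x * y.conjugate() for x, y in zip(a, b)]
    # Step 2: PHAT weighting keeps only the phase of each bin.
    whitened = [c / (abs(c) + eps) for c in cross]
    # Step 3: back to the lag domain.
    cc = [v.real for v in dft(whitened, inverse=True)]
    # Step 4: the peak lag is the delay estimate; upper half = negative lags.
    peak = max(range(n), key=lambda i: cc[i])
    return peak if peak < n // 2 else peak - n


rng = random.Random(0)
ref = [rng.uniform(-1.0, 1.0) for _ in range(32)]  # synthetic noise source
delayed = [0.0] * 5 + ref                          # same signal, 5 samples later

print(gcc_phat(delayed, ref))  # 5
print(gcc_phat(ref, delayed))  # -5
```

Dividing the delay in samples by the sampling rate gives the TDOA in seconds.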
Friday, June 19, 2026
Ariadna Sanchez, Christoph Minixhofer, Korin Richmond ... · Interspeech 2026
Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and spea...
Xun Gong, Jinchuan Tian, Haoran Wang ... · Interspeech 2026
Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We r...
Hounsu Kim, Juhan Nam · Interspeech 2026
Speaker-decoupled speech codecs can reduce bitrate by separating global speaker attributes from local content and prosody, while supporting voice conversion. Existing speaker-decoupled codecs face a trade-off: methods that explicitly suppress speaker leakage often rely on multi-s...
Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou ... · Interspeech 2026
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal seg...
Jingwen Zhou, Mingzhe Wang · arXiv
Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. ...
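The gap between an oracle-threshold EER and a threshold fixed in advance can be made concrete with a small sketch. It is not the paper's evaluation protocol: the scores are invented toy values, the function names are hypothetical, and the convention that higher scores mean "more likely bona fide" is an assumption.

```python
def error_rates(bonafide, spoof, threshold):
    """Return (FAR, FRR) when scores >= threshold are accepted as bona fide."""
    far = sum(s >= threshold for s in spoof) / len(spoof)       # spoof accepted
    frr = sum(s < threshold for s in bonafide) / len(bonafide)  # bona fide rejected
    return far, frr


def oracle_eer(bonafide, spoof):
    """Scan thresholds taken from the labeled scores themselves and return
    (eer, threshold) at the point where FAR and FRR are closest."""
    best = None
    for t in sorted(set(bonafide) | set(spoof)):
        far, frr = error_rates(bonafide, spoof, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy labeled development scores (invented for illustration).
dev_bona = [0.9, 0.8, 0.7, 0.6]
dev_spoof = [0.1, 0.2, 0.3, 0.65]

# Toy target-domain scores whose distribution has shifted downward.
tgt_bona = [0.5, 0.6, 0.7, 0.8]
tgt_spoof = [0.1, 0.2, 0.3, 0.4]

dev_eer, dev_threshold = oracle_eer(dev_bona, dev_spoof)
print(dev_eer, dev_threshold)                              # 0.25 0.65

# The oracle EER on the target looks perfect ...
print(oracle_eer(tgt_bona, tgt_spoof)[0])                  # 0.0
# ... but the threshold fixed on development data rejects half the bona fide speech.
print(error_rates(tgt_bona, tgt_spoof, dev_threshold))     # (0.0, 0.5)
```

In this toy case the target-domain EER is 0, yet the deployed operating point has a 50% false rejection rate, which is the kind of discrepancy that EER alone cannot reveal.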