Audio ML Papers

Last 7 Days (June 20 - June 27, 2026)

Subcategories: All (25) | Speech Synthesis (0) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (2) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (23)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 81)
Huadai Liu, Kaicheng Luo, Wen Wang ... · arXiv
Unifying speech, sound, and music generation in one model is hindered by tradeoffs between fidelity, end-to-end training, in-context conditioning, and variable-length synthesis that no current paradigm fully resolves. To address this challenge, we present AudioCALM, a universal a...
#2 TOP PAPER (Score: 80)
Lianbo Liu, Shiao Zhu, Kai Washizaki ... · arXiv
While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique linguistic challenges, such as widespread context-depende...
#3 TOP PAPER (Score: 78)
Sitong Cheng, Weizhen Bian, Songjun Cao ... · arXiv
Speech-to-speech translation (S2ST) should preserve not only lexical meaning, but also expressive attributes: emotion, scenario style (e.g., news reporting vs. dramatic dialogue), and nonverbal vocalizations (NVs). Moreover, collecting cross-lingual target speech that is both tra...
Friday, June 26, 2026
Yiming Sun, Chen Chen, Zifan Zhou ... · arXiv (preprint)
Phone-use Agents can execute complex tasks end to end across real mobile applications. By operating a real device on the user's behalf, they reach far more functionalities than CLI agents, which amplifies the real-world harm they can cause when driven for malicious purposes. We p...
Sihang Nie, Xiaofen Xing, Rui Xing ... · arXiv
Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimiz...
Thursday, June 25, 2026
Zahra Omidi, John H. L. Hansen · ICASSP 2026
The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is challenging since effort lies on a continuum, adjacen...
Adhiraj Banerjee, Vipul Arora · Interspeech 2026
Learning discrete speech representations that preserve similarity across variable-length utterances is central to query-by-example spoken term detection (QbE-STD). While wav2tok introduced CTC-based sequence alignment to enforce token consistency, its tightly coupled clustering a...
Xinyu Liang, Fredrik Cumlin, Victor Ungureanu ... · Interspeech 2026
We introduce DNSMOS-C, a compact end-to-end speech quality assessment model that extends the DNSMOS Pro framework by integrating a MOS-guided triplet-based contrastive loss. Applied directly to the intermediate embeddings, this contrastive supervision encourages the latent space ...
Tianxin Xie, Chenxing Li, Dong Yu ... · Interspeech 2026
Recently, zero-shot text-to-speech (TTS) has enabled high-fidelity and expressive speech synthesis, but it often fails to imitate unseen speaking styles from uncommon scenarios (e.g., crosstalk, dialects). Moreover, fine-tuning pretrained models requires large, high-quality datas...
Wednesday, June 24, 2026
Pengfei Zhang, Hoang H Nguyen, Kazi Shaharair Sharif ... · arXiv
Recent Large Audio Language Models (LALMs) have achieved remarkable progress in audio perceptual tasks across individual acoustic layers, including speech, sound, and music. However, existing benchmarks predominantly evaluate these layers in isolation, overlooking the complex con...
Yicheng Gu, Junan Zhang, Jerry Li ... · arXiv
Self-supervised learning (SSL) has emerged as an essential paradigm for music information retrieval (MIR). While current SSL models achieve state-of-the-art performance across various MIR tasks, they typically treat audio as 1D sequences, either operating on time-domain waveforms...
Tuesday, June 23, 2026
Wonchul Shin, Inyong Choi, Kyogu Lee · Interspeech 2026
Recent end-to-end models for EEG-guided target speech extraction report impressive results, underscoring potential for neuro-steered hearing technologies. However, our analysis reveals that high within-trial performance can be driven by trial-specific EEG structure that acts as s...
Jisu Jeon, Seungyeon Jwa, Joosung Lee ... · Interspeech 2026
Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPair...
Jaeyong Lee, Masato Mimura, Takafumi Moriya · Interspeech 2026
Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice...
Abinay Reddy Naini, Jaeyeon Kim, Chao-Han Huck Yang ... · arXiv
Large audio-language models (LALMs) can reason about audio, yet it remains unclear whether they can perform comparative judgments between two speech signals along emotional, environmental, linguistic, prosodic, and interpersonal dimensions. We study this question in the context o...
Monday, June 22, 2026
Haoxu Wang, Biao Tian, Weiqing Li ... · Interspeech 2026
Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajecto...
Chun-Wei Chen, Tzu-Quan Lin, Ke-Han Lu ... · Interspeech 2026
Speech Language Models achieve reasoning capabilities, but are often hindered by massive parameter counts and a tendency to prioritize linguistic priors over acoustic features. While contrastive decoding enhances grounding by contrasting audio-aware and text-only logits, it incre...
Huadai Liu, Wen Wang, Kaicheng Luo ... · ICML 2026
Continuous Variational Autoencoders (VAEs) serve as the fundamental continuous tokenizer for modern neural audio generation systems, enabling high-fidelity reconstruction while providing a compact, smooth latent space for downstream generative priors. However, continuous VAEs fac...
Sunday, June 21, 2026
Yichen Xu · arXiv
Generative music systems can now produce impressive audio from text prompts, but audio outputs are difficult to inspect, edit, and diagnose as musical structure. We introduce Libretto, an agent-facing framework for symbolic music generation and revision. Libretto uses an LLM-nati...
Shuubham Ojha, Carol Espy-Wilson · arXiv
Diffusion models show potential for speech enhancement but lack linguistic guidance. We condition a diffusion-based model on wav2vec 2.0 features from noisy input, injected at the U-Net bottleneck via Feature-wise Linear Modulation (FiLM). Phonetic representations from wav2vec 2....
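For readers unfamiliar with the FiLM conditioning named in the entry above, here is a minimal NumPy sketch of the general technique: a conditioning vector is projected to a per-channel scale (gamma) and shift (beta), which then modulate a feature map. This is an illustration only, not the authors' implementation. The function name `film`, the projection weights, and the choice to pool the conditioning features into a single vector are all assumptions made for the example.

```python
import numpy as np

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply Feature-wise Linear Modulation to a feature map.

    features: array of shape (channels, frames), e.g. a bottleneck activation.
    cond:     array of shape (cond_dim,), a pooled conditioning vector
              (assumed here; per-frame conditioning is also possible).
    w_*, b_*: hypothetical linear-projection parameters mapping
              cond_dim -> channels.
    """
    gamma = w_gamma @ cond + b_gamma  # per-channel scale, shape (channels,)
    beta = w_beta @ cond + b_beta     # per-channel shift, shape (channels,)
    # Broadcast the scale and shift across the time axis.
    return gamma[:, None] * features + beta[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    channels, frames, cond_dim = 4, 10, 8
    x = rng.standard_normal((channels, frames))
    c = rng.standard_normal(cond_dim)
    y = film(
        x, c,
        rng.standard_normal((channels, cond_dim)), np.ones(channels),
        rng.standard_normal((channels, cond_dim)), np.zeros(channels),
    )
    print(y.shape)
```

With zero projection weights, a scale bias of one, and a shift bias of zero, the layer reduces to the identity, which is a common way to initialize it so that conditioning starts out as a no-op.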
Saturday, June 20, 2026
Dongmei Wang, Xiaohang Sun, Yang Liu ... · Interspeech 2026
We propose AugCodec, a low-bitrate disentangled neural speech codec that leverages data augmentation to decompose speech into three distinct components: semantic, speaker, and prosody tokens. Specifically, we employ tailored augmentation strategies to transform speech into disti...
Byoungjun So, Jaejun Lee, Kyogu Lee · Interspeech 2026
Understanding speaker attributes is crucial for voice-related applications, yet conventional approaches rely on fixed categorical labels, lacking semantic richness and zero-shot generalizability. We propose a novel framework for open-set speaker attribute prediction leveraging La...
Jeongsoo Choi, Ji-Hoon Kim, Shujie Hu ... · arXiv (preprint)
Neural speech codecs efficiently compress speech and have become a foundation for speech generation, but they are typically learned as holistic representations that intertwine linguistic content, speaker identity, and prosody. While this design is effective for zero-shot voice cl...
Masao Someki, Alexander Polok, Carlos Carvalho ... · Interspeech 2026
Recent speech research involves increasingly large datasets, complex models, and diverse experimental workflows. However, existing frameworks require substantial engineering effort to support such experiments. We present ESPnet3, a speech and audio research framework built on a m...
Yaozhong Kang, Jiang Wang, Runwu Shi ... · Interspeech 2026
Neural networks outperform classical GCC-PHAT for Time-Difference-of-Arrival (TDOA) estimation in noise and reverberation, yet their internal strategy remains unexplored. To uncover it, we turn GCC-PHAT's mathematical steps into diagnostic targets, probing hidden layers of three ...
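As background for the entry above, the classical GCC-PHAT baseline can be sketched in a few lines of NumPy: compute the cross-spectrum of the two channels, normalize it to unit magnitude (the PHAT weighting), transform back to the lag domain, and take the peak. This is a generic textbook sketch, not the probing setup from the paper. The function name `gcc_phat`, the zero-padding choice, and the small stabilizing constant are assumptions for illustration.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the delay of `sig` relative to `ref`, in seconds.

    A positive result means `sig` lags `ref`.
    """
    n = len(sig) + len(ref)            # zero-pad to avoid circular wrap-around
    sig_f = np.fft.rfft(sig, n=n)
    ref_f = np.fft.rfft(ref, n=n)
    cross = sig_f * np.conj(ref_f)     # cross-power spectrum
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)      # generalized cross-correlation
    max_shift = n // 2
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / fs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    ref = rng.standard_normal(1024)
    # Delay the reference by 5 samples to simulate a second microphone.
    sig = np.concatenate((np.zeros(5), ref[:-5]))
    print(gcc_phat(sig, ref, fs) * fs)  # estimated delay in samples
```

The PHAT normalization discards magnitude information, so the correlation peak is sharp for broadband signals, but the estimate is quantized to whole samples unless the peak is interpolated.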