Audio ML Papers

Last 7 Days (June 18 - June 25, 2026)

Subcategories: All (30) | Speech Synthesis (2) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Evaluation (2) | Enhancement (4) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (1) | Other (19)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 81)
Huadai Liu, Kaicheng Luo, Wen Wang ... · arXiv
Unifying speech, sound, and music generation in one model is hindered by tradeoffs between fidelity, end-to-end training, in-context conditioning, and variable-length synthesis that no current paradigm fully resolves. To address this challenge, we present AudioCALM, a universal a...
#2 TOP PAPER (Score: 80)
Lianbo Liu, Shiao Zhu, Kai Washizaki ... · arXiv
While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique linguistic challenges, such as widespread context-depende...
#3 TOP PAPER (Score: 78)
Sitong Cheng, Weizhen Bian, Songjun Cao ... · arXiv
Speech-to-speech translation (S2ST) should preserve not only lexical meaning, but also expressive attributes: emotion, scenario style (e.g., news reporting vs. dramatic dialogue), and nonverbal vocalizations (NVs). Moreover, collecting cross-lingual target speech that is both tra...
Wednesday, June 24, 2026
Pengfei Zhang, Hoang H Nguyen, Kazi Shaharair Sharif ... · arXiv
Recent Large Audio Language Models (LALMs) have achieved remarkable progress in audio perceptual tasks across individual acoustic layers, including speech, sound, and music. However, existing benchmarks predominantly evaluate these layers in isolation, overlooking the complex con...
Rotem Rousso, Eyal Cohen, Joseph Keshet · arXiv (preprint)
Recent advances in sequence modeling have significantly improved ASR systems, bringing them close to human-level recognition accuracy and enhancing robustness across diverse acoustic conditions and languages. In contrast, Forced Alignment has not experienced comparable progress, ...
Szu-Wei Fu, Rong Chao, Xuesong Yang ... · arXiv (preprint)
Different real-time speech applications impose distinct latency budgets, often requiring separately trained enhancement models for each scenario. In this paper, we propose a one-for-all, real-time universal speech enhancement model that provides explicit control over both algorit...
Tuesday, June 23, 2026
Wonchul Shin, Inyong Choi, Kyogu Lee · Interspeech 2026
Recent end-to-end models for EEG-guided target speech extraction report impressive results, underscoring potential for neuro-steered hearing technologies. However, our analysis reveals that high within-trial performance can be driven by trial-specific EEG structure that acts as s...
Jisu Jeon, Seungyeon Jwa, Joosung Lee ... · Interspeech 2026
Large Audio-Language Models (LALMs) have been widely used as judge models for the automatic evaluation of generated speech. However, prior approaches predominantly focus on holistic naturalness, leaving fine-grained paralinguistic distinctions underexplored. We introduce ParaPair...
Jaeyong Lee, Masato Mimura, Takafumi Moriya · Interspeech 2026
Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice...
Abinay Reddy Naini, Jaeyeon Kim, Chao-Han Huck Yang ... · arXiv
Large audio-language models (LALMs) can reason about audio, yet it remains unclear whether they can perform comparative judgments between two speech signals along emotional, environmental, linguistic, prosodic, and interpersonal dimensions. We study this question in the context o...
Monday, June 22, 2026
Haoxu Wang, Biao Tian, Weiqing Li ... · Interspeech 2026
Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajecto...
Chun-Wei Chen, Tzu-Quan Lin, Ke-Han Lu ... · Interspeech 2026
Speech Language Models achieve reasoning capabilities, but are often hindered by massive parameter counts and a tendency to prioritize linguistic priors over acoustic features. While contrastive decoding enhances grounding by contrasting audio-aware and text-only logits, it incre...
Huadai Liu, Wen Wang, Kaicheng Luo ... · ICML 2026
Continuous Variational Autoencoders (VAEs) serve as the fundamental continuous tokenizer for modern neural audio generation systems, enabling high-fidelity reconstruction while providing a compact, smooth latent space for downstream generative priors. However, continuous VAEs fac...
Sunday, June 21, 2026
Yichen Xu · arXiv
Generative music systems can now produce impressive audio from text prompts, but audio outputs are difficult to inspect, edit, and diagnose as musical structure. We introduce Libretto, an agent-facing framework for symbolic music generation and revision. Libretto uses an LLM-nati...
Shuubham Ojha, Carol Espy-Wilson · arXiv
Diffusion models show potential for speech enhancement but lack linguistic guidance. We condition a diffusion-based model on wav2vec 2.0 features from noisy input, injected at the U-Net bottleneck via Feature-wise Linear Modulation (FiLM). Phonetic representations from wav2vec 2....
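For readers unfamiliar with FiLM, the conditioning step named in the abstract above amounts to a per-channel scale and shift of the feature map, with the scale and shift predicted from a conditioning vector. The following is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the helper names `linear` and `film`, the toy weights, and the use of a single pooled conditioning vector (standing in for wav2vec 2.0 features) are all hypothetical.

```python
def linear(vec, weight, bias):
    """Hypothetical projection: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def film(features, gamma, beta):
    """Feature-wise Linear Modulation: scale and shift each channel.

    features: list of channels, each a list of values over time.
    gamma, beta: one scale and one shift per channel.
    """
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Toy conditioning vector (assumption: stands in for pooled phonetic features).
cond = [1.0, -1.0]

# Hypothetical projection weights that map the conditioning vector to gamma/beta.
gamma = linear(cond, [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
beta = linear(cond, [[0.5, 0.5], [1.0, 0.0]], [0.0, 0.0])

# Toy bottleneck feature map: 2 channels x 2 time steps.
bottleneck = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(bottleneck, gamma, beta)
```

In a real model the projections would be learned layers and the feature map a tensor; the arithmetic per channel is the same.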
Saturday, June 20, 2026
Dongmei Wang, Xiaohang Sun, Yang Liu ... · Interspeech 2026
We propose AugCodec, a low-bitrate disentangled neural speech codec that leverages data augmentation to decompose speech into three distinct components: semantic, speaker, and prosody tokens. Specifically, we employ tailored augmentation strategies to transform speech into disti...
Byoungjun So, Jaejun Lee, Kyogu Lee · Interspeech 2026
Understanding speaker attributes is crucial for voice-related applications, yet conventional approaches rely on fixed categorical labels, lacking semantic richness and zero-shot generalizability. We propose a novel framework for open-set speaker attribute prediction leveraging La...
Jeongsoo Choi, Ji-Hoon Kim, Shujie Hu ... · arXiv (preprint)
Neural speech codecs efficiently compress speech and have become a foundation for speech generation, but they are typically learned as holistic representations that intertwine linguistic content, speaker identity, and prosody. While this design is effective for zero-shot voice cl...
Masao Someki, Alexander Polok, Carlos Carvalho ... · Interspeech 2026
Recent speech research involves increasingly large datasets, complex models, and diverse experimental workflows. However, existing frameworks require substantial engineering effort to support such experiments. We present ESPnet3, a speech and audio research framework built on a m...
Yaozhong Kang, Jiang Wang, Runwu Shi ... · Interspeech 2026
Neural networks outperform classical GCC-PHAT for Time-Difference-of-Arrival (TDOA) estimation in noise and reverberation, yet their internal strategy remains unexplored. To uncover it, we turn GCC-PHAT's mathematical steps into diagnostic targets, probing hidden layers of three ...
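As background for the abstract above, the classical GCC-PHAT steps are: take the spectra of both channels, form the cross-spectrum, normalize it to unit magnitude (the PHAT weighting), transform back, and pick the peak lag. The sketch below is a standard textbook version in pure Python with a naive DFT so it runs without dependencies; it is not taken from the paper, and the function names and the `eps` stabilizer are illustrative assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def gcc_phat(sig, ref, eps=1e-12):
    """Estimate the delay of `sig` relative to `ref`, in samples."""
    n = len(sig) + len(ref)  # zero-pad so the correlation does not wrap
    X = dft(list(sig) + [0.0] * (n - len(sig)))
    Y = dft(list(ref) + [0.0] * (n - len(ref)))
    # Cross-power spectrum of the two channels.
    cross = [a * b.conjugate() for a, b in zip(X, Y)]
    # PHAT weighting: keep only the phase of each frequency bin.
    phat = [c / (abs(c) + eps) for c in cross]
    # Back to the lag domain; the peak marks the delay.
    cc = [v.real for v in dft(phat, inverse=True)]
    peak = max(range(n), key=lambda i: cc[i])
    # Indices in the upper half correspond to negative lags.
    return peak if peak < n // 2 else peak - n


# Toy example: `sig` is `ref` delayed by exactly 3 samples.
ref = [1.0, -2.0, 3.0, 0.5, -1.0, 0.0, 0.0, 0.0]
sig = [0.0, 0.0, 0.0, 1.0, -2.0, 3.0, 0.5, -1.0]
delay = gcc_phat(sig, ref)
```

Dividing the returned lag by the sample rate gives the TDOA in seconds. A practical implementation would use an FFT and typically interpolate around the peak.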
Friday, June 19, 2026
Ariadna Sanchez, Christoph Minixhofer, Korin Richmond ... · Interspeech 2026
Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and spea...
Xun Gong, Jinchuan Tian, Haoran Wang ... · Interspeech 2026
Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We r...
Hounsu Kim, Juhan Nam · Interspeech 2026
Speaker-decoupled speech codecs can reduce bitrate by separating global speaker attributes from local content and prosody, while supporting voice conversion. Existing speaker-decoupled codecs face a trade-off: methods that explicitly suppress speaker leakage often rely on multi-s...
Tzu-Chieh Wei, Yi-Cheng Lin, Huang-Cheng Chou ... · Interspeech 2026
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal seg...
Jingwen Zhou, Mingzhe Wang · arXiv
Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. ...
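The gap the abstract above describes, between an EER read off an oracle threshold and the error rates obtained at a threshold fixed in advance, can be illustrated with a small pure-Python sketch. The scoring convention (higher score means more likely bona fide; label 1 = bona fide, 0 = spoof), the function names, the toy scores, and the fixed threshold of 0.35 are all assumptions made for illustration, not values from the paper.

```python
def error_rates(scores, labels, threshold):
    """Return (false_accept_rate, false_reject_rate) at a given threshold."""
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    bona = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in spoof) / len(spoof)  # spoof accepted
    frr = sum(s < threshold for s in bona) / len(bona)     # bona fide rejected
    return far, frr


def oracle_eer(scores, labels):
    """Sweep thresholds over the observed scores and return (eer, threshold)
    at the point where FAR and FRR are closest. This needs the test labels,
    which is exactly the oracle a deployed system does not have."""
    best = None
    for t in sorted(set(scores)):
        far, frr = error_rates(scores, labels, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy labeled test set.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

eer, oracle_threshold = oracle_eer(scores, labels)

# A threshold fixed beforehand (hypothetical value) gives different error rates.
fixed_far, fixed_frr = error_rates(scores, labels, 0.35)
```

On this toy data the oracle threshold separates the classes perfectly, while the pre-set threshold accepts a quarter of the spoofed samples, even though the scores themselves are unchanged.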
Thursday, June 18, 2026
SooHwan Eom, Hee Suk Yoon, Eunseop Yoon ... · Interspeech 2026
Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most need...
Yunsik Kim, Yoonyoung Chung · Interspeech 2026
Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter....
Masato Takagi, Masaya Kawamura, Reo Shimizu ... · Interspeech 2026
Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradatio...
Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui ... · Interspeech 2026
Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To ...
Yudong Li, Zihao Fang, Junwen Qiu ... · Interspeech 2026
Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing mode...