Audio ML Papers

Last 7 Days (June 24 - July 01, 2026)

Subcategories: All (24) | Speech Synthesis (0) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (0) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (24)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 81)
Yujie Tu, Yifan Yang, Tianrui Wang ... · arXiv
While modern ASR systems achieve low error rates on high-resource benchmarks, such performance often overestimates real-world robustness. Existing evaluations address challenges in isolation, lacking a unified benchmark for domain terminology, age variation, dialects, accents, an...
#2 TOP PAPER (Score: 80)
Lianbo Liu, Shiao Zhu, Kai Washizaki ... · arXiv
While large language model (LLM)-based text-to-speech (TTS) systems have achieved high-quality speech synthesis, most existing systems focus on English and Chinese. Japanese, however, remains under-explored, and its unique linguistic challenges, such as widespread context-depende...
#3 TOP PAPER (Score: 78)
Sitong Cheng, Weizhen Bian, Songjun Cao ... · arXiv
Speech-to-speech translation (S2ST) should preserve not only lexical meaning, but also expressive attributes: emotion, scenario style (e.g., news reporting vs. dramatic dialogue), and nonverbal vocalizations (NVs). Moreover, collecting cross-lingual target speech that is both tra...
Tuesday, June 30, 2026
Philipp Grundhuber, Emanuël A. P. Habets · Interspeech 2026
Some neural audio codecs disentangle speech into latent subspaces encoding content, speaker identity, and acoustics, enabling acoustic teleportation and voice conversion. Existing evaluations rely on cross-reconstruction quality, which cannot reliably detect leakage across partit...
Chuanbo Zhu, Wuyou Zhou, Rongxiu Zhong ... · arXiv
Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion editing as separate tasks, limiting both editing granula...
Yujun Lee, Joonhyeok Shin, Hyoeun Kim ... · Workshop on Machine Learning for Audio, ICML 2026
Recent music audio-language models achieve high accuracy on instrument question-answering benchmarks, but it remains unclear whether this reflects robust audio grounding or benchmark-specific shortcuts. In this paper, we introduce an OpenMIC-derived diagnostic benchmark sequence ...
Monday, June 29, 2026
Yoonjeong Park, Jaekwon Im, Juhan Nam · Interspeech 2026
Text-based singing voice editing (SVE) aims to revise sung lyrics while preserving the original melody, total duration, and non-edited regions. In this paper, we propose MeloDISinger, a flow-matching-based SVE model for melody-aware and duration-preserving editing. Its core modul...
Qiyang Sun, Yi Chang, Zixing Zhang ... · arXiv (preprint)
Speech conveys rich emotional information. As Speech Emotion Recognition (SER) is usually deployed in privacy-sensitive and reliability-critical environments, adversarial attacks on SER have attracted increasing attention. Existing sparse attacks control the number of perturbed e...
Pranav Tushar, Xiao Xiao Miao, Rong Tong · Interspeech 2026
Voice anonymization aims to protect speaker identity while preserving linguistic content and speech usability. However, most anonymization systems are developed on adult speech, leading to degraded performance when applied to child speech. This paper investigates child-centric an...
Sunday, June 28, 2026
Hoyeol Sohn, Juhan Nam · Interspeech 2026
Variable frame rate (VFR) coding has recently emerged in neural speech codecs, allocating fewer frames to redundant regions and more frames to rapidly changing speech. VFR must transmit side information about retained time steps, but prior gains are either not rigorously addresse...
Qinzhe Hu, Chenda Li, Wangyou Zhang ... · Interspeech 2026
Recent advances in speech separation (SS) have led to compact front-end models with small parameter sizes, yet their high computational cost remains a major barrier for deployment on edge devices. To address this, we propose TF-MoE, a sparse Mixture-of-Experts (MoE) framework tha...
Yichi Wang, Junzhe Chen, Wangjin Zhou ... · arXiv
In long-form multi-party conversations, highly imbalanced speaker activity and frequent overlap make it difficult to identify "who spoke when and what". Sliding-window continuous speech separation (CSS) mitigates sparse supervision, but often suffers from cross-window speaker inc...
Sujin Koo, Sangyoon Kim, Ji Sub Um ... · Interspeech 2026
Noise-robust bandwidth expansion aims to reconstruct high-fidelity wideband speech from noisy low-resolution inputs. While flow matching has shown strong performance in speech generation, accurately recovering clean speech from noisy inputs remains challenging due to the ambiguit...
Piyush Arora, Navlika Singh, Umberto Cappellazzo ... · Interspeech 2026
Audio-Visual Speech Recognition (AVSR) takes two input modalities, acoustic and visual streams, where visual information from lip movements aids recognition when audio is noisy. Recently, LLM-based AVSR models have emerged as a promising paradigm by connecting pre-trained audio-visual e...
Saturday, June 27, 2026
Fengjie Lu, Chenang Jiang, Jiarui Hai ... · arXiv
Recent advances in language-audio retrieval have been largely driven by contrastive dual-encoder architectures that align audio and text in a shared embedding space. While effective, existing retrieval embeddings are primarily optimized for audio-caption matching, limiting thei...
Friday, June 26, 2026
Yiming Sun, Chen Chen, Zifan Zhou ... · arXiv (Preprint)
Phone-use agents can execute complex tasks end to end across real mobile applications. By operating a real device on the user's behalf, they reach far more functionalities than CLI agents, which amplifies the real-world harm they can cause when driven for malicious purposes. We p...
Sihang Nie, Xiaofen Xing, Rui Xing ... · arXiv
Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimiz...
Jonghyeon Park, Olivier Jiyoun Jung, Myungwoo Oh · Interspeech 2026
Early detection of dementia enables timely intervention, and spontaneous speech, which reflects cognitive impairment, offers a non-invasive screening modality. Conventional approaches often focus on a single representational dimension, such as acoustic descriptors, pause modeling, a...
Thursday, June 25, 2026
Zahra Omidi, John H. L. Hansen · ICASSP 2026
Variations in vocal effort (e.g., whisper, soft, neutral, loud, shout) alter speech production and acoustics, reducing intelligibility and limiting the robustness of downstream speech technology. Classification is challenging since effort lies on a continuum, adjacen...
Adhiraj Banerjee, Vipul Arora · Interspeech 2026
Learning discrete speech representations that preserve similarity across variable-length utterances is central to query-by-example spoken term detection (QbE-STD). While wav2tok introduced CTC-based sequence alignment to enforce token consistency, its tightly coupled clustering a...
Xinyu Liang, Fredrik Cumlin, Victor Ungureanu ... · Interspeech 2026
We introduce DNSMOS-C, a compact end-to-end speech quality assessment model that extends the DNSMOS Pro framework by integrating a MOS-guided triplet-based contrastive loss. Applied directly to the intermediate embeddings, this contrastive supervision encourages the latent space ...
Tianxin Xie, Chenxing Li, Dong Yu ... · Interspeech 2026
Recently, zero-shot text-to-speech (TTS) has enabled high-fidelity and expressive speech synthesis, but it often fails to imitate unseen speaking styles from uncommon scenarios (e.g., crosstalk, dialects). Moreover, fine-tuning pretrained models requires large, high-quality datas...
Wednesday, June 24, 2026
Pengfei Zhang, Hoang H Nguyen, Kazi Shaharair Sharif ... · arXiv
Recent Large Audio Language Models (LALMs) have achieved remarkable progress in audio perceptual tasks across individual acoustic layers, including speech, sound, and music. However, existing benchmarks predominantly evaluate these layers in isolation, overlooking the complex con...
Yicheng Gu, Junan Zhang, Jerry Li ... · arXiv
Self-supervised learning (SSL) has emerged as an essential paradigm for music information retrieval (MIR). While current SSL models achieve state-of-the-art performance across various MIR tasks, they typically treat audio as 1D sequences, either operating on time-domain waveforms...