Audio ML Papers

Last 7 Days (June 25 - July 02, 2026)

Subcategories: All (22) | Speech Synthesis (0) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (1) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (21)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 81)
Yujie Tu, Yifan Yang, Tianrui Wang ... · arXiv
While modern ASR systems achieve low error rates on high-resource benchmarks, such performance often overestimates real-world robustness. Existing evaluations address challenges in isolation, lacking a unified benchmark for domain terminology, age variation, dialects, accents, an...
#2 TOP PAPER (Score: 75)
Beatrice Savoldi, Sara Papi, Wafa Aissa ... · arXiv
Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting practices across state-of-the-art speech model rele...
#3 TOP PAPER (Score: 74)
Yiming Sun, Chen Chen, Zifan Zhou ... · arXiv (Preprint)
Phone-use Agents can execute complex tasks end to end across real mobile applications. By operating a real device on the user's behalf, they reach far more functionalities than CLI agents, which amplifies the real-world harm they can cause when driven for malicious purposes. We p...
Wednesday, July 01, 2026
Yibo Bai, Sizhou Chen, Michele Panariello ... · IEEE/ACM Transactions on Audio, Speech, and Language Processing (Inferred from "Journal of Class Files... August 2021" and IEEE keywords, though likely an arXiv preprint version of a journal submission)
Modern automatic speaker verification (ASV) systems are vulnerable to adversarial perturbations. Diffusion-based purification has recently shown strong effectiveness against such perturbations, but its reverse denoising process requires iterative sampling and leads to high infere...
Siyi Wang, James Bailey, Ting Dang · arXiv (Submitted to ICML 2026 based on footer)
While prior work has explored emotion control in hybrid text-to-speech systems, the geometric properties of these modules, and their implications for steerability, remain poorly understood. We present the first comparative study of speech language model (SLM) and conditional flow...
Michael Tatarjitzky, Vladimir Tourbabin, Boaz Rafaely · arXiv (Submitted to IEEE, likely IEEE/ACM TASLP or similar based on formatting, but venue listed as arXiv in metadata)
Multichannel Deep Neural Networks (DNNs) have significantly improved speech enhancement performance; however, they typically remain constrained by reliance on fixed microphone array geometries, leading to poor generalization on unseen or irregular configurations. Current array-ag...
Tuesday, June 30, 2026
Liming Wang, Neguine Rezaii, Bradford C. Dickerson ... · arXiv
Multimodal large language models (MLLMs) have emerged as a promising approach for improving the accuracy, transferability, and explainability of automatic dementia classification (ADC) systems from voice recordings. Yet it remains unclear whether their reasoning capabilities are ...
Carlos Penarrubia, Antonio Rios-Vila, Eliseo Fuentes-Martinez ... · arXiv
Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score ...
Philipp Grundhuber, Emanuël A. P. Habets · Interspeech 2026
Some neural audio codecs disentangle speech into latent subspaces encoding content, speaker identity, and acoustics, enabling acoustic teleportation and voice conversion. Existing evaluations rely on cross-reconstruction quality, which cannot reliably detect leakage across partit...
Monday, June 29, 2026
Yoonjeong Park, Jaekwon Im, Juhan Nam · Interspeech 2026
Text-based singing voice editing (SVE) aims to revise sung lyrics while preserving the original melody, total duration, and non-edited regions. In this paper, we propose MeloDISinger, a flow-matching-based SVE model for melody-aware and duration-preserving editing. Its core modul...
Qiyang Sun, Yi Chang, Zixing Zhang ... · arXiv (preprint)
Speech conveys rich emotional information. As Speech Emotion Recognition (SER) is usually deployed in privacy-sensitive and reliability-critical environments, adversarial attacks on SER have attracted increasing attention. Existing sparse attacks control the number of perturbed e...
Pranav Tushar, Xiao Xiao Miao, Rong Tong · INTERSPEECH 2026
Voice anonymization aims to protect speaker identity while preserving linguistic content and speech usability. However, most anonymization systems are developed on adult speech, leading to degraded performance when applied to child speech. This paper investigates child-centric an...
Sunday, June 28, 2026
Hoyeol Sohn, Juhan Nam · INTERSPEECH 2026
Variable frame rate (VFR) coding has recently emerged in neural speech codecs, allocating fewer frames to redundant regions and more frames to rapidly changing speech. VFR must transmit side information about retained time steps, but prior gains are either not rigorously addresse...
Yichi Wang, Junzhe Chen, Wangjin Zhou ... · arXiv
In long-form multi-party conversations, highly imbalanced speaker activity and frequent overlap make it difficult to identify "who spoke when and what". Sliding-window continuous speech separation (CSS) mitigates sparse supervision, but often suffers from cross-window speaker inc...
Sujin Koo, Sangyoon Kim, Ji Sub Um ... · Interspeech 2026
Noise-robust bandwidth expansion aims to reconstruct high-fidelity wideband speech from noisy low-resolution inputs. While flow matching has shown strong performance in speech generation, accurately recovering clean speech from noisy inputs remains challenging due to the ambiguit...
Piyush Arora, Navlika Singh, Umberto Cappellazzo ... · INTERSPEECH 2026
Audio-Visual Speech Recognition takes two input modalities, acoustic and visual streams, where visual information from lip movements aids recognition when audio is noisy. Recently, LLM-based AVSR models have emerged as a promising paradigm by connecting pre-trained audio-visual e...
Saturday, June 27, 2026
Fengjie Lu, Chenang Jiang, Jiarui Hai ... · arXiv
Recent advances in language–audio retrieval have been largely driven by contrastive dual-encoder architectures that align audio and text in a shared embedding space. While effective, existing retrieval embeddings are primarily optimized for audio–caption matching, limiting thei...
Friday, June 26, 2026
Sihang Nie, Xiaofen Xing, Rui Xing ... · arXiv
Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimiz...
Jonghyeon Park, Olivier Jiyoun Jung, Myungwoo Oh · INTERSPEECH 2026
Early detection of dementia enables timely intervention, and reflecting cognitive impairment, spontaneous speech offers a non-invasive screening modality. Conventional approaches often focus on a single representational dimension, such as acoustic descriptors, pause modeling, a...
Thursday, June 25, 2026
Zahra Omidi, John H. L. Hansen · ICASSP 2026
The variations in vocal effort range (e.g. whisper, soft, neutral, loud, shout) alter production and speech acoustics, reducing intelligibility and limiting the robustness of any subsequent speech technology. Classification is challenging since effort lies on a continuum, adjacen...
Adhiraj Banerjee, Vipul Arora · INTERSPEECH 2026
Learning discrete speech representations that preserve similarity across variable-length utterances is central to query-by-example spoken term detection (QbE-STD). While wav2tok introduced CTC-based sequence alignment to enforce token consistency, its tightly coupled clustering a...
Xinyu Liang, Fredrik Cumlin, Victor Ungureanu ... · Interspeech 2026
We introduce DNSMOS-C, a compact end-to-end speech quality assessment model that extends the DNSMOS Pro framework by integrating a MOS-guided triplet-based contrastive loss. Applied directly to the intermediate embeddings, this contrastive supervision encourages the latent space ...