Audio ML Papers

Last 7 Days (June 26 - July 03, 2026)

Subcategories: All (24) | Speech Synthesis (0) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (2) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (22)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 81)
Yujie Tu, Yifan Yang, Tianrui Wang ... · arXiv
While modern ASR systems achieve low error rates on high-resource benchmarks, such performance often overestimates real-world robustness. Existing evaluations address challenges in isolation, lacking a unified benchmark for domain terminology, age variation, dialects, accents, an...
#2 TOP PAPER (Score: 75)
Shun Lei, Huaicheng Zhang, Dapeng Wu ... · arXiv
Full-length song generation must preserve coherence and musicality, render detailed vocal and accompaniment acoustics, and follow lyrics and prompts. Existing language model-based systems face a structural trade-off: mixed-token modeling preserves vocal-instrument coordination bu...
#3 TOP PAPER (Score: 74)
Yiming Sun, Chen Chen, Zifan Zhou ... · arXiv (preprint)
Phone-use Agents can execute complex tasks end to end across real mobile applications. By operating a real device on the user's behalf, they reach far more functionalities than CLI agents, which amplifies the real-world harm they can cause when driven for malicious purposes. We p...
Thursday, July 02, 2026
Haoran Wang, Jinchuan Tian, Siddhant Arora ... · arXiv
While Large Multimodal Models excel in comprehension, high-throughput inference engines lack native support for multimodal generation. This is severe in Speech Language Models, where generating multi-layered audio tokens via decoupled AR+NAR or synchronous Multi-Token Prediction ...
Ziyang Jiang, Yu Chen, Zexu Pan ... · arXiv
Humans can selectively attend to a target sound and estimate its direction in complex scenarios, whereas such selective localization remains challenging for current deep learning-based systems. Sound source localization (SSL) has achieved remarkable success with deep learning, ye...
Chengwei Liu, Shaofei Xue, Haoyin Yan ... · Interspeech 2026
We propose a lightweight multi-path alignment network (LMPAN) for on-device joint acoustic echo cancellation (AEC) and noise suppression (NS) in full-duplex spoken dialogue systems. To address hardware-induced distortions and dynamic acoustic conditions, we introduce three core i...
Z. Benslimane, P. Chouteau, M. Poreba ... · Interspeech 2026
Real-time binaural speech enhancement is constrained by latency, computational cost, and inter-device communication, yet existing efficient solutions predominantly address single-channel settings. In this paper, we introduce RT-Tango, a real-time distributed binaural speech enhan...
Balint Turi, Archontis Politis, Parthasaarathy Sudarsanam ... · EUSIPCO 2026
Estimating a speaker's head orientation from audio can provide valuable information in smart environments, meetings, and driver monitoring. We propose a novel approach that leverages the phase component of the short-time Fourier transform from a single microphone array as input t...
Wednesday, July 01, 2026
Yibo Bai, Sizhou Chen, Michele Panariello ... · IEEE/ACM Transactions on Audio, Speech, and Language Processing (venue inferred from the IEEE journal template and keywords; likely an arXiv preprint of a journal submission)
Modern automatic speaker verification (ASV) systems are vulnerable to adversarial perturbations. Diffusion-based purification has recently shown strong effectiveness against such perturbations, but its reverse denoising process requires iterative sampling and leads to high infere...
Siyi Wang, James Bailey, Ting Dang · arXiv (submitted to ICML 2026, per the paper footer)
While prior work has explored emotion control in hybrid text-to-speech systems, the geometric properties of these modules, and their implications for steerability, remain poorly understood. We present the first comparative study of speech language model (SLM) and conditional flow...
Michael Tatarjitzky, Vladimir Tourbabin, Boaz Rafaely · arXiv (formatting suggests an IEEE submission, possibly IEEE/ACM TASLP; metadata lists the venue as arXiv)
Multichannel Deep Neural Networks (DNNs) have significantly improved speech enhancement performance; however, they typically remain constrained by reliance on fixed microphone array geometries, leading to poor generalization on unseen or irregular configurations. Current array-ag...
Tuesday, June 30, 2026
Liming Wang, Neguine Rezaii, Bradford C. Dickerson ... · arXiv
Multimodal large language models (MLLMs) have emerged as a promising approach for improving the accuracy, transferability, and explainability of automatic dementia classification (ADC) systems from voice recordings. Yet it remains unclear whether their reasoning capabilities are ...
Jiaqi Li, Chaoren Wang, Xiaohai Tian ... · arXiv
Spoken language models (SLMs) extend LLMs to speech input and output. Existing SLMs represent speech at fixed frame rates (e.g., 25 or 12.5 Hz), ignoring the time-varying information density of speech and offering no flexibility to trade off quality for speed at inference time. R...
Carlos Penarrubia, Antonio Rios-Vila, Eliseo Fuentes-Martinez ... · arXiv
Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score ...
Monday, June 29, 2026
Yuxuan Hu, Heng Lu, Ruchao Fan ... · arXiv
Strong speech-to-text (S2T) LLMs already provide robust speech perception and text reasoning, but adding speech-to-speech (S2S) output is challenging: fine-tuning the backbone can degrade the original S2T performance, while attaching a downstream talker reintroduces a serial text...
Yoonjeong Park, Jaekwon Im, Juhan Nam · Interspeech 2026
Text-based singing voice editing (SVE) aims to revise sung lyrics while preserving the original melody, total duration, and non-edited regions. In this paper, we propose MeloDISinger, a flow-matching-based SVE model for melody-aware and duration-preserving editing. Its core modul...
Qiyang Sun, Yi Chang, Zixing Zhang ... · arXiv (preprint)
Speech conveys rich emotional information. As Speech Emotion Recognition (SER) is usually deployed in privacy-sensitive and reliability-critical environments, adversarial attacks on SER have attracted increasing attention. Existing sparse attacks control the number of perturbed e...
Sunday, June 28, 2026
Hoyeol Sohn, Juhan Nam · Interspeech 2026
Variable frame rate (VFR) coding has recently emerged in neural speech codecs, allocating fewer frames to redundant regions and more frames to rapidly changing speech. VFR must transmit side information about retained time steps, but prior gains are either not rigorously addresse...
Yichi Wang, Junzhe Chen, Wangjin Zhou ... · arXiv
In long-form multi-party conversations, highly imbalanced speaker activity and frequent overlap make it difficult to identify "who spoke when and what". Sliding-window continuous speech separation (CSS) mitigates sparse supervision, but often suffers from cross-window speaker inc...
Sujin Koo, Sangyoon Kim, Ji Sub Um ... · Interspeech 2026
Noise-robust bandwidth expansion aims to reconstruct high-fidelity wideband speech from noisy low-resolution inputs. While flow matching has shown strong performance in speech generation, accurately recovering clean speech from noisy inputs remains challenging due to the ambiguit...
Piyush Arora, Navlika Singh, Umberto Cappellazzo ... · Interspeech 2026
Audio-Visual Speech Recognition takes two input modalities, acoustic and visual streams, where visual information from lip movements aids recognition when audio is noisy. Recently, LLM-based AVSR models have emerged as a promising paradigm by connecting pre-trained audio-visual e...
Saturday, June 27, 2026
Fengjie Lu, Chenang Jiang, Jiarui Hai ... · arXiv
Recent advances in language–audio retrieval have been largely driven by contrastive dual-encoder architectures that align audio and text in a shared embedding space. While effective, existing retrieval embeddings are primarily optimized for audio–caption matching, limiting thei...
Friday, June 26, 2026
Sihang Nie, Xiaofen Xing, Rui Xing ... · arXiv
Recently, Large Language Model (LLM)-based Text-to-Speech (TTS) models have achieved remarkable naturalness. However, the standard Supervised Fine-Tuning paradigm often converges to statistically averaged prosody, limiting emotional expressiveness. While preference-driven optimiz...
Jonghyeon Park, Olivier Jiyoun Jung, Myungwoo Oh · Interspeech 2026
Early detection of dementia enables timely intervention, and reflecting cognitive impairment, spontaneous speech offers a non-invasive screening modality. Conventional approaches often focus on a single representational dimension, such as acoustic descriptors, pause modeling, a...