Audio ML Papers

Week of May 24 - May 31, 2026

Subcategories: All (45) | Speech Synthesis (13) | Music Synthesis (9) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (3) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (16)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Zhaoqing Li, Haoning Xu, Jingran Su ... · arXiv
We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing...
#2 TOP PAPER (Score: 87)
Jingping Fang, Lin Chen, Chenyang Xu ... · arXiv
Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditi...
#3 TOP PAPER (Score: 87)
Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani · ICML 2026
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 v...
Saturday, May 30, 2026
Sukru Samet Dindar, Riki Shimizu, Xilin Jiang ... · arXiv
Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inf...
Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen ... · arXiv
We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the densit...
Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams ... · arXiv
Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP),...
Friday, May 29, 2026
Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas ... · EUSIPCO 2026 (34th European Signal Processing Conference)
Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of th...
Guangyin Bao, Taiping Zeng, Jianfeng Feng ... · arXiv
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive ...
Kevin Everson, Mari Ostendorf · ACL 2026
Speech representations that capture prosodic information can be useful for both understanding and generation. However, speaker characteristics are reflected in acoustic-prosodic features (e.g., pitch). To address privacy concerns from the leakage of identity information, we propo...
Thursday, May 28, 2026
Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu ... · arXiv
Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies ...
Tiantian Feng, Anfeng Xu, Xuan Shi ... · arXiv
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, ...
Yonggang Zhu, Liting Gao, Aidong Men ... · arXiv
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Exis...
Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha ... · arXiv
While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper an...
Pedro H. L. Leite, Pedro Benevenuto Valadares, Luiz W. P. Biscainho · XLIV Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2026)
Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reli...
Bohan Li, Shi Lian, Hankun Wang ... · arXiv
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architect...
Sung-Lin Yeh, Wei Zhou, Gil Keren ... · arXiv
Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce ...
Heejoon Koo, Yoon Tae Kim, Miika Toikkanen ... · arXiv
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced de...
Zhaoyan Pan, Xiangdong Li, Wenke Wu ... · arXiv
Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonver...
Xiangyu Zhang, Yuxin Li, Haoyang Zhang ... · arXiv
The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the definitive proxy for representation quality....
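For readers unfamiliar with the metric named in the entry above, here is a minimal sketch of how Word Error Rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. It is not taken from the paper. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and real evaluations usually normalize text (case, punctuation) first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.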
S. Sutharya, Remya K. Sasi · arXiv
Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguis...
Wallace Abreu, Luiz W. P. Biscainho · arXiv
Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMa...
Wednesday, May 27, 2026
Jiahao Mei, Heinrich Dinkel, Yadong Niu ... · arXiv
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audi...
Jashin Ye, Dongxiao Wang, Yixuan Ye ... · arXiv
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenate...
Changhao Pan, Rui Yang, Han Wang ... · ACL 2026 (Findings)
Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test...
Chong Jing, Zitong Lan, Junan Zhang ... · arXiv
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR pr...
Zhisheng Zhang, Xiang Li, Yixuan Zhou ... · arXiv
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which incr...
Yuyue Wang, Xihua Wang, Xin Cheng ... · arXiv
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs an...
Ryan Fosdick · arXiv
We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoisin...
Yucheng Wang, Jing Peng, Hanqi Li ... · arXiv
Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence ac...
Tuesday, May 26, 2026
Xiao-Hang Jiang, Yang Ai, Hui-Peng Du ... · IEEE Transactions on Audio, Speech and Language Processing
High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neu...
Xudong Lu, Xueying Li, Annan Wang ... · arXiv
We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual s...
Bowen Li, Shaotong Guo, Zhen Wang ... · arXiv
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregr...
Yang Xiao, Siyi Wang, Han Yin ... · arXiv
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, ...
Abhinaba Roy, Junyi Liang, Dorien Herremans · arXiv
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework fo...
Monday, May 25, 2026
Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith ... · arXiv
Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal ...
Loukas Ilias, Dimitris Askounis · arXiv
Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve...
Junyang Chen, Yuhang Jia, Hui Wang ... · arXiv
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT...
Nobutaka Ito, Yoshiaki Bando · arXiv
Passive multi-target tracking (MTT) aims to infer the kinematic states of multiple targets from noisy sensor data in which contributions from unknown target-emitted signals are superposed. Track-before-detect (TBD) methods improve robustness to noise by operating directly on raw ...
Wangzixi Zhou, Bagus Tris Atmaja, Sakriani Sakti · Proc. 2025 28th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), pp. 1-6, 2025
While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have recently emerged, they often lack high-qualit...
Hui-Peng Du, Yang Ai, Xiao-Hang Jiang ... · IEEE/ACM Transactions on Audio, Speech, and Language Processing
Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, ...
Wangzixi Zhou, Takuma Okamoto, Yamato Ohtani ... · Proc. ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 17012-17016, 2026
Most neural vocoders are limited to one type: either GAN or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited performance in multi-speaker settings. Moreover, ...
Jinju Kim, Yunsung Kang, Gyeong-Moon Park ... · arXiv
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Exist...
Nobutaka Ito · arXiv
Mask-based blind speech separation (BSS) estimates source-wise time-frequency (TF) masks by clustering multichannel observations using spatial information. The directional statistical approach clusters normalized multichannel observations on the complex unit sphere, without expli...
Patricia Hu, Silvan Peter, Gerhard Widmer · Music Encoding Conference (MEC) 2026
In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in...
Sunday, May 24, 2026
Yang Xiao, Siyi Wang, Eun-Jung Holden ... · arXiv
Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations....