Audio ML Papers

Last 7 Days (May 22 - May 29, 2026)

Subcategories: All (43) | Speech Synthesis (13) | Music Synthesis (8) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (2) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (16)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Nazif Can Tamer, Victoria Ebert, Guang Yang ... · arXiv
We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, an...
#2 TOP PAPER (Score: 89)
Yuanyuan Wang, Dongchao Yang, Yayue Deng ... · ACL 2026 (Main)
Evaluating speech generation still relies heavily on human judgments, such as Mean Opinion Score (MOS), which are expensive, subjective, and difficult to reproduce at scale. While a few recent studies have begun to explore AudioLLM-based judge models, existing efforts typically t...
#3 TOP PAPER (Score: 87)
Jingping Fang, Lin Chen, Chenyang Xu ... · arXiv
Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditi...
Thursday, May 28, 2026
Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu ... · arXiv
Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies ...
Tiantian Feng, Anfeng Xu, Xuan Shi ... · arXiv
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, ...
Yonggang Zhu, Liting Gao, Aidong Men ... · arXiv
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Exis...
Bohan Li, Shi Lian, Hankun Wang ... · arXiv
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architect...
Sung-Lin Yeh, Wei Zhou, Gil Keren ... · arXiv
Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce ...
Heejoon Koo, Yoon Tae Kim, Miika Toikkanen ... · arXiv
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced de...
Zhaoyan Pan, Xiangdong Li, Wenke Wu ... · arXiv
Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonver...
Xiangyu Zhang, Yuxin Li, Haoyang Zhang ... · arXiv
The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the definitive proxy for representation quality....
S. Sutharya, Remya K. Sasi · arXiv
Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguis...
Wednesday, May 27, 2026
Jiahao Mei, Heinrich Dinkel, Yadong Niu ... · arXiv
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audi...
Jashin Ye, Dongxiao Wang, Yixuan Ye ... · arXiv
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenate...
Changhao Pan, Rui Yang, Han Wang ... · ACL 2026 (Findings)
Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test...
Chong Jing, Zitong Lan, Junan Zhang ... · arXiv
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR pr...
Zhisheng Zhang, Xiang Li, Yixuan Zhou ... · arXiv
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which incr...
Yuyue Wang, Xihua Wang, Xin Cheng ... · arXiv
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs an...
Ryan Fosdick · arXiv
We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoisin...
Yucheng Wang, Jing Peng, Hanqi Li ... · arXiv
Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence ac...
Tuesday, May 26, 2026
Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani · ICML 2026
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 v...
Xiao-Hang Jiang, Yang Ai, Hui-Peng Du ... · IEEE Transactions on Audio, Speech and Language Processing
High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neu...
Xudong Lu, Xueying Li, Annan Wang ... · arXiv
We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual s...
Bowen Li, Shaotong Guo, Zhen Wang ... · arXiv
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregr...
Yang Xiao, Siyi Wang, Han Yin ... · arXiv
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, ...
Abhinaba Roy, Junyi Liang, Dorien Herremans · arXiv
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework fo...
Monday, May 25, 2026
Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith ... · arXiv
Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal ...
Loukas Ilias, Dimitris Askounis · arXiv
Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve...
Junyang Chen, Yuhang Jia, Hui Wang ... · arXiv
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT...
Nobutaka Ito, Yoshiaki Bando · arXiv
Passive multi-target tracking (MTT) aims to infer the kinematic states of multiple targets from noisy sensor data in which contributions from unknown target-emitted signals are superposed. Track-before-detect (TBD) methods improve robustness to noise by operating directly on raw ...
Wangzixi Zhou, Bagus Tris Atmaja, Sakriani Sakti · Proc. 2025 28th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), pp. 1-6, 2025
While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have recently emerged, they often lack high-qualit...
Hui-Peng Du, Yang Ai, Xiao-Hang Jiang ... · IEEE/ACM Transactions on Audio, Speech, and Language Processing
Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, ...
Wangzixi Zhou, Takuma Okamoto, Yamato Ohtani ... · Proc. ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 17012-17016, 2026
Most neural vocoders are limited to one type: either GAN-based or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited performance in multi-speaker settings. Moreover, ...
Jinju Kim, Yunsung Kang, Gyeong-Moon Park ... · arXiv
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Exist...
Nobutaka Ito · arXiv
Mask-based blind speech separation (BSS) estimates source-wise time-frequency (TF) masks by clustering multichannel observations using spatial information. The directional statistical approach clusters normalized multichannel observations on the complex unit sphere, without expli...
Patricia Hu, Silvan Peter, Gerhard Widmer · Music Encoding Conference (MEC) 2026
In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in...
Sunday, May 24, 2026
Yang Xiao, Siyi Wang, Eun-Jung Holden ... · arXiv
Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations....
Saturday, May 23, 2026
Yoonhyung Lee, Hyunsin Park, Jinhwan Park ... · ACL 2026 (Main Conference)
Recent advances in zero-shot text-to-speech (TTS) have enabled accurate imitation of reference speech in terms of both speaking style and speaker timbre. However, achieving disentangled control over these aspects from separate references remains a challenging task. Several studie...
Friday, May 22, 2026
Bin Lin, Bo Zhao, Boyong Wu ... · arXiv
Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems acro...
Zhaoyang Meng, Zhengyao Ma, Kecan Mao ... · arXiv
Neural speech codecs have become the discrete interface between raw audio and speech language models, yet they remain optimized primarily for acoustic reconstruction fidelity, which leaves emotion-relevant cues vulnerable to being discarded during quantization, limiting the affec...
Qingcao Li, Yipeng Lin, Weichen Lian ... · ICME 2026
Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-super...
Saebyeol Shin, Chao Wan, Zhenzhen Liu ... · arXiv
Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but...
Renhe Sun, Jiayi Zhou, Haolin He ... · arXiv
In this technical report, we describe our submission for the WildSpoof Challenge TTS Track: Text-to-Speech with In-the-Wild Data. We introduce F5-TTS-DPS, a model built upon the F5-TTS architecture. Our approach integrates Exponential Moving Average (EMA) into supervised fine-tun...