Audio ML Papers

Last 7 Days (May 12 - May 19, 2026)

Subcategories: All (29) | Speech Synthesis (1) | Music Synthesis (5) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (1) | ASR (6) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Guojian Li, Zhixian Zhao, Zhennan Lin ... · arXiv
While speech Large Language Models (LLMs) excel at conventional tasks like basic speech recognition, they lack fine-grained, multi-dimensional perception. This deficiency is evident in their struggle to disentangle complex features like micro-acoustic cues, acoustic scenes, and p...
#2 TOP PAPER (Score: 84)
Boda Xiao, Bo Wang, Heping Cheng · arXiv
Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test M...
#3 TOP PAPER (Score: 84)
KiHyun Nam, Jungwoo Heo, Siu Bae ... · arXiv
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. ...
Friday, May 15, 2026
Changheon Han, Ashkan Panahi, Kıvanç Tatar · arXiv
Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, ...
Zhongjie Ba, Liang Yi, Peng Cheng ... · arXiv
Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key t...
Yuqing Cheng, Xingyu Ma, Guochen Yu ... · arXiv
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imp...
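As a rough illustration of the tokenizer issue described above, the sketch below shows generic residual multi-codebook quantization followed by frame-major flattening. It is not the paper's tokenizer: the function names (`nearest`, `rvq_encode`, `flatten_codes`), the one-dimensional toy codebooks, and the id-offset scheme are all assumptions made for this example.

```python
def nearest(vec, codebook):
    """Index of the codeword with the smallest squared distance to vec."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vec, codebook[i])),
    )


def rvq_encode(frames, codebooks):
    """Each stage quantizes the residual left over by the previous stages."""
    codes = []
    for frame in frames:
        residual = list(frame)
        frame_codes = []
        for codebook in codebooks:
            idx = nearest(residual, codebook)
            frame_codes.append(idx)
            residual = [r - c for r, c in zip(residual, codebook[idx])]
        codes.append(frame_codes)
    return codes  # one list of stage indices per frame


def flatten_codes(codes, codebook_size):
    """Frame-major flattening; each stage gets its own id range (assumed scheme)."""
    return [
        stage * codebook_size + idx
        for frame_codes in codes
        for stage, idx in enumerate(frame_codes)
    ]


# Toy data: a coarse codebook and a fine codebook, both hypothetical.
codebooks = [
    [[-1.0], [0.0], [1.0]],
    [[-0.25], [0.0], [0.25]],
]
frames = [[0.8], [-0.3]]

codes = rvq_encode(frames, codebooks)        # [[2, 0], [1, 0]]
flat = flatten_codes(codes, codebook_size=3)  # [2, 3, 1, 3]
```

After flattening, the sequence is `num_frames * num_codebooks` tokens long, and each fine-stage token is only meaningful relative to the coarse token before it. That dependence is one plausible reading of why a residual hierarchy makes the flattened sequence harder for a language model.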
Sebastian Braun · arXiv
Generative models are capable of addressing difficult problems with non-unique solutions like bandwidth extension and gap filling, removing highly non-linear artifacts from codecs, clipping and distortion, as opposed to removing linear additive components like noise and reverb. Whil...
Thursday, May 14, 2026
Alexander Polok, Ivan Medennikov, Jan Černocký ... · arXiv
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixt...
Luis D. Reyes Vargas, Veronica Ruozzi, Andrea K. M. Ross ... · arXiv
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhan...
Yuyuan Liu, Yuanhong Chen, Chong Wang ... · arXiv
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or ...
Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao ... · arXiv
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argume...
Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani ... · arXiv
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising ...
Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara · arXiv
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing method...
Dinanath Pathya, Sajen Maharjan, Binita Adhikari ... · arXiv
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target spee...
Shuyang Cui, Zhi Zhong, Qiyu Wu ... · arXiv
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for...
Wednesday, May 13, 2026
Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong ... · arXiv
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we pro...
Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz ... · arXiv
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversatio...
Ningyuan Yang, Sile Yin, Li-Chia Yang ... · European Signal Processing Conference (EUSIPCO) 2026
High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly-labeled, and single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable ...
Zhongju Yuan, Geraint Wiggins, Dick Botteldooren · ICML 2026
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Archit...
Konstantinos Soiledis, Maximos Kaliakatsos Papakostas, Dimos Makris ... · arXiv
Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features ...
Kush Juvekar, Kavya Manohar, Aditya Srinivas Menon ... · arXiv
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and...
Keshav Bhandari, Sungkyun Chang, Abhinaba Roy ... · arXiv
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored i...
Tuesday, May 12, 2026
Adam Wynn, Jingyun Wang · arXiv
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Wh...
Yiming Ren, Xuenan Xu, Ziyang Zhang ... · arXiv
Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human...
Che Liu, Lichao Ma, Xiangyu Tony Zhang ... · arXiv
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audi...
Wen Shen Teo, Takafumi Moriya, Masato Mimura · Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 18282-18286
We propose the Chunkwise Aligner, a novel architecture for streaming automatic speech recognition (ASR). While the Transducer is the standard model for streaming ASR, its training is costly due to the need to compute all possible audio-label alignments. The recently introduced Al...
Chen Geng, Meng Chen, Ruohua Zhou ... · ICASSP 2026
Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean ...
Jaehoon Ahn, Tae Gum Hwang, Moon-Ryul Jung · arXiv
Over the past two decades, the task of musical beat tracking has transitioned from heuristic onset detection algorithms to highly capable deep neural networks (DNNs). Although DNN-based beat tracking models achieve near-perfect performance on mainstream, percussive datasets, the S...
Joshua Opria · arXiv
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-sta...