Audio ML Papers

Last 7 Days (May 09 - May 16, 2026)

Subcategories: All (42) | Speech Synthesis (5) | Music Synthesis (7) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (3) | ASR (5) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (20)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Yeongtak Oh, Dongwook Lee, Sangkwon Park ... · arXiv
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language; unified omnimodal benchmarking that jointly covers text, image, and audio is still limited and lacking the methodological rigor ...
#2 TOP PAPER (Score: 91)
Guojian Li, Zhixian Zhao, Zhennan Lin ... · arXiv
While speech Large Language Models (LLMs) excel at conventional tasks like basic speech recognition, they lack fine-grained, multi-dimensional perception. This deficiency is evident in their struggle to disentangle complex features like micro-acoustic cues, acoustic scenes, and p...
#3 TOP PAPER (Score: 84)
Boda Xiao, Bo Wang, Heping Cheng · arXiv
Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test M...
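The two-step framework in this abstract hinges on a contrastive stage that retrieves the speech segment matching a brain-signal window. A minimal sketch of that kind of objective follows, assuming paired (MEG window, speech segment) training batches; the embedding inputs are hypothetical stand-ins, not the authors' architecture.

```python
# Minimal sketch of a contrastive retrieval objective of the kind described
# above. Assumes precomputed (B, D) embeddings from hypothetical MEG and
# speech encoders; this is not the paper's actual model.
import torch
import torch.nn.functional as F

def info_nce(meg_emb: torch.Tensor, speech_emb: torch.Tensor, tau: float = 0.07):
    """Symmetric InfoNCE over a batch of paired embeddings of shape (B, D)."""
    meg_emb = F.normalize(meg_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = meg_emb @ speech_emb.t() / tau          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each MEG window should rank its own speech segment highest, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# At test time, retrieval reduces to a nearest-neighbour lookup:
# scores = F.normalize(test_meg, dim=-1) @ F.normalize(candidates, dim=-1).t()
# best = scores.argmax(dim=-1)
```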
Thursday, May 14, 2026
Luis D. Reyes Vargas, Veronica Ruozzi, Andrea K. M. Ross ... · arXiv
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhan...
KiHyun Nam, Jungwoo Heo, Siu Bae ... · arXiv
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. ...
Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao ... · arXiv
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argume...
Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani ... · arXiv
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising ...
Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara · arXiv
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing method...
Dinanath Pathya, Sajen Maharjan, Binita Adhikari ... · arXiv
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target spee...
Shuyang Cui, Zhi Zhong, Qiyu Wu ... · arXiv
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial efforts of creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for...
Wednesday, May 13, 2026
Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong ... · arXiv
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we pro...
Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz ... · arXiv
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversatio...
Ningyuan Yang, Sile Yin, Li-Chia Yang ... · European Signal Processing Conference (EUSIPCO) 2026
High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly-labeled, and single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable ...
Zhongju Yuan, Geraint Wiggins, Dick Botteldooren · ICML 2026
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Archit...
Konstantinos Soiledis, Maximos Kaliakatsos-Papakostas, Dimos Makris ... · arXiv
Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features ...
Kush Juvekar, Kavya Manohar, Aditya Srinivas Menon ... · arXiv
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and...
Keshav Bhandari, Sungkyun Chang, Abhinaba Roy ... · arXiv
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored i...
Tuesday, May 12, 2026
Adam Wynn, Jingyun Wang · arXiv
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Wh...
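The abstract above describes fusing deep semantic embeddings from Whisper with other cues. One common way to obtain such embeddings is to mean-pool the Whisper encoder's hidden states via Hugging Face transformers, as sketched below; the fusion head and the prosodic feature dimension are illustrative assumptions, not the paper's design.

```python
# Illustrative sketch: pooling Whisper encoder states as "deep semantic
# embeddings" and fusing them with hand-crafted paralinguistic features.
# The fusion head and feature choices are assumptions, not the paper's design.
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
whisper = WhisperModel.from_pretrained("openai/whisper-base").eval()

def semantic_embedding(waveform, sr=16000):
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = whisper.encoder(inputs.input_features).last_hidden_state
    return hidden.mean(dim=1)        # (1, 512) mean-pooled utterance embedding

class FusionClassifier(torch.nn.Module):
    def __init__(self, sem_dim=512, prosody_dim=16, n_classes=2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(sem_dim + prosody_dim, 128),
            torch.nn.ReLU(),
            torch.nn.Linear(128, n_classes),
        )

    def forward(self, sem, prosody):
        # Concatenate semantic and paralinguistic features before classifying.
        return self.net(torch.cat([sem, prosody], dim=-1))
```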
Yiming Ren, Xuenan Xu, Ziyang Zhang ... · arXiv
Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human...
Che Liu, Lichao Ma, Xiangyu Tony Zhang ... · arXiv
Omni-modal language models are intended to jointly understand audio, visual inputs, and language, but benchmark gains can be inflated when visual evidence alone is enough to answer a query. We study whether current omni-modal benchmarks separate visual shortcuts from genuine audi...
Wen Shen Teo, Takafumi Moriya, Masato Mimura · ICASSP 2026, pp. 18282-18286
We propose the Chunkwise Aligner, a novel architecture for streaming automatic speech recognition (ASR). While the Transducer is the standard model for streaming ASR, its training is costly due to the need to compute all possible audio-label alignments. The recently introduced Al...
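The cost the Chunkwise Aligner targets comes from the Transducer loss marginalizing over every monotonic audio-label alignment. As a point of reference, the sketch below implements the standard RNN-T forward recursion (Graves, 2012) over a T×U lattice to show that baseline computation; it is not the proposed architecture.

```python
# Why Transducer training is costly: the loss marginalizes over all monotonic
# audio-label alignments via a T x U forward lattice (Graves, 2012).
# This illustrates the baseline computation, not the Chunkwise Aligner.
import numpy as np

def rnnt_forward_logprob(log_probs_blank, log_probs_emit):
    """log_probs_blank: (T, U+1) log P(blank | frame t, u labels emitted)
       log_probs_emit:  (T, U)   log P(label u+1 | frame t, u labels emitted)"""
    T, U1 = log_probs_blank.shape
    U = U1 - 1
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Advance a frame with a blank, or emit the next label in place.
            stay = alpha[t - 1, u] + log_probs_blank[t - 1, u] if t > 0 else -np.inf
            emit = alpha[t, u - 1] + log_probs_emit[t, u - 1] if u > 0 else -np.inf
            alpha[t, u] = np.logaddexp(stay, emit)
    # A final blank closes the lattice after all U labels are emitted.
    return alpha[T - 1, U] + log_probs_blank[T - 1, U]
```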
Chen Geng, Meng Chen, Ruohua Zhou ... · ICASSP 2026
Singing Voice Conversion (SVC) aims to transform a source singing voice into a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean ...
Jaehoon Ahn, Tae Gum Hwang, Moon-Ryul Jung · arXiv
Over the past two decades, the task of musical beat tracking has transitioned from heuristic onset detection algorithms to highly capable deep neural networks (DNNs). Although DNN-based beat tracking models achieve near-perfect performance on mainstream, percussive datasets, the S...
Joshua Opria · arXiv
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-sta...
Monday, May 11, 2026
Jiacheng Shi, Hongfei Du, Xinyuan Song ... · ACL Findings 2026
Neural speech codecs provide discrete representations for speech language models, but emotional cues are often degraded during quantization. Existing codecs mainly optimize acoustic reconstruction, leaving emotion expressiveness insufficiently modeled at the representation level....
Konstantinos Soiledis, Maximos Kaliakatsos-Papakostas, Dimos Makris ... · arXiv
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity...
Francesco Paissan, Luca Della Libera, Mirco Ravanelli ... · arXiv
Neural audio codecs provide compact discrete representations for speech generation and manipulation. However, most codecs organize tokens as frame-level sequences, making it difficult to study or intervene on global factors of variation. In this work, we propose the Latent Audio ...
Qijie You, Hao Liang, Mingrui Chen ... · arXiv
As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval has become increasingly important. In realistic search scenarios, this requires matching short user queries to long-form...
Haowen Li, Tianxiang Li, Yi Yang ... · ICML 2026
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. ...
Ege Erdem, Shoichi Koyama, Tomohiko Nakamura ... · arXiv
Reconstructing a 3D sound field from sparse microphone measurements is a fundamental yet ill-posed problem, which we address through Acoustic Transfer Function (ATF) magnitude estimation. ATF magnitude encapsulates key perceptual and acoustic properties of a physical space with a...
Dimos Makris, András Barják, Maximos Kaliakatsos-Papakostas · 2026 IEEE World Congress on Computational Intelligence, IJCNN Track
Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal s...
Piotr Kawa, Kornel Howil, Piotr Borycki ... · arXiv
Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. W...
Yakun Liu, Hai Luan, Dong Liu ... · arXiv
In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face pain points: high technical barriers...
Alejandro Luebs, Mithilesh Vaidya, Ishaan Kumar ... · arXiv
The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we de...
Sunday, May 10, 2026
Tianrui Wang, Ziyang Ma, Yizhou Peng ... · arXiv
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-dri...
Dong Yang, Yiyi Cai, Haoyu Zhang ... · arXiv
Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Ma...
Yifan Wang, Peiwu Wang, Yunxian Chi ... · ICMR 2026 (Main Track, Long Paper)
Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for distinguishing consistent and inconsistent cue...
Leduo Chen, Junchuan Zhao, Shengchen Li · arXiv
Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipel...
Saturday, May 09, 2026
Yuxin Kong, Peng Yang, Chongbin Yi ... · IEEE ICME 2026
Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional preprocessing subtasks and control optimizatio...
Tao Yu, Yiming Ding, Shenghua Chai ... · arXiv
Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-Deep...
Zheng Wang, Xiaobin Rong, Hang Su ... · arXiv
Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-i...
Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila · SMC 2026 conference
Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it syste...