Audio ML Papers

Last 7 Days (May 13 - May 20, 2026)

Subcategories: All (43) | Speech Synthesis (2) | Music Synthesis (8) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (3) | ASR (5) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (19)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Feiyan Zhou, Luyuan Wang, Shoufa Chen ... · arXiv
Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without...
#2 TOP PAPER (Score: 84)
Boda Xiao, Bo Wang, Heping Cheng · arXiv
Decoding speech from non-invasive brain signals is challenging. For the LibriBrain 2025 Speech Detection task, we propose a novel two-step framework that bypasses direct reconstruction. First, a contrastive learning model retrieves the matching speech segment for the given test M...
#3 TOP PAPER (Score: 84)
KiHyun Nam, Jungwoo Heo, Siu Bae ... · arXiv
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. ...
Monday, May 18, 2026
Kai-Chen Tsai, Tien-Hong Lo, Yun-Ting Sun ... · arXiv
Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the chal...
Hengyan Huang, Xiaoxuan Guo, Jiayi Zhou ... · arXiv
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for ...
Jiatong Li, Wiebke Middelberg, Simon Doclo · arXiv
Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades sign...
Jianhong Ye, Haiquan Zhao, Shaohui Lv ... · arXiv
The conventional normalized subband p-norm (NSPN) algorithm achieves robustness in $α$-stable noise ($1<α\leq 2$) by utilizing low-order error moments. However, its performance degrades significantly under three scenarios: (1) non-Gaussian inputs, (2) $α$-stable noise with $0<α\l...
Chaolei Han, Hongsong Wang, Jie Gui · ICML 2026
Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degrada...
Jun Xue, Tong Zhang, Zhuolin Yi ... · IJCAI 2026
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that f...
Julian D. Parker, Zach Evans, CJ Carr ... · arXiv
Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and g...
Jing Zhao, KokSheik Wong, Vishnu Monn Baskaran ... · arXiv
The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has made strides in recent years, sonata form analysis remains in its early stages. This is largely due to t...
Zach Evans, Julian D. Parker, Matthew Rice ... · arXiv
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing full-length generations f...
Attia Nafees ul Haq, Zeyu Zhu, Jingbin Hu ... · arXiv
Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, US-EngPk. To address Right-t...
Md Hasan, Nyvenn Castro, Daiqi Liu ... · arXiv
Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, tempo...
Sunday, May 17, 2026
Yuanbo Hou, Zhaoyi Liu, Tong Ye ... · arXiv
Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging. However, annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The issue bec...
Weixing Wei, Raynaldi Lalang, Dichucheng Li ... · ICASSP 2026
This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note ...
Heejoon Koo · arXiv
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Eac...
Huakang Chen, Wenkai Cheng, Guobin Ma ... · arXiv
High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control due to their reliance on coars...
Keisuke Imoto, Yamato Kojima, Takao Tsuchiya · arXiv
Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impr...
Saturday, May 16, 2026
Huimeng Wang, Hui Lu, Jiajun Deng ... · arXiv
Continuous autoregressive speech synthesis has recently emerged as a promising direction for zero-shot text-to-speech (TTS). However, existing methods still suffer from a fundamental mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech represent...
Yuyang Yan, Sami O. Simons, Visara Urovi · arXiv
Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive respiratory disease monitoring. However, speech signals inherently carry speaker-i...
Prem Seetharaman, Rithesh Kumar · Proc. ICASSP 2026
Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a ...
Friday, May 15, 2026
Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds ... · arXiv
Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss....
Ningyuan Yang, Yize Li, Diego A. Cuji ... · arXiv
Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey ...
Changheon Han, Ashkan Panahi, Kıvanç Tatar · arXiv
Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, ...
Zhongjie Ba, Liang Yi, Peng Cheng ... · arXiv
Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key t...
Yuqing Cheng, Xingyu Ma, Guochen Yu ... · arXiv
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imp...
Sebastian Braun · arXiv
Generative models are capable of addressing difficult problems with non-unique solutions like bandwidth extension and gap filling, removing highly non-linear artifacts from codecs, clipping and distortion, as opposed to removing linear additive components like noise and reverb. Whil...
Thursday, May 14, 2026
Alexander Polok, Ivan Medennikov, Jan Černocký ... · arXiv
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixt...
Luis D. Reyes Vargas, Veronica Ruozzi, Andrea K. M. Ross ... · arXiv
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhan...
Yuyuan Liu, Yuanhong Chen, Chong Wang ... · arXiv
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or ...
Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao ... · arXiv
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argume...
Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani ... · arXiv
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising ...
Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara · arXiv
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing method...
Dinanath Pathya, Sajen Maharjan, Binita Adhikari ... · arXiv
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target spee...
Shuyang Cui, Zhi Zhong, Qiyu Wu ... · arXiv
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for...
Wednesday, May 13, 2026
Terry Yi Zhong, Cristian Tejedor-Garcia, Khiet P. Truong ... · arXiv
Early-stage Parkinson's disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we pro...
Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz ... · arXiv
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversatio...
Ningyuan Yang, Sile Yin, Li-Chia Yang ... · European Signal Processing Conference (EUSIPCO) 2026
High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly-labeled, and single-source sound event dataset. The FSD50K dataset, despite being relatively large and open, contains a considerable ...
Zhongju Yuan, Geraint Wiggins, Dick Botteldooren · ICML 2026
Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Archit...
Konstantinos Soiledis, Maximos Kaliakatsos Papakostas, Dimos Makris ... · arXiv
Symbolic-control drum generation requires preserving explicit event timing and dynamics while synthesizing acoustically plausible waveforms. We present Sec2Drum-DAC, a conditional latent-diffusion model for symbolic-to-audio drum rendering. The model conditions on event features ...
Kush Juvekar, Kavya Manohar, Aditya Srinivas Menon ... · arXiv
Fine-tuning multilingual ASR models like Whisper for low-resource languages often improves read speech but degrades spontaneous audio performance, a phenomenon we term studio-bias. To diagnose this mismatch, we introduce Vividh-ASR, a complexity-stratified benchmark for Hindi and...
Keshav Bhandari, Sungkyun Chang, Abhinaba Roy ... · arXiv
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored i...