Audio ML Papers

Last 7 Days (May 14 - May 21, 2026)

Subcategories: All (44) | Speech Synthesis (3) | Music Synthesis (7) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (2) | ASR (7) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (18)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Feiyan Zhou, Luyuan Wang, Shoufa Chen ... · arXiv
Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without...
#2 TOP PAPER (Score: 85)
Zhifei Xie, Kaiyu Pang, Haobin Zhang ... · arXiv
Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under s...
#3 TOP PAPER (Score: 84)
KiHyun Nam, Jungwoo Heo, Siu Bae ... · arXiv
As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. ...
Tuesday, May 19, 2026
Junyi Wang, Chi Zhang, Jing Qian ... · arXiv
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to ...
Zhong-Qiu Wang, Samuele Cornell · arXiv
In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signa...
Silvan Peter, Patricia Hu, Gerhard Widmer · Music Encoding Conference (MEC) 2026
Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in comparabl...
Andreas Triantafyllopoulos, Jakub Šťastný, Alexios Terpinas ... · arXiv
Reinforcement learning is a powerful learning paradigm that has spearheaded progress in numerous domains. Its core promise lies in learning through high-level goals without the need for granular labels. However, it still remains elusive in the realm of audio, where it has receive...
Hirotaka Nishikori, Nobutaka Ito, Kouei Yamaoka ... · arXiv
Distributed microphone arrays composed of multiple subarrays enable blind source separation over a wide spatial area. Directly applying fast multichannel nonnegative matrix factorization (FastMNMF) to all subarrays can exploit observations from all subarrays, but it requires repe...
Monday, May 18, 2026
Kai-Chen Tsai, Tien-Hong Lo, Yun-Ting Sun ... · arXiv
Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the chal...
Gyubin Lee, Junwon Lee, Juhan Nam · accepted to CVPR 2026 Workshop on Sight and Sound
We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchore...
Hengyan Huang, Xiaoxuan Guo, Jiayi Zhou ... · arXiv
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for ...
Jiatong Li, Wiebke Middelberg, Simon Doclo · arXiv
Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades sign...
Jianhong Ye, Haiquan Zhao, Shaohui Lv ... · arXiv
The conventional normalized subband p-norm (NSPN) algorithm achieves robustness in $α$-stable noise ($1<α\leq 2$) by utilizing low-order error moments. However, its performance degrades significantly under three scenarios: (1) non-Gaussian inputs, (2) $α$-stable noise with $0<α\l...
Yanru Wu, Jianning Wang, Chongxin Gan ... · arXiv
Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impac...
Chaolei Han, Hongsong Wang, Jie Gui · ICML 2026
Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degrada...
Jun Xue, Tong Zhang, Zhuolin Yi ... · IJCAI 2026
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that f...
Julian D. Parker, Zach Evans, CJ Carr ... · arXiv
Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and g...
Jing Zhao, KokSheik Wong, Vishnu Monn Baskaran ... · arXiv
The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has seen strides of progress in recent years, sonata form analysis remains in its early stages. This is largely due to t...
Zach Evans, Julian D. Parker, Matthew Rice ... · arXiv
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing full-length generations f...
Attia Nafees ul Haq, Zeyu Zhu, Jingbin Hu ... · arXiv
Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, US-EngPk. To address Right-t...
Md Hasan, Nyvenn Castro, Daiqi Liu ... · arXiv
Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, tempo...
Sunday, May 17, 2026
Yuanbo Hou, Zhaoyi Liu, Tong Ye ... · arXiv
Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging. However, annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The issue bec...
Weixing Wei, Raynaldi Lalang, Dichucheng Li ... · ICASSP 2026
This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note ...
Heejoon Koo · arXiv
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Eac...
Huakang Chen, Wenkai Cheng, Guobin Ma ... · arXiv
High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control due to their reliance on coars...
Keisuke Imoto, Yamato Kojima, Takao Tsuchiya · arXiv
Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impr...
Saturday, May 16, 2026
Huimeng Wang, Hui Lu, Jiajun Deng ... · arXiv
Continuous autoregressive speech synthesis has recently emerged as a promising direction for zero-shot text-to-speech (TTS). However, existing methods still suffer from a fundamental mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech represent...
Yuyang Yan, Sami O. Simons, Visara Urovi · arXiv
Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive respiratory disease monitoring. However, speech signals inherently carry speaker-i...
Prem Seetharaman, Rithesh Kumar · Proc. ICASSP 2026
Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a ...
Friday, May 15, 2026
Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds ... · arXiv
Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss....
Ningyuan Yang, Yize Li, Diego A. Cuji ... · arXiv
Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey ...
Changheon Han, Ashkan Panahi, Kıvanç Tatar · arXiv
Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, ...
Zhongjie Ba, Liang Yi, Peng Cheng ... · arXiv
Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key t...
Hidde Folkertsma, Thomas Tienkamp, Sebastiaan de Visscher ... · EMBC 2026
In recent years, the performance of automatic speech recognition (ASR) systems has made considerable progress. Unfortunately, for people with speech impairments, such as people treated for oral cancer (OC), ASR performance is still lagging behind. The scarcity and variability of ...
Yuqing Cheng, Xingyu Ma, Guochen Yu ... · arXiv
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imp...
Sebastian Braun · arXiv
Generative models are capable of addressing difficult problems with non-unique solutions like bandwidth extension and gap filling, removing highly non-linear artifacts from codecs, clipping and distortion, as opposed to removing linear additive components like noise and reverb. Whil...
Thursday, May 14, 2026
Alexander Polok, Ivan Medennikov, Jan Černocký ... · arXiv
Recent breakthroughs in multi-talker ASR (MT-ASR) and speaker diarization (SD) rely on synthetic data to mitigate the scarcity of large-scale conversational recordings, yet the impact of specific simulation choices remains poorly understood. To mind the gap between simulated mixt...
Luis D. Reyes Vargas, Veronica Ruozzi, Andrea K. M. Ross ... · arXiv
Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhan...
Yuyuan Liu, Yuanhong Chen, Chong Wang ... · arXiv
Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches either convert audio into visual prompts (e.g., boxes) via foundation models, or ...
Truong Thanh Hung Nguyen, Vo Thanh Khang Nguyen, Hoang-Loc Cao ... · arXiv
Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argume...
Mohammad Hossein Sameti, Diba Hadi Esfangereh, Sepehr Harfi Moridani ... · arXiv
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising ...
Ryo Magoshi, Takashi Maekaku, Yusuke Shinohara · arXiv
LLM-based automatic speech recognition models demonstrate strong performance by connecting audio encoders and LLMs. However, data scarcity of paired speech and transcription often hinders their adaptation to new domains, making text-only domain adaptation crucial. Existing method...
Dinanath Pathya, Sajen Maharjan, Binita Adhikari ... · arXiv
Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target spee...
Shuyang Cui, Zhi Zhong, Qiyu Wu ... · arXiv
Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial effort from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for...