Audio ML Papers

Last 7 Days (May 15 - May 22, 2026)

Subcategories: All (48) | Speech Synthesis (4) | Music Synthesis (9) | Ambient Synthesis (4) | Quality Evaluation (0) | Enhancement (3) | ASR (6) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (19)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Masao Someki, Chien-yu ... · Findings of ACL 2026
Long-form audio understanding poses significant challenges for large audio language models (LALMs) due to the extreme length of audio sequences and the need to reason over heterogeneous acoustic cues distributed over time, such as speech content, speaker identity, emotion, and so...
#2 TOP PAPER (Score: 91)
Feiyan Zhou, Luyuan Wang, Shoufa Chen ... · arXiv
Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without...
#3 TOP PAPER (Score: 91)
Jaechul Roh, Jean-Philippe Monteuuis, Jonathan Petit ... · arXiv
Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against these attacks, real-world codec compression preprocessing has been studied to both de...
Wednesday, May 20, 2026
Haoyang Zhang, Jun Chen, Donghang Wu ... · arXiv
Recent advances in spoken dialogue language models have shifted from turn-based to full-duplex designs, where the model continuously listens to the user while generating responses. However, existing duplex backbones still lack a native channel for in-conversation planning and too...
Hongrui Zhang, Daiqing Wu, Yangyang Li ... · IJCAI 2026 Survey Track
Multimodal Emotion Recognition (MER) focuses on identifying and interpreting emotions from modality-compound inputs. Closely mirroring human cognitive processes in real-world environments, MER has drawn substantial attention from both academia and industry. Recently, a paradigm s...
Zhihan Guo, Wenqian Cui, Guan-Ting Lin ... · arXiv
Reasoning has become a defining capability of modern foundation models, yet its development in the audio modality remains limited. Audio poses challenges that are distinct from those of text and vision. It is continuous, temporally dense, and contains linguistic, paralinguistic, ...
Junyoung Koh · arXiv
Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isola...
Ilai Zaidel, Ori Engel, Bar Engel ... · arXiv
We propose a deep beamforming framework for enhancing target speaker(s) in multi-speaker environments. A deep neural network (DNN) is trained to estimate beamforming weights directly from noisy multichannel inputs while satisfying linear spatial constraints through an adaptive mu...
Semin Kim, Seungjun Chung, Taehong Moon ... · arXiv
Recent advances in text-to-speech (TTS) models show impressive speech naturalness and quality, yet the role of large-scale open data in driving this progress remains underexplored. In this work, we introduce Raon-OpenTTS, an open TTS model that performs competitively with state-o...
Shinnosuke Taksuka, Hideo Mukai · arXiv
This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as...
Tuesday, May 19, 2026
Zhifei Xie, Kaiyu Pang, Haobin Zhang ... · arXiv
Despite rapid advances in automatic speech recognition (ASR) and large audio-language models, robust recognition in real-world environments remains limited by an "acoustic robustness bottleneck": models often lose acoustic grounding and produce omissions or hallucinations under s...
Junyi Wang, Chi Zhang, Jing Qian ... · arXiv
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to ...
David Sundström, Filip Tronarp, Johan Lindström ... · arXiv
In sound field control applications, it is commonly assumed that one has access to an accurate representation of the sound field in the region of interest. This is a problematic assumption since the reconstruction of a sound field from available microphone measurements is especia...
Zhong-Qiu Wang, Samuele Cornell · arXiv
In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signa...
Silvan Peter, Patricia Hu, Gerhard Widmer · Music Encoding Conference (MEC) 2026
Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in comparabl...
Zijie Xin, Jie Yang, Ruixiang Zhao ... · arXiv
Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial comp...
Andreas Triantafyllopoulos, Jakub Šťastný, Alexios Terpinas ... · arXiv
Reinforcement learning is a powerful learning paradigm that has spearheaded progress in numerous domains. Its core promise lies in learning through high-level goals without the need for granular labels. However, it still remains elusive in the realm of audio, where it has receive...
Hirotaka Nishikori, Nobutaka Ito, Kouei Yamaoka ... · arXiv
Distributed microphone arrays composed of multiple subarrays enable blind source separation over a wide spatial area. Directly applying fast multichannel nonnegative matrix factorization (FastMNMF) to all subarrays can exploit observations from all subarrays, but it requires repe...
Ling Qi, Aleksandra Teng Ma, Alexandria Smith · International Computer Music Conference (ICMC) 2026
The I-Ching is one of the most influential texts in Chinese intellectual history, integrating divination, cosmology, and ethical reflection. While Western experimental music, most notably John Cage, has drawn on the I-Ching as a source of chance operation, such appropriations hav...
Monday, May 18, 2026
Kaiwen Luo, Zhenhong Zhou, Leo Wang ... · arXiv
The foundational capabilities established by Large Language Models (LLMs) have paved the way for Multimodal Large Language Models (MLLMs), within which Large Audio Language Models (LALMs) are essential for realizing universal auditory intelligence. Despite their remarkable perfor...
Kai-Chen Tsai, Tien-Hong Lo, Yun-Ting Sun ... · arXiv
Contextual biasing is essential to improving the recognition of rare and domain-specific words in an automatic speech recognition (ASR) system. While numerous methods have been proposed in recent years, most of them focus on offline settings and do not explicitly address the chal...
Gyubin Lee, Junwon Lee, Juhan Nam · CVPR 2026 Workshop on Sight and Sound
We investigate Counterfactual Video Foley Generation, which aims to adopt a sound-source identity that contradicts the visual evidence while remaining temporally synchronized to a silent video. Existing Video&Text-to-Audio (VT2A) models struggle with this, often remaining anchore...
Hengyan Huang, Xiaoxuan Guo, Jiayi Zhou ... · arXiv
Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, where speech and environmental sounds may be independently manipulated. To tackle this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for ...
Jiatong Li, Wiebke Middelberg, Simon Doclo · arXiv
Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades sign...
Jianhong Ye, Haiquan Zhao, Shaohui Lv ... · arXiv
The conventional normalized subband p-norm (NSPN) algorithm achieves robustness in $\alpha$-stable noise ($1<\alpha\leq 2$) by utilizing low-order error moments. However, its performance degrades significantly under three scenarios: (1) non-Gaussian inputs, (2) $\alpha$-stable noise with $0<\alpha\l...
Yanru Wu, Jianning Wang, Chongxin Gan ... · arXiv
Training general-purpose Audio Large Language Models (ALLMs) across diverse datasets is essential for holistic audio understanding, yet it faces significant challenges due to dataset heterogeneity, which often leads to conflicting gradients and slow convergence. Despite its impac...
Chaolei Han, Hongsong Wang, Jie Gui · ICML 2026
Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degrada...
Jun Xue, Tong Zhang, Zhuolin Yi ... · IJCAI 2026
The rapid advancement of generative AI has made audio deepfakes increasingly indistinguishable from authentic human vocals, posing significant threats to persons-of-interest (POI) such as public figures. Current detection systems primarily rely on generic, black-box models that f...
Julian D. Parker, Zach Evans, CJ Carr ... · arXiv
Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and g...
Jing Zhao, KokSheik Wong, Vishnu Monn Baskaran ... · arXiv
The sonata form is a musically rich and hierarchically structured form that poses significant challenges for automatic analysis. While music structure analysis has seen strides of progress in recent years, sonata form analysis remains in its early stages. This is largely due to t...
Zach Evans, Julian D. Parker, Matthew Rice ... · arXiv
Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing full-length generations f...
Attia Nafees ul Haq, Zeyu Zhu, Jingbin Hu ... · arXiv
Despite 230 million speakers, Urdu remains critically under-resourced in speech technology. We introduce UrduSpeech: a large high-fidelity Urdu corpus comprising 156 hours of audio with 12-dimension paralinguistic metadata, encompassing US-Std, US-CS, US-EngPk. To address Right-t...
Md Hasan, Nyvenn Castro, Daiqi Liu ... · arXiv
Real-time magnetic resonance imaging (rtMRI) of speech production enables non-invasive visualization of dynamic vocal-tract motion and is valuable for speech science and clinical assessment. However, rtMRI is fundamentally constrained by trade-offs among spatial resolution, tempo...
Sunday, May 17, 2026
Yuanbo Hou, Zhaoyi Liu, Tong Ye ... · arXiv
Weakly labeled datasets such as AudioSet have driven recent progress in audio tagging. However, annotation quality varies across sound classes. Labels may be incomplete, ambiguous, or unreliable, which introduces class-dependent supervision bias during optimisation. The issue bec...
Weixing Wei, Raynaldi Lalang, Dichucheng Li ... · ICASSP 2026
This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note ...
Heejoon Koo · arXiv
Robust selective auditory attention under multilingual interference is critical for reliable deployment of Large Audio Language Models (LALMs). We introduce MUSA, a cocktail party-inspired multilingual benchmark for source-grounded spoken-language understanding and reasoning. Eac...
Huakang Chen, Wenkai Cheng, Guobin Ma ... · arXiv
High-fidelity text-to-music generation typically relies on massive proprietary datasets and immense computational resources. Existing models often struggle to generate coherent pure musical accompaniments and lack precise, localized semantic control due to their reliance on coars...
Keisuke Imoto, Yamato Kojima, Takao Tsuchiya · arXiv
Finding sound effects or environmental sounds that match a creator's intended impression remains a largely manual process in multimedia production. This is especially relevant for comics and other visual media, where visually stylized onomatopoeic expressions convey auditory impr...
Saturday, May 16, 2026
Huimeng Wang, Hui Lu, Jiajun Deng ... · arXiv
Continuous autoregressive speech synthesis has recently emerged as a promising direction for zero-shot text-to-speech (TTS). However, existing methods still suffer from a fundamental mismatch between semantic-prosodic modeling and reconstruction-driven continuous speech represent...
Yuyang Yan, Sami O. Simons, Visara Urovi · arXiv
Early detection of exacerbations in asthma and chronic obstructive pulmonary disease (COPD) is important for timely intervention. Speech has emerged as a promising tool for continuous, non-invasive respiratory disease monitoring. However, speech signals inherently carry speaker-i...
Prem Seetharaman, Rithesh Kumar · Proc. ICASSP 2026
Latent diffusion models have emerged as the dominant paradigm for many generation tasks including audio generation such as text-to-audio, text-to-music and text-to-speech. A key component of latent diffusion is an autoencoder (VAE) that compresses high-dimensional signals into a ...
Friday, May 15, 2026
Kaitlyn Zhou, Federico Bianchi, Martijn Bartelds ... · arXiv
Artificially generated speech is increasingly embedded in everyday life. Voice cloning in particular enables applications where identity preservation is important, such as completing a recording, dubbing in a new language, or preserving the voices of individuals with speech loss....
Ningyuan Yang, Yize Li, Diego A. Cuji ... · arXiv
Audio super-resolution (SR), also referred to as bandwidth extension (BWE), aims to reconstruct high-fidelity signals from low-resolution (LR) or band-limited (BL) observations, an inherently ill-posed task due to the ambiguity of missing high-frequency (HF) content. This survey ...
Changheon Han, Ashkan Panahi, Kıvanç Tatar · arXiv
Training data attribution (TDA) for music generation must answer two questions that copyright analysis requires, namely which training songs influence a generated output and along which musical aspects the influence operates. Existing methods reduce influence to a single scalar, ...
Zhongjie Ba, Liang Yi, Peng Cheng ... · arXiv
Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key t...
Hidde Folkertsma, Thomas Tienkamp, Sebastiaan de Visscher ... · EMBC 2026
In recent years, the performance of automatic speech recognition (ASR) systems has made considerable progress. Unfortunately, for people with speech impairments, such as people treated for oral cancer (OC), ASR performance is still lagging behind. The scarcity and variability of ...
Yuqing Cheng, Xingyu Ma, Guochen Yu ... · arXiv
Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imp...
Sebastian Braun · arXiv
Generative models are capable of addressing difficult problems with non-unique solutions like bandwidth extension and gap filling, removing highly non-linear artifacts from codecs, clipping and distortion, as opposed to removing linear additive components like noise and reverb. Whil...