Audio ML Papers

Last 7 Days (April 11 - April 18, 2026)

Subcategories: All (34) | Speech Synthesis (3) | Music Synthesis (6) | Ambient Synthesis (4) | Quality Evaluation (0) | Enhancement (2) | ASR (3) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (15)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar ... · arXiv
We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (...
#2 TOP PAPER (Score: 90)
Qi Wang, Zhexu Shen, Meng Chen ... · arXiv
Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, inherently requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and p...
#3 TOP PAPER (Score: 88)
Zeyue Tian, Binxin Yang, Zhaoyang Liu ... · arXiv
Recent progress in multimodal models has spurred rapid advances in audio understanding, generation, and editing. However, these capabilities are typically addressed by specialized models, leaving the development of a truly unified framework that can seamlessly integrate all three...
Thursday, April 16, 2026
Xiaobin Rong, Zheng Wang, Yushi Wang ... · arXiv
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhanc...
Junyi Wang, Chi Zhang, Jing Qian ... · arXiv
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to ...
Jianxuan Yang, Xinyue Guo, Zhi Cheng ... · arXiv
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecis...
Kunlin Wu, Yanning Wang, Haofeng Tan ... · arXiv
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable sou...
Jieyi Wang, Yazhe Niu, Dexuan Xu ... · arXiv
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, while reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Ins...
Huanran Hu, Zihui Ren, Dingyi Yang ... · arXiv
Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and la...
Yuxiang Wang, Hongyu Liu, Yijiang Xu ... · arXiv
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign reques...
Tuesday, April 14, 2026
Longhao Li, Hongjie Chen, Zehan Li ... · arXiv
Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are of...
Tsai-Ning Wang, Herman Teun den Dekker, Lin-Lin Chen ... · AHLI CHIL 2026
Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every i...
Gaoxiang Cong, Liang Li, Jiaxin Ye ... · arXiv
Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. Whi...
Changhao Cheng, Wei Wang, Wangyou Zhang ... · arXiv
Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete token based features for speech generation and reconstruction. Recent research has tried to enrich the structural information ...
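Background for this entry: a minimal sketch (illustrative only, not the paper's architecture; the mel input and all layer sizes are assumptions) of how a frame-level VAE produces the continuous speech latents the abstract contrasts with spectrograms and discrete tokens.

```python
# Illustrative frame-level VAE encoder -- NOT the paper's model.
# The encoder predicts a per-frame mean and log-variance; a continuous latent
# z = mu + sigma * eps is sampled via the reparameterization trick.
import torch
import torch.nn as nn

class FrameVAEEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, latent_dim: int = 32):  # sizes are assumptions
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)

    def forward(self, mel_frames: torch.Tensor):
        # mel_frames: (batch, time, n_mels)
        h = self.net(mel_frames)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        # KL divergence to a unit Gaussian prior, averaged over batch and time
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return z, kl
```

The continuous latent z, rather than a quantized token id, is what downstream speech generators consume in this line of work.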
James Brooks-Park, Søren Bech, Jan Østergaard ... · The Journal of the Acoustical Society of America, 159(4), 3006-3017 (2026)
Room compensation aims to improve the accuracy of loudspeaker reproduction in reverberant environments. Traditional methods, however, are limited to improving only spectral (timbral) and temporal accuracy, neglecting the spatial accuracy of loudspeaker reproduction. Proposed is a...
Luoyi Sun, Xiao Zhou, Zeqian Li ... · arXiv
Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from ...
Qixi Zheng, Yuxiang Zhao, Tianrui Wang ... · arXiv
Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challen...
Monday, April 13, 2026
Xi Chen, Wei Xue, Yike Guo · arXiv
Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiti...
Thomas Deppisch · arXiv
Multichannel speech enhancement is widely used as a front-end in microphone array processing systems. While most existing approaches produce a single enhanced signal, direction-preserving multiple-input multiple-output (MIMO) methods instead aim to provide enhanced multichannel s...
Shuiyuan Wang, Zhixian Zhao, Hongfei Yue ... · arXiv
Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a compre...
Tao Feng, Yuxiang Wang, Yuancheng Wang ... · arXiv
Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but targ...
Jialing Wang, Yue Zhao, Yuhao Zhang ... · arXiv
Recent advances in Speech Large Language Models (Speech-LLMs) have greatly enhanced multimodal interaction capabilities. However, their application in low-resource and dialect-diverse environments still faces challenges. The severe scarcity of Tibetan d...
Xiangyu Zhang, Benjamin John Southwell, Siqi Pan ... · arXiv
Audio tokenization has emerged as a critical component in end-to-end audio language models, enabling efficient discrete representation learning for both audio understanding and generation tasks. However, existing audio tokenizers face fundamental limitations in understanding task...
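For context on what such tokenizers discretize: residual vector quantization (RVQ) is the scheme behind many neural audio codecs, and the sketch below shows its core loop. The codebooks here are generic placeholders, not this paper's design.

```python
# Minimal RVQ encoding sketch (generic, not this paper's tokenizer).
import torch

def rvq_encode(x: torch.Tensor, codebooks: list[torch.Tensor]) -> list[torch.Tensor]:
    """x: (batch, dim); each codebook: (codebook_size, dim). Returns token ids per stage."""
    residual, ids = x, []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # distance to every codeword
        idx = dists.argmin(dim=-1)          # nearest codeword id per item
        ids.append(idx)
        residual = residual - cb[idx]       # quantization error feeds the next stage
    return ids
```

Each stage quantizes the previous stage's residual error, so later codebooks add finer detail; the per-stage ids are the discrete tokens an audio language model consumes.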
Zhentao Liu, Milos Cernak · arXiv
The rapid advancement of generative AI has made it increasingly challenging to distinguish between deepfake audio and authentic human speech. To overcome the limitations of passive detection methods, we propose StreamMark, a novel deep learning-based, semi-fragile audio watermark...
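StreamMark itself is a learned, semi-fragile watermark; purely as background, here is the classic spread-spectrum baseline that active-detection methods build on. The seed, amplitude, and correlation detector are generic textbook choices, not the paper's design.

```python
# Classic additive spread-spectrum audio watermarking (background baseline only).
import numpy as np

def embed(audio: np.ndarray, seed: int, alpha: float = 1e-3) -> np.ndarray:
    rng = np.random.default_rng(seed)
    carrier = rng.standard_normal(audio.shape)   # keyed pseudo-random carrier
    return audio + alpha * carrier               # low-amplitude additive mark

def detect(audio: np.ndarray, seed: int) -> float:
    rng = np.random.default_rng(seed)
    carrier = rng.standard_normal(audio.shape)   # regenerate the same carrier
    # Normalized correlation: values well above chance indicate the mark survived.
    return float(audio @ carrier / (np.linalg.norm(audio) * np.linalg.norm(carrier)))
```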
Sunday, April 12, 2026
Matteo Spanio, Ilay Guler, Antonio Rodà · arXiv
Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by e...
Toranosuke Manabe, Yuto Shibata, Shinnosuke Takamichi ... · ICPR 2026
Deep learning models have improved sign language-to-text translation and made it easier for non-signers to understand signed messages. When the goal is spoken communication, a naive approach is to convert signed messages into text and then synthesize speech via Text-to-Speech (TT...
Shivam Chauhan, Ajay Pundhir · ICASSP 2026
Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-en...
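The mel scale the abstract critiques is, in its common variant, m = 2595 · log10(1 + f/700). The quick check below shows the warping at issue: equal mel steps correspond to ever-wider frequency bands, concentrating resolution at low frequencies.

```python
# The standard mel <-> Hz mapping (O'Shaughnessy's common variant).
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Ten equal mel steps from 0 to 8 kHz yield monotonically widening Hz bands:
edges_hz = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 11))
print(np.diff(edges_hz).round(1))
```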
Qian Zhang, Yuqin Cao, Yixuan Gao ... · arXiv
Video-to-Audio (V2A) generation is essential for immersive multimedia experiences, yet its evaluation remains underexplored. Existing benchmarks typically assess diverse audio types under a unified protocol, overlooking the fine-grained requirements of distinct audio categories. ...
Hongwei Xu · arXiv
MeloTune is an iPhone-deployed music agent that instantiates the Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) as a production system for affect-aware music curation with peer-to-peer mood coupling. Each device runs two closed-form continuous-time (CfC) n...
Jielin Qiu, Ming Zhu, Wenting Zhao ... · arXiv
Audio-native large language models (audio-LLMs) commonly use Whisper as their audio encoder. However, Whisper was trained exclusively on speech data, producing weak representations for music and environmental sound. This forces downstream audio-LLMs to compensate through extensiv...
Saturday, April 11, 2026
Xingjian Yang, Yudong Yang, Zhixing Guo ... · arXiv
The psychological profile that structurally documents the case of a depression patient is essential for psychotherapy. Large language models can be applied to summarize the profiles from counseling speech; however, they may suffer from long-context forgetting and produce unverifiab...
Hangbin Yu, Yudong Yang, Rongfeng Su ... · arXiv
Automatic depression detection using speech signals with acoustic and textual modalities is a promising approach for early diagnosis. Depression-related patterns exhibit sparsity in speech: diagnostically relevant features occur in specific segments rather than being uniformly di...
Ori Yonay, Tracy Hammond, Tianbao Yang · arXiv
Self-supervised music foundation models underperform on key detection, which requires pitch-sensitive representations. In this work, we present the first systematic study showing that the design of self-supervised pretraining directly impacts pitch sensitivity, and demonstrate th...
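As context for why key detection probes pitch sensitivity: the classic Krumhansl-Schmuckler baseline matches a 12-bin chroma vector against rotated key profiles, so it only works if pitch-class information survives in the representation. This is a standard MIR baseline, not the paper's method; the profiles are the widely used Krumhansl-Kessler values.

```python
# Krumhansl-Schmuckler key estimation from a 12-bin chroma vector (MIR baseline).
import numpy as np

MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, 3.17])

def estimate_key(chroma: np.ndarray) -> tuple[int, str]:
    """chroma: 12-bin pitch-class energy vector. Returns (tonic pitch class 0-11, mode)."""
    best = (-2.0, 0, "major")
    for tonic in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            r = np.corrcoef(np.roll(profile, tonic), chroma)[0, 1]  # template match
            if r > best[0]:
                best = (r, tonic, mode)
    return best[1], best[2]
```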
Mariano Fernández Méndez · arXiv
Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-spe...
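The abstract is cut off before it defines the method, so the following is only a guess at the general shape of descriptor injection: concatenating modality-specific descriptor features onto each branch's embedding before projecting into a shared retrieval space. Every module name and dimension below is hypothetical.

```python
# Hypothetical descriptor-injection branch for cross-modal retrieval.
import torch
import torch.nn as nn

class DescriptorInjectedBranch(nn.Module):
    def __init__(self, emb_dim: int, desc_dim: int, shared_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(emb_dim + desc_dim, shared_dim)

    def forward(self, emb: torch.Tensor, desc: torch.Tensor) -> torch.Tensor:
        # emb: learned embedding (batch, emb_dim); desc: modality-specific descriptors
        z = self.proj(torch.cat([emb, desc], dim=-1))    # inject, then project
        return nn.functional.normalize(z, dim=-1)        # cosine retrieval space
```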