Audio ML Papers

Last 7 Days (April 13 - April 20, 2026)

Subcategories: All (23) | Speech Synthesis (2) | Music Synthesis (3) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (2) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (11)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Sreyan Ghosh, Arushi Goel, Kaousheik Jayakumar ... · arXiv
We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. Compared to Audio Flamingo 3, AF-Next introduces: (...
#2 TOP PAPER (Score: 90)
Qi Wang, Zhexu Shen, Meng Chen ... · arXiv
Vocal-to-accompaniment (V2A) generation, which aims to transform a raw vocal recording into a fully arranged accompaniment, inherently requires jointly addressing an accompaniment trilemma: preserving acoustic authenticity, maintaining global coherence with the vocal track, and p...
#3 TOP PAPER (Score: 85)
Longhao Li, Hongjie Chen, Zehan Li ... · arXiv
Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are of...
Thursday, April 16, 2026
Xiaobin Rong, Zheng Wang, Yushi Wang ... · arXiv
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhanc...
Junyi Wang, Chi Zhang, Jing Qian ... · arXiv
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to ...
Jianxuan Yang, Xinyue Guo, Zhi Cheng ... · arXiv
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecis...
Kunlin Wu, Yanning Wang, Haofeng Tan ... · arXiv
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable sou...
Jieyi Wang, Yazhe Niu, Dexuan Xu ... · arXiv
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, while reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Ins...
Huanran Hu, Zihui Ren, Dingyi Yang ... · arXiv
Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and la...
Yuxiang Wang, Hongyu Liu, Yijiang Xu ... · arXiv
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign reques...
Tuesday, April 14, 2026
Tsai-Ning Wang, Herman Teun den Dekker, Lin-Lin Chen ... · AHLI CHIL 2026
Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every i...
Gaoxiang Cong, Liang Li, Jiaxin Ye ... · arXiv
Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. Whi...
Changhao Cheng, Wei Wang, Wangyou Zhang ... · arXiv
Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete token based features for speech generation and reconstruction. Recent research has tried to enrich the structural information ...
James Brooks-Park, Søren Bech, Jan Østergaard ... · The Journal of the Acoustical Society of America, 159(4), 3006-3017 (2026)
Room compensation aims to improve the accuracy of loudspeaker reproduction in reverberant environments. Traditional methods, however, are limited to improving only spectral (timbral) and temporal accuracy, neglecting the spatial accuracy of loudspeaker reproduction. Proposed is a...
Luoyi Sun, Xiao Zhou, Zeqian Li ... · arXiv
Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from ...
Qixi Zheng, Yuxiang Zhao, Tianrui Wang ... · arXiv
Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challen...
Monday, April 13, 2026
Xi Chen, Wei Xue, Yike Guo · arXiv
Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiti...
Thomas Deppisch · arXiv
Multichannel speech enhancement is widely used as a front-end in microphone array processing systems. While most existing approaches produce a single enhanced signal, direction-preserving multiple-input multiple-output (MIMO) methods instead aim to provide enhanced multichannel s...
Shuiyuan Wang, Zhixian Zhao, Hongfei Yue ... · arXiv
Evaluating the emotional intelligence (EI) of audio language models (ALMs) is critical. However, existing benchmarks mostly rely on synthesized speech, are limited to single-turn interactions, and depend heavily on open-ended scoring. This paper proposes HumDial-EIBench, a compre...
Tao Feng, Yuxiang Wang, Yuancheng Wang ... · arXiv
Voice imitation aims to transform source speech to match a reference speaker's timbre and speaking style while preserving linguistic content. A straightforward approach is to train on triplets of (source, reference, target), where source and target share the same content but targ...
Jialing Wang, Yue Zhao, Yuhao Zhang ... · arXiv
Recent advances in Speech Large Language Models (Speech-LLMs) have made significant progress, greatly enhancing multimodal interaction capabilities. However, their application in low-resource and dialect-diverse environments still faces challenges. The severe scarcity of Tibetan d...
Xiangyu Zhang, Benjamin John Southwell, Siqi Pan ... · arXiv
Audio tokenization has emerged as a critical component in end-to-end audio language models, enabling efficient discrete representation learning for both audio understanding and generation tasks. However, existing audio tokenizers face fundamental limitations in understanding task...
Zhentao Liu, Milos Cernak · arXiv
The rapid advancement of generative AI has made it increasingly challenging to distinguish between deepfake audio and authentic human speech. To overcome the limitations of passive detection methods, we propose StreamMark, a novel deep learning-based, semi-fragile audio watermark...