Audio ML Papers

Last 7 Days (April 14 - April 21, 2026)

Subcategories: All (21) | Speech Synthesis (6) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (9)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Sihan Lv, Yechen Jin, Zhen Li ... · arXiv
Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-S...
#2 TOP PAPER (Score: 85)
Longhao Li, Hongjie Chen, Zehan Li ... · arXiv
Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are of...
#3 TOP PAPER (Score: 85)
Xiaobin Rong, Zheng Wang, Yushi Wang ... · arXiv
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhanc...
Friday, April 17, 2026
Jiaxin Ye, Gaoxiang Cong, Chenhui Wang ... · arXiv
Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders ...
Marie Maltais, Yejin Jeon, Min Ma ... · arXiv
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translat...
Xiquan Li, Aurian Quelennec, Slim Essid · arXiv
Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answe...
Tianle Liang, Yifu Chen, Shengpeng Ji ... · ACL 2026 Main Conference
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, ...
Liumeng Xue, Weizhen Bian, Jiahao Pan ... · arXiv
Non-verbal vocalizations (NVVs) such as laughs, sighs, and sobs are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We p...
Thursday, April 16, 2026
Junyi Wang, Chi Zhang, Jing Qian ... · arXiv
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to ...
Jianxuan Yang, Xinyue Guo, Zhi Cheng ... · arXiv
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecis...
Kunlin Wu, Yanning Wang, Haofeng Tan ... · arXiv
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable sou...
Jieyi Wang, Yazhe Niu, Dexuan Xu ... · arXiv
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, while reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Ins...
Huanran Hu, Zihui Ren, Dingyi Yang ... · arXiv
Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and la...
Yanda Li, Yuhan Liu, Zirui Song ... · arXiv
Large audio-language models (LALMs) generalize across speech, sound, and music, but unified decoders can exhibit a "temporal smoothing bias": transient acoustic cues may be underutilized in favor of temporally smooth context that is better supported by language priors, leadi...
Yuxiang Wang, Hongyu Liu, Yijiang Xu ... · arXiv
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign reques...
Tuesday, April 14, 2026
Tsai-Ning Wang, Herman Teun den Dekker, Lin-Lin Chen ... · AHLI CHIL 2026
Automated respiratory audio analysis promises scalable, non-invasive disease screening, yet progress is limited by scarce labeled data and costly expert annotation. Zero-shot inference eliminates task-specific supervision, but existing methods apply uniform computation to every i...
Gaoxiang Cong, Liang Li, Jiaxin Ye ... · arXiv
Movie dubbing aims to synthesize speech that preserves the vocal identity of a reference audio while synchronizing with the lip movements in a target video. Existing methods fail to achieve precise lip-sync and lack naturalness due to explicit alignment at the duration level. Whi...
Changhao Cheng, Wei Wang, Wangyou Zhang ... · arXiv
Continuous speech representations based on Variational Autoencoders (VAEs) have emerged as a promising alternative to traditional spectrogram or discrete token based features for speech generation and reconstruction. Recent research has tried to enrich the structural information ...
James Brooks-Park, Søren Bech, Jan Østergaard ... · The Journal of the Acoustical Society of America, 159(4), 3006-3017 (2026)
Room compensation aims to improve the accuracy of loudspeaker reproduction in reverberant environments. Traditional methods, however, are limited to improving only spectral (timbral) and temporal accuracy, neglecting the spatial accuracy of loudspeaker reproduction. Proposed is a...
Luoyi Sun, Xiao Zhou, Zeqian Li ... · arXiv
Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from ...
Qixi Zheng, Yuxiang Zhao, Tianrui Wang ... · arXiv
Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challen...