Audio ML Papers

Last 7 Days (April 15 - April 22, 2026)

Subcategories: All (14) | Speech Synthesis (4) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (4)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Sihan Lv, Yechen Jin, Zhen Li ... · arXiv
Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-S...
#2 TOP PAPER (Score: 85)
Xiaobin Rong, Zheng Wang, Yushi Wang ... · arXiv
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhanc...
#3 TOP PAPER (Score: 84)
Junyi Wang, Chi Zhang, Jing Qian ... · arXiv
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to ...
Friday, April 17, 2026
Jiaxin Ye, Gaoxiang Cong, Chenhui Wang ... · arXiv
Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders ...
Marie Maltais, Yejin Jeon, Min Ma ... · arXiv
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translat...
Xiquan Li, Aurian Quelennec, Slim Essid · arXiv
Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answe...
Tianle Liang, Yifu Chen, Shengpeng Ji ... · ACL 2026 Main Conference
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, ...
Liumeng Xue, Weizhen Bian, Jiahao Pan ... · arXiv
Non-verbal vocalizations (NVVs) like laugh, sigh, and sob are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We p...
Thursday, April 16, 2026
Jianxuan Yang, Xinyue Guo, Zhi Cheng ... · arXiv
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecis...
Kunlin Wu, Yanning Wang, Haofeng Tan ... · arXiv
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable sou...
Jieyi Wang, Yazhe Niu, Dexuan Xu ... · arXiv
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, while reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Ins...
Huanran Hu, Zihui Ren, Dingyi Yang ... · arXiv
Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and la...
Yanda Li, Yuhan Liu, Zirui Song ... · arXiv
Large audio-language models (LALMs) generalize across speech, sound, and music, but unified decoders can exhibit a "temporal smoothing bias": transient acoustic cues may be underutilized in favor of temporally smooth context that is better supported by language priors, leadi...
Yuxiang Wang, Hongyu Liu, Yijiang Xu ... · arXiv
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign reques...