Audio ML Papers

Last 7 Days (April 16 - April 23, 2026)

Subcategories: All (38) | Speech Synthesis (8) | Music Synthesis (6) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (2) | ASR (3) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (15)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Sihan Lv, Yechen Jin, Zhen Li ... · arXiv
Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-S...
#2 TOP PAPER (Score: 91)
Feiyu Zhao, Yiming Chen, Wenhuan Lu ... · ACL 2026
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain...
#3 TOP PAPER (Score: 88)
Aoduo Li, Haoran Lv, Shengmin Li ... · ACM ICMR 2026
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge...
Tuesday, April 21, 2026
Lekai Qian, Haoyu Gu, Jingwei Zhao ... · arXiv
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences...
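
To make the sequence view concrete, here is a generic sketch of sequence-style symbolic-music tokenization (illustrative only, not this paper's scheme: the token names, time grid, and note format are all assumptions): each note becomes pitch/duration tokens on a quantized grid, so a language model can treat music as a token stream.

```python
# Generic sequence tokenization of symbolic music (illustrative sketch).
def tokenize_notes(notes, ticks_per_step=120):
    """notes: list of (start_tick, pitch, duration_ticks) tuples."""
    tokens, cursor = [], 0
    for start, pitch, dur in sorted(notes):
        shift = (start - cursor) // ticks_per_step
        if shift > 0:
            tokens.append(f"TIME_SHIFT_{shift}")  # advance the time cursor
        tokens.append(f"NOTE_{pitch}")            # MIDI pitch number
        tokens.append(f"DUR_{max(1, dur // ticks_per_step)}")
        cursor = start
    return tokens

# C major arpeggio, one step apart:
print(tokenize_notes([(0, 60, 120), (120, 64, 120), (240, 67, 120)]))
```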
Hyunjung Joo, GyeongTaek Lee · arXiv
The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous F0 contours to these invariant categories due to variable F0 realizatio...
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan ... · arXiv
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) traini...
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara ... · arXiv
The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cu...
Shuhai Peng, Hui Lu, Jinjiang Liu ... · arXiv
While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation du...
Jianbo Ma, Richard Cartwright · arXiv
Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of...
Monday, April 20, 2026
Deshui Miao, Yameng Gu, Chao Yang ... · arXiv
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the propose...
Xiang He, Chenxing Li, Jinting Wang ... · arXiv
Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (...
Mason Wang, Cheng-Zhi Anna Huang · arXiv
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking laten...
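
A minimal NumPy sketch of the general idea, assuming latents are masked in the temporal-frequency domain (the function name, shapes, and cutoff are illustrative, not the paper's interface): FFT a latent sequence along its time axis, keep or remove a band of temporal frequencies, and invert.

```python
import numpy as np

def mask_latent_frequencies(z, keep_low=True, cutoff=0.1):
    """Keep (or drop) long-timescale structure in a latent sequence z of shape (T, D)."""
    Z = np.fft.rfft(z, axis=0)               # FFT over the time axis
    freqs = np.fft.rfftfreq(z.shape[0])      # normalized temporal frequencies
    mask = freqs <= cutoff * freqs.max()     # low-frequency (slow-varying) band
    if not keep_low:
        mask = ~mask
    return np.fft.irfft(Z * mask[:, None], n=z.shape[0], axis=0)

# Example: keep only the slow-varying structure of a 256-step latent track.
z = np.random.randn(256, 64)
z_slow = mask_latent_frequencies(z, keep_low=True, cutoff=0.1)
```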
Ho-Lam Chung, Yiming Chen, Hung-yi Lee · arXiv
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model...
Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak · ACL Findings (2026)
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or audio. Because existing benchmarks often la...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, lea...
Hao Meng, Siyuan Zheng, Shuran Zhou ... · IEEE ICASSP 2026
Large Language Models (LLMs) show promise in lyric-to-melody generation, but models trained with Supervised Fine-Tuning (SFT) often produce musically implausible melodies with issues like poor rhythm and unsuitable vocal ranges, a phenomenon we term "constraint violation". To add...
HaeJun Yoo, Yongseop Shin, Insung Lee ... · ACL 2026 Main Conference
Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment o...
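
For context, CLAP-style retrieval reduces to ranking audio clips by cosine similarity to a text query in a shared embedding space; the sketch below illustrates that scoring step only (names and shapes are assumptions, not the paper's code):

```python
import numpy as np

def retrieve(query_emb, audio_embs, top_k=5):
    """Rank audio clips by cosine similarity to a text-query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    A = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = A @ q                       # cosine similarities, shape (N,)
    return np.argsort(-scores)[:top_k]   # indices of best-matching clips

rng = np.random.default_rng(0)
idx = retrieve(rng.standard_normal(512), rng.standard_normal((1000, 512)))
```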
Sunday, April 19, 2026
Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj ... · arXiv
Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models typically achieve audiovisual alignment by relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this pape...
Girish, Mohd Mujtaba Akhtar, Muskaan Singh · ACL 2026 (main)
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal...
Yi-Cheng Lin, Yusuke Hirota, Sung-Feng Huang ... · arXiv
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairnes...
Mohd Mujtaba Akhtar, Girish, Muskaan Singh · ACL 2026
In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec-based synthetic speech in this work, since neural codec decoding forms a core building block in modern spee...
Saturday, April 18, 2026
Randall Ali, Thomas Dietzen, Matteo Scerbo ... · arXiv
We introduce a new framework for room acoustics modelling based on a state-space model of the boundary integral equation representing the sound field in a room. Whereas state-space models of linear time-invariant systems are traditionally constructed by means of a state vector an...
Yunchong Xiao, Yuxiang Zhao, Ziyang Ma ... · arXiv
The growing reliance on large-scale speech data has made privacy protection a critical concern. However, existing anonymization approaches often degrade data utility, for example by disrupting acoustic continuity or reducing vocal diversity, which compromises the value of speech ...
Friday, April 17, 2026
Jiaxin Ye, Gaoxiang Cong, Chenhui Wang ... · arXiv
Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders ...
Benjamin Chou, Yi Zhu, Surya Koppisetti · arXiv
Audio deepfakes pose a significant security threat, yet current state-of-the-art (SOTA) detection systems do not generalize well to realistic in-the-wild deepfakes. We introduce a novel In-Context Learning paradigm with comparison-guidance for A...
Marie Maltais, Yejin Jeon, Min Ma ... · arXiv
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translat...
Xiquan Li, Aurian Quelennec, Slim Essid · arXiv
Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answe...
Tianle Liang, Yifu Chen, Shengpeng Ji ... · ACL 2026 Main Conference
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, ...
Heewon Oh · arXiv
We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics: extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M ...
Liumeng Xue, Weizhen Bian, Jiahao Pan ... · arXiv
Non-verbal vocalizations (NVVs) such as laughs, sighs, and sobs are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We p...
Thursday, April 16, 2026
Xiaobin Rong, Zheng Wang, Yushi Wang ... · arXiv
Universal speech enhancement (USE) aims to restore speech signals from diverse distortions across multiple sampling rates. We propose UniPASE, an extension of the low-hallucination PASE framework tailored for USE. At its core is DeWavLM-Omni, a unified representation-level enhanc...
Junyi Wang, Chi Zhang, Jing Qian ... · arXiv
In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to ...
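
For a sense of scale, a discrete-token codec's bitrate follows directly from its frame rate and codebook configuration; the numbers below are illustrative, not taken from the paper:

```python
import math

def codec_bitrate_bps(frame_rate_hz, n_codebooks, codebook_size):
    """Bitrate of a discrete-token codec in bits per second."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)

# A typical neural-codec setting vs. an ultra-low-bitrate one:
print(codec_bitrate_bps(75, 8, 1024))  # 6000.0 bps (~6 kbps)
print(codec_bitrate_bps(25, 1, 1024))  # 250.0 bps, the ultra-low regime
```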
Jianxuan Yang, Xinyue Guo, Zhi Cheng ... · arXiv
Recent advances in video-to-audio (V2A) generation enable high-quality audio synthesis from visual content, yet achieving robust and fine-grained controllability remains challenging. Existing methods suffer from weak textual controllability under visual-text conflict and imprecis...
Kunlin Wu, Yanning Wang, Haofeng Tan ... · arXiv
Recent image-to-audio models have shown impressive performance on object-centric visual scenes. However, their application to satellite imagery remains limited by the complex, wide-area semantic ambiguity of top-down views. While satellite imagery provides a uniquely scalable sou...
Jieyi Wang, Yazhe Niu, Dexuan Xu ... · arXiv
Recent Large Audio Language Models have demonstrated impressive capabilities in audio understanding. However, they often suffer from perceptual errors, while reliable audio reasoning is unattainable without first grounding the model's perception in structured auditory scenes. Ins...
Huanran Hu, Zihui Ren, Dingyi Yang ... · arXiv
Real-world video creation often involves a complex reasoning workflow of selecting relevant shots from noisy materials, planning missing shots for narrative completeness, and organizing them into coherent storylines. However, existing benchmarks focus on isolated sub-tasks and la...
Yanda Li, Yuhan Liu, Zirui Song ... · arXiv
Large audio-language models (LALMs) generalize across speech, sound, and music, but unified decoders can exhibit a temporal smoothing bias: transient acoustic cues may be underutilized in favor of temporally smooth context that is better supported by language priors, leadi...
Yuxiang Wang, Hongyu Liu, Yijiang Xu ... · arXiv
As speech language models (SLMs) transition from personal devices into shared, multi-user environments, their responses must account for far more than the words alone. Who is speaking, how they sound, and where the conversation takes place can each turn an otherwise benign reques...