Audio ML Papers

Last 7 Days (April 17 - April 24, 2026)

Subcategories: All (36) | Speech Synthesis (8) | Music Synthesis (7) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (3) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (15)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Sihan Lv, Yechen Jin, Zhen Li ... · arXiv
Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-S...
#2 TOP PAPER (Score: 91)
Feiyu Zhao, Yiming Chen, Wenhuan Lu ... · ACL 2026
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain...
#3 TOP PAPER (Score: 88)
Aoduo Li, Haoran Lv, Shengmin Li ... · ACM ICMR 2026
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge...
Wednesday, April 22, 2026
Menghe Ma, Siqing Wei, Yuecheng Xing ... · arXiv
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to br...
Zhiyuan Ning, Zhanyong Tang, Xiaojiang Chen ... · arXiv
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in ...
Tong Zhao, Chenghao Zhang, Yutao Zhu ... · arXiv
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on imag...
Nan Xu, Shiheng Li, Shengchao Hou · arXiv
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphoni...
Paul A. Bereuter, Alois Sontacchi · DAGA 2026 (Annual German Conference on Acoustics)
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics correlate poorly with perceptual audio quality ratings from a listening test,...
Tuesday, April 21, 2026
Lekai Qian, Haoyu Gu, Jingwei Zhao ... · arXiv
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences...
Hyunjung Joo, GyeongTaek Lee · arXiv
The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous F0 contours to these invariant categories due to variable F0 realizatio...
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan ... · arXiv
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) traini...
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara ... · arXiv
The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cu...
Shuhai Peng, Hui Lu, Jinjiang Liu ... · arXiv
While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation du...
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan ... · ACL 2026
The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the resea...
Jianbo Ma, Richard Cartwright · arXiv
Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of...
Monday, April 20, 2026
Deshui Miao, Yameng Gu, Chao Yang ... · arXiv
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the propose...
Xiang He, Chenxing Li, Jinting Wang ... · arXiv
Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (...
Mason Wang, Cheng-Zhi Anna Huang · arXiv
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking laten...
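The core idea described in this abstract — taking a Fourier transform along the time axis of a latent sequence and masking frequency components to isolate patterns by timescale — can be illustrated with a toy NumPy sketch. The shapes, cutoff bin, and latent values below are illustrative assumptions, not details from the paper:

```python
import numpy as np

# Toy latent sequence: 256 time steps, 64 latent dimensions.
rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 64))

# Real FFT along the time axis: for each latent dimension, the
# low-frequency bins capture slow, long-timescale variation and
# the high bins capture fast, local variation.
spec = np.fft.rfft(latents, axis=0)

# Keep only the lowest `cutoff` frequency bins, zeroing the rest,
# so the reconstructed latents retain coarse structure only.
cutoff = 8
mask = np.zeros(spec.shape[0], dtype=bool)
mask[:cutoff] = True
filtered = np.fft.irfft(np.where(mask[:, None], spec, 0.0),
                        n=latents.shape[0], axis=0)

print(filtered.shape)  # same shape as the input latents: (256, 64)
```

Raising the cutoff admits faster-varying components; in a real diffusion-autoencoder setting the filtered latents would then be decoded back to audio, which this sketch does not attempt.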
Ho-Lam Chung, Yiming Chen, Hung-yi Lee · arXiv
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model...
Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak · ACL Findings (2026)
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often la...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, lea...
Hao Meng, Siyuan Zheng, Shuran Zhou ... · IEEE ICASSP 2026
Large Language Models (LLMs) show promise in lyric-to-melody generation, but models trained with Supervised Fine-Tuning (SFT) often produce musically implausible melodies with issues like poor rhythm and unsuitable vocal ranges, a phenomenon we term "constraint violation". To add...
HaeJun Yoo, Yongseop Shin, Insung Lee ... · ACL 2026 Main Conference
Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment o...
Sunday, April 19, 2026
Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj ... · arXiv
Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this pape...
Girish, Mohd Mujtaba Akhtar, Muskaan Singh · ACL 2026 (main)
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal...
Yi-Cheng Lin, Yusuke Hirota, Sung-Feng Huang ... · arXiv
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairnes...
Mohd Mujtaba Akhtar, Girish, Muskaan Singh · ACL 2026
In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec based synthetic speech in this work, since neural codec decoding forms a core building block in modern spee...
Saturday, April 18, 2026
Randall Ali, Thomas Dietzen, Matteo Scerbo ... · arXiv
We introduce a new framework for room acoustics modelling based on a state-space model of the boundary integral equation representing the sound field in a room. Whereas state-space models of linear time-invariant systems are traditionally constructed by means of a state vector an...
Yunchong Xiao, Yuxiang Zhao, Ziyang Ma ... · arXiv
The growing reliance on large-scale speech data has made privacy protection a critical concern. However, existing anonymization approaches often degrade data utility, for example by disrupting acoustic continuity or reducing vocal diversity, which compromises the value of speech ...
Friday, April 17, 2026
Jiaxin Ye, Gaoxiang Cong, Chenhui Wang ... · arXiv
Video-to-Speech (VTS) generation aims to synthesize speech from a silent video without auditory signals. However, existing VTS methods disregard the hierarchical nature of speech, which spans coarse speaker-aware semantics to fine-grained prosodic details. This oversight hinders ...
Benjamin Chou, Yi Zhu, Surya Koppisetti · arXiv
Audio deepfakes pose a significant security threat, yet current state-of-the-art (SOTA) detection systems do not generalize well to realistic in-the-wild deepfakes. We introduce a novel In-Context Learning paradigm with comparison-guidance for A...
Marie Maltais, Yejin Jeon, Min Ma ... · arXiv
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translat...
Xiquan Li, Aurian Quelennec, Slim Essid · arXiv
Music understanding and reasoning are central challenges in the Music Information Research field, with applications ranging from retrieval and recommendation to music agents and virtual assistants. Recent Large Audio-Language Models (LALMs) have shown remarkable progress in answe...
Tianle Liang, Yifu Chen, Shengpeng Ji ... · ACL 2026 Main Conference
Recent end-to-end spoken dialogue models enable natural interaction. However, as user demands become increasingly complex, models that rely solely on conversational abilities often struggle to cope. Incorporating agentic capabilities is therefore essential: by enabling tool use, ...
Heewon Oh · arXiv
We present ArtifactNet, a lightweight framework that detects AI-generated music by reframing the problem as forensic physics -- extracting and analyzing the physical artifacts that neural audio codecs inevitably imprint on generated audio. A bounded-mask UNet (ArtifactUNet, 3.6M ...
Liumeng Xue, Weizhen Bian, Jiahao Pan ... · arXiv
Non-verbal vocalizations (NVVs) like laugh, sigh, and sob are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We p...