Audio ML Papers

Last 7 Days (April 18 - April 25, 2026)

Subcategories: All (32) | Speech Synthesis (5) | Music Synthesis (6) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (3) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (15)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Feiyu Zhao, Yiming Chen, Wenhuan Lu ... · ACL 2026
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain...
#2 TOP PAPER (Score: 88)
Aoduo Li, Haoran Lv, Shengmin Li ... · ACM ICMR 2026
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge...
#3 TOP PAPER (Score: 88)
Menghe Ma, Siqing Wei, Yuecheng Xing ... · arXiv
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to br...
Thursday, April 23, 2026
Chengyou Wang, Hongfei Yue, Guojian Li ... · arXiv
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversatio...
Jialong Mai, Xiaofen Xing, Xiangmin Xu · arXiv
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowled...
Noah Jaffe, John Ashley Burgoyne · arXiv
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer i...
Wednesday, April 22, 2026
Zhiyuan Ning, Zhanyong Tang, Xiaojiang Chen ... · arXiv
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in ...
Tong Zhao, Chenghao Zhang, Yutao Zhu ... · arXiv
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on imag...
Nan Xu, Shiheng Li, Shengchao Hou · arXiv
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphoni...
Paul A. Bereuter, Alois Sontacchi · DAGA 2026 (Annual German Conference on Acoustics)
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics correlate poorly with perceptual audio quality ratings from a listening test,...
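As background for readers: the energy-ratio style of metric this line of work revisits can be illustrated with a scale-invariant SDR computation. This is a simplified stand-in for full BSS-Eval (which uses richer distortion decompositions), not the paper's method; all values below are illustrative.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant SDR in dB (simplified BSS-Eval-style energy ratio)."""
    # Project the estimate onto the reference to isolate the target component.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)               # 1 s of "reference" audio at 16 kHz
est = ref + 0.1 * rng.standard_normal(16000)   # estimate with mild additive error
print(f"{si_sdr(ref, est):.1f} dB")            # ≈ 20 dB for 10% noise amplitude
```

The projection step is what makes the metric scale-invariant: rescaling the estimate leaves the score unchanged, so the metric measures distortion shape rather than loudness, which is exactly the property whose perceptual relevance this kind of study questions.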
Jiaying Meng, Bojie Li · arXiv
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task...
Tuesday, April 21, 2026
Lekai Qian, Haoyu Gu, Jingwei Zhao ... · arXiv
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences...
Hyunjung Joo, GyeongTaek Lee · arXiv
The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous F0 contours to these invariant categories due to variable F0 realizatio...
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan ... · arXiv
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) traini...
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara ... · arXiv
The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cu...
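The fundamental limit mentioned here is the Johnson–Nyquist (thermal) noise of the gate-bias resistor, with voltage noise density √(4kTR); the same resistor, together with the capsule capacitance, sets an RC low-pass corner. A short sketch of both quantities (the formulas are standard physics; the component values are illustrative, not taken from the paper):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def thermal_noise_density(resistance_ohm: float, temp_k: float = 300.0) -> float:
    """Johnson-Nyquist voltage noise spectral density, in V/sqrt(Hz)."""
    return math.sqrt(4 * K_B * temp_k * resistance_ohm)

def bias_cutoff_hz(resistance_ohm: float, capacitance_f: float) -> float:
    """First-order RC low-pass corner frequency set by the bias resistor."""
    return 1.0 / (2 * math.pi * resistance_ohm * capacitance_f)

# Hypothetical 10 GΩ bias resistor with a 30 pF capsule:
r, c = 10e9, 30e-12
print(f"{thermal_noise_density(r) * 1e6:.1f} µV/√Hz")  # → 12.9 µV/√Hz
print(f"{bias_cutoff_hz(r, c):.2f} Hz")                # → 0.53 Hz
```

The numbers show the trade-off the abstract alludes to: a larger resistor pushes the noise corner lower in frequency but raises the broadband noise density, so the two quantities cannot be optimized independently.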
Shuhai Peng, Hui Lu, Jinjiang Liu ... · arXiv
While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation du...
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan ... · ACL 2026
The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the resea...
Jianbo Ma, Richard Cartwright · arXiv
Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of...
Monday, April 20, 2026
Deshui Miao, Yameng Gu, Chao Yang ... · arXiv
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the propose...
Xiang He, Chenxing Li, Jinting Wang ... · arXiv
Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (...
Mason Wang, Cheng-Zhi Anna Huang · arXiv
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking laten...
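The core idea of a latent-space Fourier transform with timescale masking can be sketched in a few lines: take an FFT of a latent sequence along the time axis, zero out the fast-varying bins, and invert. This toy version uses a random array in place of a diffusion-autoencoder latent; the function name, shapes, and frame rate are all hypothetical, not the paper's code.

```python
import numpy as np

def mask_latent_timescales(latents: np.ndarray, keep_below_hz: float,
                           frame_rate: float) -> np.ndarray:
    """Keep only slow-moving (long-timescale) structure in a latent sequence.

    latents: (T, D) array of T latent frames with D channels.
    """
    spec = np.fft.rfft(latents, axis=0)                    # FFT along time
    freqs = np.fft.rfftfreq(latents.shape[0], d=1.0 / frame_rate)
    spec[freqs > keep_below_hz] = 0                        # drop fast timescales
    return np.fft.irfft(spec, n=latents.shape[0], axis=0)  # back to time domain

rng = np.random.default_rng(0)
z = rng.standard_normal((256, 8))  # toy latent: 256 frames, 8 channels
z_slow = mask_latent_timescales(z, keep_below_hz=1.0, frame_rate=50.0)
```

Masking low frequencies instead (or band-passing) would isolate other timescales, which is the kind of frequency-domain control the abstract describes.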
Ho-Lam Chung, Yiming Chen, Hung-yi Lee · arXiv
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model...
Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak · ACL Findings (2026)
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often la...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, lea...
Hao Meng, Siyuan Zheng, Shuran Zhou ... · IEEE ICASSP 2026
Large Language Models (LLMs) show promise in lyric-to-melody generation, but models trained with Supervised Fine-Tuning (SFT) often produce musically implausible melodies with issues like poor rhythm and unsuitable vocal ranges, a phenomenon we term "constraint violation". To add...
HaeJun Yoo, Yongseop Shin, Insung Lee ... · ACL 2026 Main Conference
Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment o...
Sunday, April 19, 2026
Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj ... · arXiv
Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this pape...
Girish, Mohd Mujtaba Akhtar, Muskaan Singh · ACL 2026 (main)
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal...
Yi-Cheng Lin, Yusuke Hirota, Sung-Feng Huang ... · arXiv
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairnes...
Mohd Mujtaba Akhtar, Girish, Muskaan Singh · ACL 2026
In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec based synthetic speech in this work, since neural codec decoding forms a core building block in modern spee...
Saturday, April 18, 2026
Randall Ali, Thomas Dietzen, Matteo Scerbo ... · arXiv
We introduce a new framework for room acoustics modelling based on a state-space model of the boundary integral equation representing the sound field in a room. Whereas state-space models of linear time-invariant systems are traditionally constructed by means of a state vector an...
Yunchong Xiao, Yuxiang Zhao, Ziyang Ma ... · arXiv
The growing reliance on large-scale speech data has made privacy protection a critical concern. However, existing anonymization approaches often degrade data utility, for example by disrupting acoustic continuity or reducing vocal diversity, which compromises the value of speech ...