Audio ML Papers

Week of April 19 - April 26, 2026

Subcategories: All (38) | Speech Synthesis (5) | Music Synthesis (6) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (6) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (18)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Feiyu Zhao, Yiming Chen, Wenhuan Lu ... · ACL 2026
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain...
#2 TOP PAPER (Score: 91)
Chunyu Qiang, Xiaopeng Wang, Kang Yin ... · ACL 2026 main conference
Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic d...
#3 TOP PAPER (Score: 88)
Aoduo Li, Haoran Lv, Shengmin Li ... · ACM ICMR 2026
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge...
Saturday, April 25, 2026
Boxiang Wang, Zhengding Luo, Dongyuan Shi ... · arXiv
Directional Selective Fixed-Filter Active Noise Control (D-SFANC) can effectively attenuate noise from different directions by selecting the suitable pre-trained control filter based on the Direction-of-Arrival (DoA) of the current noise. However, this method is weak at tracking ...
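For readers new to SFANC, the selection step the abstract describes can be sketched in a few lines: given a DoA estimate, pick the pre-trained control filter whose training direction is closest, then filter the reference signal. A minimal sketch with a hypothetical filter bank, not the authors' code:

```python
import numpy as np

# Directional fixed-filter selection, illustrative only: choose the
# pre-trained control filter trained nearest to the estimated noise DoA.
rng = np.random.default_rng(0)
filter_bank_doas = np.array([0.0, 45.0, 90.0, 135.0, 180.0])  # degrees (hypothetical)
filter_bank = rng.standard_normal((5, 256))                   # 5 pre-trained FIR filters

def select_control_filter(doa_estimate_deg: float) -> np.ndarray:
    """Return the pre-trained filter whose training DoA is closest."""
    idx = int(np.argmin(np.abs(filter_bank_doas - doa_estimate_deg)))
    return filter_bank[idx]

reference = rng.standard_normal(16000)               # 1 s of reference noise at 16 kHz
w = select_control_filter(doa_estimate_deg=52.0)
anti_noise = np.convolve(reference, w, mode="same")  # control signal to the loudspeaker
```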
Khalid Zaman, Masashi Unoki · arXiv
Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, ...
Friday, April 24, 2026
Mingchen Shao, Hang Su, Wenjie Tian ... · arXiv
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these...
Haopeng Geng, Longfei Yang, Xi Chen ... · arXiv
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cue...
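For context, the CTC objective the abstract critiques is sequence-level by construction: it marginalizes over all monotonic alignments, so the frame-level timing of a transient error is never directly supervised. A minimal PyTorch sketch with arbitrary shapes:

```python
import torch
import torch.nn as nn

# Standard CTC training step; the loss depends only on the target label
# sequence, not on where in time each phone was realized.
T, N, C, S = 50, 2, 40, 10    # frames, batch, classes (incl. blank), target length
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S))            # index 0 is reserved for blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(float(loss))
```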
Li Li, Ming Cheng, Weixin Zhu ... · arXiv
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potenti...
Maximilian Wachter, Sebastian Murgul, Michael Heizmann · 5th International Conference on SMART MULTIMEDIA (ICSM), 2025
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this wo...
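The quantization task itself is easy to state; here is a toy illustration (not the paper's model) that snaps performed onsets to the nearest subdivision of a known beat grid:

```python
import numpy as np

# Beat-based rhythm quantization, toy version: given beat times, build a
# subdivision grid and move each performed onset to its nearest grid point.
beats = np.arange(0.0, 8.0, 0.5)   # beat times in seconds at 120 BPM
subdivisions = 4                   # quantize to sixteenth notes

grid = np.arange(beats[0], beats[-1] + 1e-9, (beats[1] - beats[0]) / subdivisions)
onsets = np.array([0.02, 0.49, 0.88, 1.27, 1.63])   # performed onsets with jitter
quantized = grid[np.argmin(np.abs(onsets[:, None] - grid[None, :]), axis=1)]
print(quantized)   # [0.    0.5   0.875 1.25  1.625]
```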
Thursday, April 23, 2026
Chengyou Wang, Hongfei Yue, Guojian Li ... · arXiv
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversatio...
Jialong Mai, Xiaofen Xing, Xiangmin Xu · arXiv
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowled...
Noah Jaffe, John Ashley Burgoyne · arXiv
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer i...
Ignasi Sole · arXiv
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic...
Wednesday, April 22, 2026
Menghe Ma, Siqing Wei, Yuecheng Xing ... · arXiv
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to br...
Zhiyuan Ning, Zhanyong Tang, Xiaojiang Chen ... · arXiv
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in ...
Tong Zhao, Chenghao Zhang, Yutao Zhu ... · arXiv
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on imag...
Nan Xu, Shiheng Li, Shengchao Hou · arXiv
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphoni...
Paul A. Bereuter, Alois Sontacchi · DAGA 2026 (Annual German Conference on Acoustics)
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics exhibit low correlation between metrics and perceptual audio quality ratings from a listening test,...
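The BSS-Eval metrics in question are available in mir_eval; a minimal usage sketch on placeholder signals (a real evaluation would use reference stems):

```python
import numpy as np
import mir_eval  # pip install mir_eval

# Classic BSS-Eval usage: compare estimated sources against references.
rng = np.random.default_rng(0)
reference = rng.standard_normal((2, 44100))                # 2 sources, 1 s at 44.1 kHz
estimate = reference + 0.1 * rng.standard_normal((2, 44100))

sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(reference, estimate)
print(sdr)   # signal-to-distortion ratio per source, in dB
```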
Jiaying Meng, Bojie Li · arXiv
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task...
Tuesday, April 21, 2026
Lekai Qian, Haoyu Gu, Jingwei Zhao ... · arXiv
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences...
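The sequence-style tokenization the abstract contrasts against grids and graphs can be illustrated with a toy note-event vocabulary (token names hypothetical):

```python
# Toy sequence tokenization: each note becomes a PITCH token followed by a
# DUR token, a common baseline for feeding symbolic music to language models.
notes = [(60, 1.0), (64, 0.5), (67, 0.5), (72, 2.0)]   # (MIDI pitch, beats)

DUR_BINS = [0.25, 0.5, 1.0, 2.0, 4.0]

def tokenize(notes):
    tokens = []
    for pitch, dur in notes:
        tokens.append(f"PITCH_{pitch}")
        nearest = min(DUR_BINS, key=lambda b: abs(b - dur))
        tokens.append(f"DUR_{nearest}")
    return tokens

print(tokenize(notes))
# ['PITCH_60', 'DUR_1.0', 'PITCH_64', 'DUR_0.5', 'PITCH_67', 'DUR_0.5', 'PITCH_72', 'DUR_2.0']
```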
Hyunjung Joo, GyeongTaek Lee · arXiv
The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous F0 contours to these invariant categories due to variable F0 realizatio...
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan ... · arXiv
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) traini...
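One widely used recipe for unifying offline and streaming modes (shown generically here; the paper's exact training scheme may differ) is to vary the attention chunk size during training, so one encoder serves both full-context and limited-context decoding:

```python
import numpy as np

# Chunk-causal attention mask: each frame attends to everything up to the
# end of its own chunk. Sampling chunk_size per batch trains one model for
# both streaming (small chunks) and offline (chunk_size >= num_frames) use.
def chunk_attention_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """mask[i, j] is True when frame i may attend to frame j."""
    chunk_idx = np.arange(num_frames) // chunk_size
    return chunk_idx[None, :] <= chunk_idx[:, None]

print(chunk_attention_mask(6, 2).astype(int))
```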
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara ... · arXiv
The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cu...
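The tradeoff the abstract describes follows from two standard formulas: the Johnson noise of the bias resistor, e_n = sqrt(4·kB·T·R), and the corner frequency that resistor forms with the capsule capacitance, f_c = 1/(2πRC). A back-of-envelope sketch with illustrative component values:

```python
import math

# Standard formulas, illustrative values: a larger bias resistor raises the
# noise density but lowers the corner below which that noise reaches the gate.
k_B = 1.380649e-23      # Boltzmann constant, J/K
T = 293.15              # room temperature, K
R = 1e9                 # 1 GOhm bias resistor (typical order for ECMs)
C = 30e-12              # 30 pF capsule capacitance (illustrative)

e_n = math.sqrt(4 * k_B * T * R)   # V / sqrt(Hz)
f_c = 1 / (2 * math.pi * R * C)    # Hz

print(f"noise density: {e_n*1e6:.1f} uV/sqrt(Hz), corner: {f_c:.2f} Hz")
```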
Shuhai Peng, Hui Lu, Jinjiang Liu ... · arXiv
While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation du...
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan ... · ACL 2026
The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the resea...
Jianbo Ma, Richard Cartwright · arXiv
Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of...
Monday, April 20, 2026
Deshui Miao, Yameng Gu, Chao Yang ... · arXiv
This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the propose...
Xiang He, Chenxing Li, Jinting Wang ... · arXiv
Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (...
Mason Wang, Cheng-Zhi Anna Huang · arXiv
We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. By masking laten...
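A toy analogue of the masking idea (not the authors' implementation): take an FFT along the time axis of a latent sequence, zero out the fast-varying bins, and invert to keep only long-timescale structure:

```python
import numpy as np

# Frequency-domain masking over latent frames: low temporal-frequency bins
# capture slow musical structure; high bins capture fast local detail.
rng = np.random.default_rng(0)
latents = rng.standard_normal((256, 64))   # (time frames, latent dim)

spectrum = np.fft.rfft(latents, axis=0)    # per-dimension temporal spectrum
cutoff_bin = 8                             # keep only the slowest components
spectrum[cutoff_bin:] = 0.0
slow_latents = np.fft.irfft(spectrum, n=latents.shape[0], axis=0)
# slow_latents would then be decoded back to audio by the autoencoder
```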
Ho-Lam Chung, Yiming Chen, Hung-yi Lee · arXiv
Neural audio codecs are widely used as tokenizers for spoken language models, but they are optimized for waveform reconstruction rather than autoregressive prediction. This mismatch injects acoustically driven uncertainty into the discrete token space and increases language-model...
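The tokenization step at issue, in its simplest single-codebook form (real neural codecs stack residual quantizers): each encoder frame maps to the index of its nearest codebook vector, an objective driven purely by reconstruction rather than by LM predictability:

```python
import numpy as np

# Nearest-neighbor vector quantization: the discrete ids are what a spoken
# language model is later trained to predict autoregressively.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 64))   # 256 codes, 64-dim
frames = rng.standard_normal((50, 64))      # 50 encoder frames

dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)               # discrete token ids fed to the LM
reconstruction = codebook[tokens]           # what the decoder sees
```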
Samuel G. Balter, Ethan Jerzak, Connor T. Jerzak · ACL Findings (2026)
Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often la...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a mainstream paradigm in recent years. Although existing LLM-based ASR models demonstrate impressive performance on public benchmarks, their training remains predominantly data-driven, lea...
Hao Meng, Siyuan Zheng, Shuran Zhou ... · IEEE ICASSP 2026
Large Language Models (LLMs) show promise in lyric-to-melody generation, but models trained with Supervised Fine-Tuning (SFT) often produce musically implausible melodies with issues like poor rhythm and unsuitable vocal ranges, a phenomenon we term "constraint violation". To add...
HaeJun Yoo, Yongseop Shin, Insung Lee ... · ACL 2026 Main Conference
Audio-text retrieval systems based on Contrastive Language-Audio Pretraining (CLAP) achieve strong performance on traditional benchmarks; however, these benchmarks rely on caption-style queries that differ substantially from real-world search behavior, limiting their assessment o...
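The CLAP retrieval setup being benchmarked reduces to cosine similarity in a shared embedding space; a generic sketch with placeholder embeddings:

```python
import numpy as np

# Contrastive retrieval step: rank audio clips by cosine similarity of their
# embeddings to an embedded text query.
rng = np.random.default_rng(0)
audio_emb = rng.standard_normal((1000, 512))   # pre-computed clip embeddings
query_emb = rng.standard_normal(512)           # embedded text query

audio_emb /= np.linalg.norm(audio_emb, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)

scores = audio_emb @ query_emb                 # cosine similarities
top10 = np.argsort(scores)[::-1][:10]          # retrieved clip indices
```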
Sunday, April 19, 2026
Vaibhavi Lokegaonkar, Aryan Vijay Bhosale, Vishnu Raj ... · arXiv
Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this pape...
Girish, Mohd Mujtaba Akhtar, Muskaan Singh · ACL 2026 (main)
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal...
Yi-Cheng Lin, Yusuke Hirota, Sung-Feng Huang ... · arXiv
Large Audio-Language Models (LALMs) are increasingly integrated into daily applications, yet their generative biases remain underexplored. Existing speech fairness benchmarks rely on synthetic speech and Multiple-Choice Questions (MCQs), both offering a fragmented view of fairnes...
Mohd Mujtaba Akhtar, Girish, Muskaan Singh · ACL 2026
In this study, we present Healthcare Codec-Fake Detection (HCFD), a new task for detecting codec-fakes under pathological speech conditions. We intentionally focus on codec based synthetic speech in this work, since neural codec decoding forms a core building block in modern spee...