Audio ML Papers

Last 7 Days (April 21 - April 28, 2026)

Subcategories: All (24) | Speech Synthesis (4) | Music Synthesis (3) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (4) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (10)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Feiyu Zhao, Yiming Chen, Wenhuan Lu ... · ACL 2026
Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain...
#2 TOP PAPER (Score: 91)
Chunyu Qiang, Xiaopeng Wang, Kang Yin ... · ACL 2026 main conference
Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic d...
#3 TOP PAPER (Score: 88)
Aoduo Li, Haoran Lv, Shengmin Li ... · ACM ICMR 2026
High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge...
Friday, April 24, 2026
Mingchen Shao, Hang Su, Wenjie Tian ... · arXiv
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these...
Haopeng Geng, Longfei Yang, Xi Chen ... · arXiv
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cue...
Li Li, Ming Cheng, Weixin Zhu ... · arXiv
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potenti...
Maximilian Wachter, Sebastian Murgul, Michael Heizmann · 5th International Conference on SMART MULTIMEDIA (ICSM), 2025
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this wo...
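The entry above concerns beat-based rhythm quantization. As a rough illustration of the underlying operation (not the paper's method, whose details are not given here), a minimal quantizer snaps performed onset times, expressed in beats, to the nearest grid subdivision:

```python
# Minimal sketch of beat-based rhythm quantization (illustrative only):
# snap onsets, given in beats, to the nearest grid subdivision.

def quantize_onsets(onsets_beats, subdivisions_per_beat=4):
    """Snap each onset (in beats) to the nearest 1/subdivisions_per_beat grid line."""
    step = 1.0 / subdivisions_per_beat
    return [round(t / step) * step for t in onsets_beats]

# Example: slightly uneven performed onsets snapped to a 16th-note grid.
performed = [0.03, 0.48, 1.02, 1.52, 2.24]
print(quantize_onsets(performed))  # -> [0.0, 0.5, 1.0, 1.5, 2.25]
```

Real transcription systems must additionally infer the tempo and meter that define the grid, which is where the learned models come in.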
Thursday, April 23, 2026
Chengyou Wang, Hongfei Yue, Guojian Li ... · arXiv
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversatio...
Jialong Mai, Xiaofen Xing, Xiangmin Xu · arXiv
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowled...
Noah Jaffe, John Ashley Burgoyne · arXiv
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer i...
Ignasi Sole · arXiv
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic...
Wednesday, April 22, 2026
Menghe Ma, Siqing Wei, Yuecheng Xing ... · arXiv
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to br...
Zhiyuan Ning, Zhanyong Tang, Xiaojiang Chen ... · arXiv
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in ...
Tong Zhao, Chenghao Zhang, Yutao Zhu ... · arXiv
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on imag...
Nan Xu, Shiheng Li, Shengchao Hou · arXiv
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphoni...
Paul A. Bereuter, Alois Sontacchi · DAGA 2026 (Annual German Conference on Acoustics)
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics correlate poorly with perceptual audio quality ratings from listening tests,...
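For context on the BSS-Eval family mentioned above: its simplest relative is a plain signal-to-distortion ratio (SDR), which compares a reference source to its estimate. A minimal sketch (plain SDR only; the full BSS-Eval decomposition into interference and artifact terms is more involved):

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: ratio of reference energy to residual (error) energy."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = reference - estimate
    return 10.0 * np.log10(np.sum(reference**2) / (np.sum(error**2) + eps))

# A perfect estimate gives a very high SDR; additive noise lowers it.
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(t.size)
print(f"{sdr_db(clean, noisy):.1f} dB")  # roughly 17 dB at this noise level
```

The weak correlation with listening-test ratings noted above is precisely because such energy ratios weight all errors equally, regardless of audibility.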
Jiaying Meng, Bojie Li · arXiv
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task...
Tuesday, April 21, 2026
Lekai Qian, Haoyu Gu, Jingwei Zhao ... · arXiv
Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences...
Hyunjung Joo, GyeongTaek Lee · arXiv
The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous F0 contours to these invariant categories due to variable F0 realizatio...
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan ... · arXiv
Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) traini...
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara ... · arXiv
The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cu...
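The fundamental limit referenced above is the Johnson–Nyquist (thermal) noise of the gate-bias resistor. A back-of-the-envelope sketch with generic illustrative values (not figures from the paper):

```python
import math

# Johnson-Nyquist thermal noise of a resistor: v_n = sqrt(4 * k_B * T * R * B).
# Values below are generic illustrations, not figures from the paper above.

K_B = 1.380649e-23  # Boltzmann constant, J/K

def thermal_noise_vrms(resistance_ohms, bandwidth_hz, temperature_k=300.0):
    """RMS thermal-noise voltage across a resistor over a given bandwidth."""
    return math.sqrt(4.0 * K_B * temperature_k * resistance_ohms * bandwidth_hz)

# A 10 GOhm gate-bias resistor over the 20 kHz audio band at room temperature:
v = thermal_noise_vrms(10e9, 20e3)
print(f"{v * 1e3:.2f} mV RMS")  # about 1.82 mV RMS
```

In a real ECM the resistor and the capsule capacitance also form a low-pass filter that shapes this noise, which is the trade-off the abstract alludes to.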
Shuhai Peng, Hui Lu, Jinjiang Liu ... · arXiv
While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation du...
Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan ... · ACL 2026
The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CFs). Consequently, CF detection has attracted increasing attention from the resea...
Jianbo Ma, Richard Cartwright · arXiv
Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of...