Audio ML Papers

Last 7 Days (April 22 - April 29, 2026)

Subcategories: All (25) | Speech Synthesis (3) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (4) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (12)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Chunyu Qiang, Xiaopeng Wang, Kang Yin ... · ACL 2026 main conference
Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic d...
#2 TOP PAPER (Score: 88)
Menghe Ma, Siqing Wei, Yuecheng Xing ... · arXiv
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to br...
#3 TOP PAPER (Score: 88)
Mingchen Shao, Hang Su, Wenjie Tian ... · arXiv
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these...
Monday, April 27, 2026
Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li ... · arXiv
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory underst...
Leekyung Kim, Jonghun Park · ICASSP 2026
Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and imbalance, especially in recognizing complex chords such as non-triads, which are unpopular in ...
Wenbin Huang, Yuhang Qiu, Bohan Li ... · arXiv
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to ...
Liang Xu, Diego Caviedes-Nozal, Bastiaan Kleijn ... · arXiv
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of ...
Sunday, April 26, 2026
Tianyidan Xie, Zhentao Huang, Mingjie Wang ... · ICME 2026
Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences -- a challenge that existing end-to-end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framew...
Peize He, Yaodi Luo, Xiaoqian Liu ... · arXiv
Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression metho...
Charles Patrick Martin · International Conference on New Interfaces for Musical Expression (NIME) 2026
Machine generation of symbolic music and digital audio are hot topics, but there have been relatively few digital musical instruments that integrate generative AI. Present musical AI tools are not artist-centred and do not support experimentation or integration into musical instru...
Jun Xue, Zhuolin Yi, Yihuan Huang ... · ACL 2026
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions...
Saturday, April 25, 2026
Boxiang Wang, Zhengding Luo, Dongyuan Shi ... · arXiv
Directional Selective Fixed-Filter Active Noise Control (D-SFANC) can effectively attenuate noise from different directions by selecting the suitable pre-trained control filter based on the Direction-of-Arrival (DoA) of the current noise. However, this method is weak at tracking ...
Khalid Zaman, Masashi Unoki · arXiv
Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, ...
Friday, April 24, 2026
Haopeng Geng, Longfei Yang, Xi Chen ... · arXiv
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cue...
Li Li, Ming Cheng, Weixin Zhu ... · arXiv
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potenti...
Maximilian Wachter, Sebastian Murgul, Michael Heizmann · 5th International Conference on SMART MULTIMEDIA (ICSM), 2025
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this wo...
Thursday, April 23, 2026
Chengyou Wang, Hongfei Yue, Guojian Li ... · arXiv
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversatio...
Jialong Mai, Xiaofen Xing, Xiangmin Xu · arXiv
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowled...
Noah Jaffe, John Ashley Burgoyne · arXiv
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer i...
Ignasi Sole · arXiv
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic...
Wednesday, April 22, 2026
Zhiyuan Ning, Zhanyong Tang, Xiaojiang Chen ... · arXiv
Voiceprints are widely used for authentication; however, they are easily captured in public settings and cannot be revoked once leaked. Existing anonymization systems operate inside recording devices, which makes them ineffective when microphones or software are untrusted, as in ...
Tong Zhao, Chenghao Zhang, Yutao Zhu ... · arXiv
Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval research has predominantly focused on imag...
Nan Xu, Shiheng Li, Shengchao Hou · arXiv
We propose a new approach for the second stage of a practical two-stage Optical Music Recognition (OMR) pipeline. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphoni...
Paul A. Bereuter, Alois Sontacchi · DAGA 2026 (Annual German Conference on Acoustics)
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics exhibit low correlation with perceptual audio quality ratings from a listening test,...
Jiaying Meng, Bojie Li · arXiv
Real-time multimodal agents transport raw audio and screenshots using networking stacks designed for human receivers, which optimize for perceptual fidelity and smooth playout. Yet agent models act as event-driven processors with no inherent sense of physical time, consuming task...