Audio ML Papers

Last 7 Days (April 23 - April 30, 2026)

Subcategories: All (29) | Speech Synthesis (6) | Music Synthesis (3) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (2) | ASR (4) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (14)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Chunyu Qiang, Xiaopeng Wang, Kang Yin ... · ACL 2026 main conference
Generative audio modeling has largely been fragmented into specialized tasks: text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic d...
#2 TOP PAPER (Score: 88)
Mingchen Shao, Hang Su, Wenjie Tian ... · arXiv
While Large Audio Language Models (LALMs) achieve strong performance on short audio, they degrade on long-form inputs. This degradation is more severe in temporal awareness tasks, where temporal alignment becomes increasingly inaccurate as audio duration grows. We attribute these...
#3 TOP PAPER (Score: 84)
Haopeng Geng, Longfei Yang, Xi Chen ... · arXiv
Mispronunciation Detection and Diagnosis (MDD) requires modeling fine-grained acoustic deviations. However, current ASR-derived MDD systems often face inherent limitations. In particular, CTC-based models favor sequence-level alignments that neglect transient mispronunciation cue...
Tuesday, April 28, 2026
Zhaoyan Pan, Hengyang Zhou, Xiangdong Li ... · arXiv
Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, o...
Chunlei Meng, Jiabin Luo, Pengbin Feng ... · arXiv
Multimodal Sentiment Analysis (MSA) requires integrating language, acoustic, and visual signals without sacrificing modality-specific sentiment evidence. Existing methods mainly improve either shared-private decomposition or cross-modal interaction. Although effective, both ultim...
Venkata Pushpak Teja Menta · arXiv
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages,...
Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu ... · arXiv
Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- ...
Xuzheng He, Nan Nan, Zhilin Wang ... · arXiv
Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We prese...
Chun-Yi Kuan, Wei-Ping Huang, Hung-yi Lee · arXiv
Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in ...
Kexue Wang, Yinfeng Yu, Liejun Wang · International Conference on Intelligent Computing 2026
To establish empathy with machines, it is essential to fully understand human emotional changes. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, which means that different people may express emotion...
Venkata Pushpak Teja Menta · arXiv
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or...
Yichen Wang, Charles Patrick Martin · International Conference on New Interfaces for Musical Expression (NIME) 2026
This performance presents a duet between two intelligent musical instruments, Sù (to trace back; to go upstream) and Agentier (playing on agentic clavier), and their human performers, connected through feedback loops. Rather than treating AI as a tool that responds predictably to...
Monday, April 27, 2026
Leonardo Haw-Yang Foo, Chih-Kai Yang, Chen-An Li ... · arXiv
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory underst...
Leekyung Kim, Jonghun Park · ICASSP 2026
Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and imbalance, especially in recognizing complex chords such as non-triads, which are unpopular in ...
Wenbin Huang, Yuhang Qiu, Bohan Li ... · arXiv
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to ...
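For context on the metric this entry critiques: Word Error Rate is the word-level Levenshtein edit distance between a hypothesis transcript and a reference, normalized by the reference length. A minimal illustrative sketch (not code from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER says nothing about the model's confidence in its output, which is precisely the gap the paper above addresses.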
Liang Xu, Diego Caviedes-Nozal, Bastiaan Kleijn ... · arXiv
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of ...
Sunday, April 26, 2026
Tianyidan Xie, Zhentao Huang, Mingjie Wang ... · ICME 2026
Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences -- a challenge that existing end-to-end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framew...
Peize He, Yaodi Luo, Xiaoqian Liu ... · arXiv
Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression metho...
Charles Patrick Martin · International Conference on New Interfaces for Musical Expression (NIME) 2026
Machine generation of symbolic music and digital audio are hot topics, but there have been relatively few digital musical instruments that integrate generative AI. Present musical AI tools are not artist-centred and do not support experimentation or integrating into musical instru...
Jun Xue, Zhuolin Yi, Yihuan Huang ... · ACL 2026
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions...
Saturday, April 25, 2026
Boxiang Wang, Zhengding Luo, Dongyuan Shi ... · arXiv
Directional Selective Fixed-Filter Active Noise Control (D-SFANC) can effectively attenuate noise from different directions by selecting the suitable pre-trained control filter based on the Direction-of-Arrival (DoA) of the current noise. However, this method is weak at tracking ...
Khalid Zaman, Masashi Unoki · arXiv
Human-imitated speech poses a greater challenge than AI-generated speech for both human listeners and automatic detection systems. Unlike AI-generated speech, which often contains artifacts, over-smoothed spectra, or robotic cues, imitated speech is produced naturally by humans, ...
Friday, April 24, 2026
Li Li, Ming Cheng, Weixin Zhu ... · arXiv
Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent Speech-LLM approaches have shown the potenti...
Maximilian Wachter, Sebastian Murgul, Michael Heizmann · 5th International Conference on SMART MULTIMEDIA (ICSM), 2025
Rhythm transcription is a key subtask of notation-level Automatic Music Transcription (AMT). While deep learning models have been extensively used for detecting the metrical grid in audio and MIDI performances, beat-based rhythm quantization remains largely unexplored. In this wo...
Thursday, April 23, 2026
Chengyou Wang, Hongfei Yue, Guojian Li ... · arXiv
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversatio...
Jialong Mai, Xiaofen Xing, Xiangmin Xu · arXiv
Fine-grained local timing control is still absent from modern text-to-speech systems: existing approaches typically provide only utterance-level duration or global speaking-rate control, while precise token-level timing manipulation remains unavailable. To the best of our knowled...
Noah Jaffe, John Ashley Burgoyne · arXiv
This paper introduces PHOTON (PHysical Optical Tracking of Notes), a non-invasive optical sensing system for measuring key-lever motion in historical keyboard instruments. PHOTON tracks the vertical displacement of the key lever itself, capturing motion shaped by both performer i...
Srija Anand, Ashwin Sankar, Ishvinder Sethi ... · arXiv
Crowdsourced pairwise evaluation has emerged as a scalable approach for assessing foundation models. However, applying it to Text-to-Speech (TTS) introduces high variance due to linguistic diversity and the multidimensional nature of speech perception. We present a controlled multidim...
Ignasi Sole · arXiv
Portamento in string performance has been studied primarily as a binary presence-or-absence phenomenon, with existing research measuring frequency of occurrence and, less commonly, duration in milliseconds. This paper introduces a third quantitative descriptor: the spectrographic...