Audio ML Papers

Week of December 14–21, 2025

Subcategories: All (19) | Speech Synthesis (4) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (3) | ASR (0) | Other (8)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 83)
Menglu Li, Majd Alber, Ramtin Asgarianamiri ... · arXiv
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated syn...
#2 TOP PAPER (Score: 83)
Jiayan Cui, Zhihan Yang, Naihan Li ... · arXiv
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only...
#3 TOP PAPER (Score: 83)
Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K · arXiv
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-te...
Saturday, December 20, 2025
Sudip Chakrabarty, Pappu Bishwas, Rajdeep Chatterjee · arXiv
Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this...
Wen Huang, Yuchen Mao, Yanmin Qian · arXiv
Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplore...
Stephen Ni-Hahn, Rico Zhu, Jerry Yin ... · arXiv
Hierarchical representations provide powerful and principled approaches for analyzing many musical genres. Such representations have been broadly studied in music theory, for instance via Schenkerian analysis (SchA). Hierarchical music analyses, however, are highly cost-intensive...
Friday, December 19, 2025
Ioannis Stylianou, Achintya kr. Sarkar, Nauman Dawalatabad ... · arXiv
Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly availa...
Sujal Chondhekar, Vasanth Murukuri, Rushabh Vasani ... · arXiv
Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. However, the effectiveness of these techniques cannot be taken for granted in the case of modern large-scale ASR models trained on diverse, noi...
June Young Yi, Hyeongju Kim, Juheon Lee · arXiv
This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying...
Bowen Shi, Andros Tjandra, John Hoffman ... · arXiv
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or l...
Zhedong Zhang, Liang Li, Gaoxiang Cong ... · AAAI 2026
Movie dubbing seeks to synthesize speech from a given script using a specific voice, while ensuring accurate lip synchronization and emotion-prosody alignment with the character's visual performance. However, existing alignment approaches based on visual features face two key lim...
Thursday, December 18, 2025
Jiajun Yuan, Xiaochen Wang, Yuhang Xiao ... · arXiv
Applying speech super-resolution (SR) to recordings with severely low sampling rates is a critical challenge in digital archiving and investigative audio recovery. In these scenarios, the input lacks essential acoustic cues. Consequently, existing generative models often fail; wi...
Daniel Rika, Nino Sapir, Ido Gus · arXiv
We present DPDFNet, a causal single-channel speech enhancement model that extends the DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. In addition, we demon...
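The dual-path pattern the abstract refers to alternates sequence modeling along the time axis and the frequency-band axis. Below is a minimal sketch of such a block in PyTorch; the tensor layout, layer sizes, and residual/normalization choices are illustrative assumptions, not the authors' implementation. Note that only the time axis needs a unidirectional RNN to keep the block causal; the frequency axis of the current frame can be scanned bidirectionally.

```python
# Hypothetical dual-path block of the kind DPDFNet adds to the encoder.
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Alternates sequence modeling along time (causal) and frequency."""
    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        # Unidirectional LSTM along time preserves causality.
        self.time_rnn = nn.LSTM(channels, hidden, batch_first=True)
        self.time_proj = nn.Linear(hidden, channels)
        # Bidirectional LSTM along frequency: looking across all bands of
        # the current frame does not peek into the future.
        self.freq_rnn = nn.LSTM(channels, hidden, batch_first=True,
                                bidirectional=True)
        self.freq_proj = nn.Linear(2 * hidden, channels)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, bands, channels)
        b, t, f, c = x.shape
        # Intra-path: long-range temporal modeling, one sequence per band.
        y = x.permute(0, 2, 1, 3).reshape(b * f, t, c)
        y, _ = self.time_rnn(y)
        y = self.time_proj(y).reshape(b, f, t, c).permute(0, 2, 1, 3)
        x = self.norm1(x + y)                      # residual connection
        # Cross-path: cross-band modeling, one sequence per frame.
        z = x.reshape(b * t, f, c)
        z, _ = self.freq_rnn(z)
        z = self.freq_proj(z).reshape(b, t, f, c)
        return self.norm2(x + z)

x = torch.randn(2, 100, 32, 16)    # (batch, frames, bands, channels)
print(DualPathBlock(16)(x).shape)  # torch.Size([2, 100, 32, 16])
```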
Tuesday, December 16, 2025
Jiayan Cui, Zhihan Yang, Naihan Li ... · arXiv
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only...
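The two-stage design described here (autoregressive text-to-token, then diffusion token-to-waveform) is worth spelling out. The following is a hedged sketch of inference through such a pipeline; every class, method, and constant (ar_model.step, diffusion_model.denoise_step, the 640 samples-per-token ratio) is a hypothetical stand-in, not GLM-TTS's released API.

```python
# Sketch of a two-stage TTS pipeline: AR token LM, then diffusion decoder.
import torch

@torch.no_grad()
def synthesize(text_ids: torch.Tensor,
               ar_model,          # stage 1: text -> discrete speech tokens
               diffusion_model,   # stage 2: speech tokens -> waveform
               max_tokens: int = 1000,
               eos_id: int = 0) -> torch.Tensor:
    # Stage 1: sample speech tokens autoregressively from the text prompt.
    tokens = []
    state = ar_model.init_state(text_ids)          # hypothetical API
    for _ in range(max_tokens):
        logits, state = ar_model.step(state)       # hypothetical API
        next_tok = torch.multinomial(logits.softmax(-1), 1).item()
        if next_tok == eos_id:
            break
        tokens.append(next_tok)
    token_seq = torch.tensor(tokens)

    # Stage 2: iteratively denoise a waveform conditioned on the tokens.
    wav = torch.randn(1, len(tokens) * 640)        # assumed token hop size
    for t in reversed(range(diffusion_model.num_steps)):
        wav = diffusion_model.denoise_step(wav, token_seq, t)
    return wav
```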
Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K · arXiv
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-te...
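The efficiency claim is easy to see concretely: once variable-length segments are embedded into fixed-size vectors, spoken term detection reduces to vector similarity search rather than a per-pair alignment such as DTW. A toy sketch follows; the mean-pooling "encoder" is a placeholder for any trained AWE model, and the threshold is arbitrary.

```python
# Toy illustration of AWE-based spoken term detection via cosine similarity.
import numpy as np

def embed(segment: np.ndarray) -> np.ndarray:
    """Placeholder AWE encoder: (frames, feat_dim) -> unit-norm vector."""
    v = segment.mean(axis=0)             # stand-in for a learned encoder
    return v / (np.linalg.norm(v) + 1e-8)

def spoken_term_detection(query, segments, threshold=0.8):
    q = embed(query)
    # The index is precomputable; one matrix-vector product scores
    # every stored segment at once.
    index = np.stack([embed(s) for s in segments])
    scores = index @ q                   # cosine similarity (unit norms)
    return [i for i, s in enumerate(scores) if s >= threshold]

rng = np.random.default_rng(0)
segs = [rng.standard_normal((rng.integers(20, 80), 13)) for _ in range(5)]
print(spoken_term_detection(segs[2], segs))  # segment 2 matches itself
```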
Qilin Li, C. L. Philip Chen, Tong Zhang · IEEE Transactions on Affective Computing
Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 ins...
Advait Gosai, Tyler Vuong, Utkarsh Tyagi ... · arXiv
End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn t...
Monday, December 15, 2025
Menglu Li, Majd Alber, Ramtin Asgarianamiri ... · arXiv
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated syn...
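Because the manipulated region can be a small fraction of the utterance, how frame scores are pooled matters: averaging over all frames washes a short attack out, while taking the maximum over a smoothed window preserves it. A small synthetic illustration (the per-frame probabilities below are fabricated for the example; nothing here reproduces the paper's model or data):

```python
# Pooling per-frame fake probabilities for partially spoofed speech.
import numpy as np

def utterance_score(frame_probs: np.ndarray, win: int = 5) -> float:
    """frame_probs: per-frame P(fake). Smooth, then take the worst window."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(frame_probs, kernel, mode="valid")
    return float(smoothed.max())   # one bad region suffices to flag

# 200 mostly genuine frames with a 15-frame manipulated region spliced in.
probs = np.full(200, 0.05)
probs[120:135] = 0.9
print(round(probs.mean(), 2))            # 0.11: mean pooling misses it
print(round(utterance_score(probs), 2))  # 0.9: max-over-window flags it
```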
Tao Li, Wenshuo Ge, Zhichao Wang ... · arXiv
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this ...