Audio ML Papers

Last 7 Days (December 12 - December 19, 2025)

Subcategories: All (10) | Speech Synthesis (2) | Music Synthesis (2) | Ambient Synthesis (1) | Quality Assessment (0) | Enhancement (0) | ASR (1) | Other (4)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 83)
Takafumi Moriya, Masato Mimura, Tomohiro Tanaka ... · ASRU 2025
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and ...
#2 TOP PAPER (Score: 83)
Menglu Li, Majd Alber, Ramtin Asgarianamiri ... · arXiv
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated syn...
#3 TOP PAPER (Score: 83)
Jiayan Cui, Zhihan Yang, Naihan Li ... · arXiv
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only...
Tuesday, December 16, 2025
Jiayan Cui, Zhihan Yang, Naihan Li ... · arXiv
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only...
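A minimal sketch of the two-stage pipeline the abstract describes: an autoregressive text-to-token model followed by a token-to-waveform stage. All module names, layer sizes, and the small feed-forward stand-in for the diffusion decoder are illustrative assumptions, not GLM-TTS internals.

import torch
import torch.nn as nn

class TextToToken(nn.Module):
    # Stage 1: autoregressively predict the next discrete speech token from text and token history.
    def __init__(self, text_vocab=256, audio_vocab=1024, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.audio_emb = nn.Embedding(audio_vocab, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, audio_vocab)

    def forward(self, text_ids, audio_ids):
        memory = self.text_emb(text_ids)                             # (B, T_text, D) text conditioning
        tgt = self.audio_emb(audio_ids)                              # (B, T_audio, D) token history
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.head(self.decoder(tgt, memory, tgt_mask=mask))   # logits over speech tokens

class TokenToWave(nn.Module):
    # Stage 2 stand-in: map speech tokens to waveform frames; a real system would use a
    # diffusion or vocoder model here.
    def __init__(self, audio_vocab=1024, d_model=256, hop=320):
        super().__init__()
        self.emb = nn.Embedding(audio_vocab, d_model)
        self.net = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, hop))

    def forward(self, audio_ids):
        return self.net(self.emb(audio_ids)).flatten(1)              # (B, T_audio * hop) samples

text = torch.randint(0, 256, (1, 12))
tokens = torch.randint(0, 1024, (1, 50))
logits = TextToToken()(text, tokens)    # stage 1 is trained with next-token cross-entropy
wave = TokenToWave()(tokens)            # stage 2 is trained to reconstruct audio from tokens
print(logits.shape, wave.shape)
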
Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K · arXiv
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-te...
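As a sketch of how fixed-size acoustic word embeddings support retrieval, the snippet below scores candidate word segments against a spoken query by cosine similarity; the mean-pooling encoder and the 0.5 threshold are placeholder assumptions, not the paper's multimodal model.

import torch
import torch.nn.functional as F

def embed(features):
    # Collapse a variable-length (T, D) feature sequence into one unit-norm vector.
    return F.normalize(features.mean(dim=0), dim=0)

# Spoken query and a pool of candidate word segments (random stand-ins for real features).
query = embed(torch.randn(40, 80))
segments = [torch.randn(torch.randint(20, 60, (1,)).item(), 80) for _ in range(5)]
pool = torch.stack([embed(s) for s in segments])   # (N, 80): one embedding per segment

scores = pool @ query                              # cosine similarity (vectors are unit-norm)
hits = (scores > 0.5).nonzero(as_tuple=True)[0]    # illustrative detection threshold
print(scores, hits)
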
Qilin Li, C. L. Philip Chen, Tong Zhang · arXiv
Music Emotion Recognition (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 ins...
Qilin Li, C. L. Philip Chen, Tong Zhang · IEEE Transactions on Affective Computing
Music Emotion Recognition (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 ins...
Advait Gosai, Tyler Vuong, Utkarsh Tyagi ... · arXiv
End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn t...
Monday, December 15, 2025
Menglu Li, Majd Alber, Ramtin Asgarianamiri ... · arXiv
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated syn...
Tao Li, Wengshuo Ge, Zhichao Wang ... · arXiv
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this ...
Friday, December 12, 2025
Takafumi Moriya, Masato Mimura, Tomohiro Tanaka ... · ASRU 2025
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and ...
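A minimal sketch of the shared-encoder idea: one acoustic encoder feeding both a CTC head and an attention-based decoder (a Transducer joint network could attach to the same encoder in the same way). Layer counts and vocabulary sizes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class SharedEncoderASR(nn.Module):
    def __init__(self, feat_dim=80, vocab=500, d_model=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.ctc_head = nn.Linear(d_model, vocab + 1)          # +1 output for the CTC blank
        self.token_emb = nn.Embedding(vocab, d_model)
        self.aed_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.aed_head = nn.Linear(d_model, vocab)

    def forward(self, feats, prev_tokens):
        enc = self.encoder(self.proj(feats))                   # shared acoustic representation
        ctc_logits = self.ctc_head(enc)                        # per-frame logits for the CTC loss
        tgt = self.token_emb(prev_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        aed_logits = self.aed_head(self.aed_decoder(tgt, enc, tgt_mask=mask))
        return ctc_logits, aed_logits

model = SharedEncoderASR()
ctc_logits, aed_logits = model(torch.randn(2, 120, 80), torch.randint(0, 500, (2, 15)))
print(ctc_logits.shape, aed_logits.shape)                      # (2, 120, 501), (2, 15, 500)
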
Longshen Ou, Ye Wang · arXiv
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by...
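A small illustration of the sequence-length problem the abstract points to: with note-attribute tokenization, every note expands into several tokens, so a full song quickly exceeds typical context windows. The attribute set and counts below are illustrative assumptions, not the paper's tokenizer.

# Each note expands into one token per attribute (attribute set is illustrative).
ATTRS = ["bar", "position", "pitch", "duration", "velocity"]

def note_attribute_tokens(notes):
    # Flatten a list of note dicts into one token string per attribute.
    return [f"{a}={n[a]}" for n in notes for a in ATTRS]

# A full song easily contains a few thousand notes across tracks.
song = [{"bar": i // 8, "position": i % 8, "pitch": 60 + (i % 12),
         "duration": 4, "velocity": 80} for i in range(3000)]
tokens = note_attribute_tokens(song)
print(len(tokens))   # 15000 tokens for 3000 notes: full songs blow past typical context lengths
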