Audio ML Papers

Last 7 Days (December 11 - December 18, 2025)

Subcategories: All (12) | Speech Synthesis (2) | Music Synthesis (3) | Ambient Synthesis (0) | Quality Assessment (1) | Enhancement (0) | ASR (1) | Other (5)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 83)
Zitong Lan, Yiwei Tang, Yuhan Wang ... · arXiv
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobil...
#2 TOP PAPER (Score: 83)
Tianyu Guo, Hongyu Chen, Hao Liang ... · arXiv
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth ca...
#3 TOP PAPER (Score: 83)
Takafumi Moriya, Masato Mimura, Tomohiro Tanaka ... · ASRU 2025
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and ...
Tuesday, December 16, 2025
Jiayan Cui, Zhihan Yang, Naihan Li ... · arXiv
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only...
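A minimal sketch of the two-stage layout described above: an autoregressive text-to-token stage followed by a token-to-waveform stage. Module choices, sizes, and the simple stand-in waveform decoder are illustrative assumptions, not GLM-TTS internals (which use a diffusion model for the second stage).

    # Two-stage TTS skeleton: text -> discrete acoustic tokens -> waveform.
    # All sizes and modules are illustrative, not the GLM-TTS architecture.
    import torch
    import torch.nn as nn

    class TextToToken(nn.Module):
        """Stage 1: autoregressively predict discrete acoustic tokens from text ids."""
        def __init__(self, text_vocab=256, audio_vocab=1024, dim=256):
            super().__init__()
            self.text_emb = nn.Embedding(text_vocab, dim)
            self.audio_emb = nn.Embedding(audio_vocab, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.head = nn.Linear(dim, audio_vocab)

        @torch.no_grad()
        def generate(self, text_ids, max_len=64):
            ctx = self.text_emb(text_ids).mean(dim=1, keepdim=True)    # crude text conditioning
            prev = torch.zeros(text_ids.size(0), 1, dtype=torch.long)  # "BOS" acoustic token
            hidden, out = None, []
            for _ in range(max_len):
                y, hidden = self.rnn(self.audio_emb(prev) + ctx, hidden)
                prev = self.head(y[:, -1]).argmax(dim=-1, keepdim=True)
                out.append(prev)
            return torch.cat(out, dim=1)        # (batch, max_len) acoustic token ids

    class TokenToWaveform(nn.Module):
        """Stage 2 stand-in: map acoustic tokens to waveform samples.
        A production system would use a diffusion (or similar) decoder here."""
        def __init__(self, audio_vocab=1024, dim=256, upsample=320):
            super().__init__()
            self.emb = nn.Embedding(audio_vocab, dim)
            self.net = nn.Sequential(nn.Linear(dim, upsample), nn.Tanh())

        def forward(self, tokens):
            return self.net(self.emb(tokens)).flatten(1)   # (batch, len * upsample)

    text_ids = torch.randint(0, 256, (1, 12))   # toy "phoneme" ids
    tokens = TextToToken().generate(text_ids)   # stage 1: text -> acoustic tokens
    waveform = TokenToWaveform()(tokens)        # stage 2: tokens -> waveform
    print(tokens.shape, waveform.shape)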
Ramesh Gundluru, Shubham Gupta, Sri Rama Murty K · arXiv
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-te...
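For context on how such embeddings are used downstream: spoken term detection with AWEs typically reduces to nearest-neighbor search over fixed-dimensional vectors. The sketch below uses random placeholder embeddings in place of a trained AWE encoder and is a generic retrieval loop, not the authors' method.

    # Generic spoken term detection with acoustic word embeddings (AWEs):
    # embed the query and all candidate segments, then rank by cosine similarity.
    # Embeddings here are random placeholders for an actual AWE encoder.
    import numpy as np

    rng = np.random.default_rng(0)
    dim = 128
    query_embedding = rng.normal(size=dim)              # AWE of the spoken query
    segment_embeddings = rng.normal(size=(1000, dim))   # AWEs of searchable segments

    def cosine_scores(query, candidates):
        q = query / np.linalg.norm(query)
        c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
        return c @ q

    scores = cosine_scores(query_embedding, segment_embeddings)
    top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 best-matching segments
    print(top_k, scores[top_k])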
Qilin Li, C. L. Philip Chen, Tong Zhang · arXiv
Music Emotion Recognition (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 ins...
Monday, December 15, 2025
Menglu Li, Majd Alber, Ramtin Asgarianamiri ... · arXiv
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated syn...
Tao Li, Wengshuo Ge, Zhichao Wang ... · arXiv
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this ...
Friday, December 12, 2025
Takafumi Moriya, Masato Mimura, Tomohiro Tanaka ... · ASRU 2025
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and ...
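The shared-encoder idea can be illustrated with a toy model in which one encoder output feeds a CTC head; an attention decoder or a Transducer joint network would attach to the same encoder states. The layer sizes, and the fact that only the CTC branch is written out, are simplifications for brevity, not the paper's configuration.

    # Toy "shared encoder, multiple ASR heads" skeleton.
    # Only a CTC head is shown; an AED decoder or Transducer joint network
    # would consume the same encoder output. Sizes are illustrative.
    import torch
    import torch.nn as nn

    class SharedEncoder(nn.Module):
        def __init__(self, feat_dim=80, dim=256):
            super().__init__()
            self.proj = nn.Linear(feat_dim, dim)
            self.layers = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
                num_layers=2,
            )

        def forward(self, feats):                  # feats: (batch, time, feat_dim)
            return self.layers(self.proj(feats))

    vocab = 100                                    # includes blank id 0 for CTC
    encoder = SharedEncoder()
    ctc_head = nn.Linear(256, vocab)

    feats = torch.randn(2, 120, 80)                # two utterances of 120 frames
    targets = torch.randint(1, vocab, (2, 20))
    enc = encoder(feats)                           # (2, 120, 256), shared by all heads
    log_probs = ctc_head(enc).log_softmax(-1).transpose(0, 1)  # (time, batch, vocab)
    loss = nn.CTCLoss(blank=0)(
        log_probs,
        targets,
        input_lengths=torch.full((2,), 120),
        target_lengths=torch.full((2,), 20),
    )
    print(loss.item())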
Longshen Ou, Ye Wang · arXiv
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by...
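To make the sequence-length issue concrete: note-attribute tokenization expands every note into several tokens, so full songs quickly exceed typical context windows. The token scheme and song-size numbers below are generic, assumed figures, not values from this report.

    # Why note-attribute tokenization blows up sequence length: each note expands
    # into several tokens (bar, position, pitch, duration, velocity, ...).
    # The scheme and song-size figures are rough assumptions, not from the paper.
    notes = [
        {"bar": 0, "position": 0, "pitch": 60, "duration": 4, "velocity": 80},
        {"bar": 0, "position": 4, "pitch": 64, "duration": 4, "velocity": 80},
    ]
    tokens = []
    for n in notes:
        tokens += [f"Bar_{n['bar']}", f"Pos_{n['position']}", f"Pitch_{n['pitch']}",
                   f"Dur_{n['duration']}", f"Vel_{n['velocity']}"]
    print(len(tokens), "tokens for", len(notes), "notes")

    notes_per_song = 3000    # assumed note count for a full multi-track song
    tokens_per_note = 5
    print("~", notes_per_song * tokens_per_note, "tokens per song")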
Thursday, December 11, 2025
Tianyu Guo, Hongyu Chen, Hao Liang ... · arXiv
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth ca...
Zitong Lan, Yiwei Tang, Yuhan Wang ... · arXiv
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobil...
Alon Ziv, Sanyuan Chen, Andros Tjandra ... · arXiv
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation mod...
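For readers unfamiliar with the DPO part of the name: standard Direct Preference Optimization scores each (preferred, rejected) pair under the trained policy and a frozen reference model, as sketched below. How those log-likelihoods are obtained from a flow-matching music generator is the paper's contribution and is not reproduced here; the tensors are placeholders.

    # Standard DPO preference loss on (preferred, rejected) pairs.
    # The four log-likelihood tensors are placeholders; computing them for a
    # flow-matching music model is the part specific to MR-FlowDPO.
    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Margin between how much the policy (vs. the frozen reference) prefers
        # the chosen sample over the rejected one.
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        return -F.logsigmoid(beta * margin).mean()

    batch = 4
    loss = dpo_loss(
        logp_chosen=torch.randn(batch),
        logp_rejected=torch.randn(batch),
        ref_logp_chosen=torch.randn(batch),
        ref_logp_rejected=torch.randn(batch),
    )
    print(loss.item())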
Lucas Dunker, Sai Akshay Menta, Snigdha Mohana Addepalli ... · arXiv
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, ...
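Overconfidence of this kind is commonly quantified with a calibration metric such as expected calibration error (ECE). A generic binned ECE computation is sketched below with synthetic confidences and correctness labels; it illustrates the measurement, not the paper's specific remedy.

    # Generic expected calibration error (ECE): bin predictions by confidence and
    # compare each bin's average confidence with its empirical accuracy.
    # Confidences and correctness flags below are synthetic placeholders.
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        bins = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(bins[:-1], bins[1:]):
            mask = (confidences > lo) & (confidences <= hi)
            if mask.any():
                gap = abs(confidences[mask].mean() - correct[mask].mean())
                ece += mask.mean() * gap     # weight each bin by its share of samples
        return ece

    rng = np.random.default_rng(0)
    conf = rng.uniform(0.5, 1.0, size=2000)            # model confidence per caption
    correct = (rng.uniform(size=2000) < conf - 0.2)    # overconfident: accuracy lags confidence
    print(round(expected_calibration_error(conf, correct.astype(float)), 3))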