Audio ML Papers

Week of December 07 - December 14, 2025

Subcategories: All (17) | Speech Synthesis (1) | Music Synthesis (3) | Ambient Synthesis (2) | Quality Assessment (2) | Enhancement (2) | ASR (1) | Other (6)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 92)
Zihao Wang, Ruibin Yuan, Ziqi Geng ... · Proceedings of the 33rd ACM International Conference on Multimedia (ACMMM 2025), Pages 12227-12236
Automated singing assessment is crucial for education and entertainment. However, existing systems face two fundamental limitations: reliance on reference tracks, which stifles creative expression, and the simplification of complex performances into non-diagnostic scores based so...
#2 TOP PAPER (Score: 83)
Shaoying Wang, Hansong Zhou, Yukun Yuan ... · arXiv
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attacke...
#3 TOP PAPER (Score: 83)
Junyi Peng, Lin Zhang, Jin Li ... · arXiv
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a r...
Friday, December 12, 2025
Takafumi Moriya, Masato Mimura, Tomohiro Tanaka ... · ASRU 2025
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and ...
Longshen Ou, Ye Wang · arXiv
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by...
Thursday, December 11, 2025
Tianyu Guo, Hongyu Chen, Hao Liang ... · arXiv
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth ca...
Zitong Lan, Yiwei Tang, Yuhan Wang ... · arXiv
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobil...
Alon Ziv, Sanyuan Chen, Andros Tjandra ... · arXiv
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation mod...
Lucas Dunker, Sai Akshay Menta, Snigdha Mohana Addepalli ... · arXiv
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, ...
Wednesday, December 10, 2025
Maris Basha, Anja Zai, Sabine Stoll ... · arXiv
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim...
Shaoying Wang, Hansong Zhou, Yukun Yuan ... · arXiv
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attacke...
Kang Yin, Chunyu Qiang, Sirui Zhao ... · arXiv
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with ex...
Tuesday, December 09, 2025
Junyi Peng, Lin Zhang, Jin Li ... · arXiv
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a r...
Mahathir Monjur, Shahriar Nirjon · arXiv
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully con...
Zhuohang Han, Jincheng Dai, Shengshi Yao ... · arXiv
Real-time speech communication over wireless networks remains challenging, as conventional channel protection mechanisms cannot effectively counter packet loss under stringent bandwidth and latency constraints. Semantic communication has emerged as a promising paradigm for enhanc...
Monday, December 08, 2025
Georgios Ioannides, Christos Constantinou, Aman Chadha ... · UniReps: Unifying Representations in Neural Models (NeurIPS 2025 Workshop)
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via mask...
Xueping Zhang, Zhenshan Zhang, Yechen Wang ... · arXiv
Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoo...
Runwu Shi, Chang Li, Jiang Wang ... · The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in real-world scenarios is often difficult. T...
Sunday, December 07, 2025
Zihao Wang, Ruibin Yuan, Ziqi Geng ... · Proceedings of the 33rd ACM International Conference on Multimedia (ACMMM 2025), Pages 12227-12236
Automated singing assessment is crucial for education and entertainment. However, existing systems face two fundamental limitations: reliance on reference tracks, which stifles creative expression, and the simplification of complex performances into non-diagnostic scores based so...