Audio ML Papers

Last 7 Days (December 08 - December 15, 2025)

Subcategories: All (13) | Speech Synthesis (1) | Music Synthesis (1) | Ambient Synthesis (2) | Quality Assessment (2) | Enhancement (2) | ASR (0) | Other (5)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Shaoying Wang, Hansong Zhou, Yukun Yuan ... · arXiv
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attacke...
#2 TOP PAPER (Score: 83)
Junyi Peng, Lin Zhang, Jin Li ... · arXiv
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a r...
#3 TOP PAPER (Score: 83)
Mahathir Monjur, Shahriar Nirjon · arXiv
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully con...
Thursday, December 11, 2025
Tianyu Guo, Hongyu Chen, Hao Liang ... · arXiv
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth ca...
Zitong Lan, Yiwei Tang, Yuhan Wang ... · arXiv
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobil...
Alon Ziv, Sanyuan Chen, Andros Tjandra ... · arXiv
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation mod...
Lucas Dunker, Sai Akshay Menta, Snigdha Mohana Addepalli ... · arXiv
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, ...
Wednesday, December 10, 2025
Maris Basha, Anja Zai, Sabine Stoll ... · arXiv
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim...
Shaoying Wang, Hansong Zhou, Yukun Yuan ... · arXiv
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attacke...
Kang Yin, Chunyu Qiang, Sirui Zhao ... · arXiv
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with ex...
Tuesday, December 09, 2025
Junyi Peng, Lin Zhang, Jin Li ... · arXiv
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a r...
Mahathir Monjur, Shahriar Nirjon · arXiv
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully con...
Zhuohang Han, Jincheng Dai, Shengshi Yao ... · arXiv
Real-time speech communication over wireless networks remains challenging, as conventional channel protection mechanisms cannot effectively counter packet loss under stringent bandwidth and latency constraints. Semantic communication has emerged as a promising paradigm for enhanc...
Monday, December 08, 2025
Georgios Ioannides, Christos Constantinou, Aman Chadha ... · UniReps: Unifying Representations in Neural Models (NeurIPS 2025 Workshop)
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via mask...
Xueping Zhang, Zhenshan Zhang, Yechen Wang ... · arXiv
Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoo...
Runwu Shi, Chang Li, Jiang Wang ... · The 40th Annual AAAI Conference on Artificial Intelligence (AAAI 2026)
Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in real-world scenarios is often difficult. T...