Audio ML Papers

Last 7 Days (December 09 - December 16, 2025)

Subcategories: All (12) | Speech Synthesis (1) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Assessment (2) | Enhancement (1) | ASR (1) | Other (3)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Shaoying Wang, Hansong Zhou, Yukun Yuan ... · arXiv
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attacke...
#2 TOP PAPER (Score: 83)
Junyi Peng, Lin Zhang, Jin Li ... · arXiv
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a r...
#3 TOP PAPER (Score: 83)
Mahathir Monjur, Shahriar Nirjon · arXiv
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully con...
Friday, December 12, 2025
Takafumi Moriya, Masato Mimura, Tomohiro Tanaka ... · ASRU 2025
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and ...
Longshen Ou, Ye Wang · arXiv
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by...
Thursday, December 11, 2025
Tianyu Guo, Hongyu Chen, Hao Liang ... · arXiv
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth ca...
Zitong Lan, Yiwei Tang, Yuhan Wang ... · arXiv
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobil...
Alon Ziv, Sanyuan Chen, Andros Tjandra ... · arXiv
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation mod...
Lucas Dunker, Sai Akshay Menta, Snigdha Mohana Addepalli ... · arXiv
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, ...
Wednesday, December 10, 2025
Maris Basha, Anja Zai, Sabine Stoll ... · arXiv
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim...
Shaoying Wang, Hansong Zhou, Yukun Yuan ... · arXiv
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attacke...
Kang Yin, Chunyu Qiang, Sirui Zhao ... · arXiv
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with ex...
Tuesday, December 09, 2025
Junyi Peng, Lin Zhang, Jin Li ... · arXiv
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a r...
Mahathir Monjur, Shahriar Nirjon · arXiv
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully con...
Zhuohang Han, Jincheng Dai, Shengshi Yao ... · arXiv
Real-time speech communication over wireless networks remains challenging, as conventional channel protection mechanisms cannot effectively counter packet loss under stringent bandwidth and latency constraints. Semantic communication has emerged as a promising paradigm for enhanc...