Audio ML Papers

Week of November 30 - December 07, 2025

Subcategories: All (15) | Speech Synthesis (4) | Music Synthesis (1) | Ambient Synthesis (2) | Quality Assessment (1) | Enhancement (1) | ASR (0) | Other (6)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Pengfei Sun, Wenyu Jiang, Paul Devos ... · IEEE Transactions on Audio, Speech and Language Processing, 2025
Advanced deep learning architectures, particularly recurrent neural networks (RNNs), have been widely applied in audio, bioacoustic, and biomedical signal analysis, especially in data-scarce environments. While gated RNNs remain effective, they can be relatively over-parameterise...
#2 TOP PAPER (Score: 83)
Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour ... · arXiv
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alig...
#3 TOP PAPER (Score: 83)
Junjie Zheng, Chunbo Hao, Guobin Ma ... · arXiv
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we pr...
Saturday, December 06, 2025
Xiao Zhan, Guangzhi Sun, Jose Such ... · arXiv
Large audio language models (LALMs) are increasingly deployed in real-world settings where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences largely overlook. We introduce SH-Bench, the first benchmark de...
Yash Choudhary, Preeti Rao, Pushpak Bhattacharyya · arXiv
Predicting a song's commercial success prior to its release remains an open and critical research challenge for the music industry. Early prediction of music popularity informs strategic decisions, creative planning, and marketing. Existing methods suffer from four limitations: (i...
Thursday, December 04, 2025
Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghidour ... · arXiv
Pre-trained audio models excel at detecting acoustic patterns in auscultation sounds but often fail to grasp their clinical significance, limiting their use and performance in diagnostic tasks. To bridge this gap, we introduce AcuLa (Audio-Clinical Understanding via Language Alig...
Cong Wang, Changfeng Gao, Yang Xiang ... · arXiv
Differentiable reinforcement learning (RL) frameworks like DiffRO offer a powerful approach for controllable text-to-speech (TTS), but are vulnerable to reward hacking, particularly for nuanced tasks like emotion control. The policy model can exploit a vanilla Reward Model (RM) b...
Junjie Zheng, Chunbo Hao, Guobin Ma ... · arXiv
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment and manually annotated melody contours, requirements that are resource-intensive and hinder scalability. To overcome these limitations, we pr...
Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng ... · arXiv
Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingM...
Ziling Huang · arXiv
In our recent work, we proposed Lightweight Speech Enhancement Guided Target Speech Extraction (LGTSE) and demonstrated its effectiveness in multi-speaker-plus-noise scenarios. However, real-world applications often involve more diverse and complex conditions, such as one-speaker...
Xiaopeng Wang, Chunyu Qiang, Ruibo Fu ... · arXiv
Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, constraining naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness ...
Wenzhang Du · arXiv
Subjective mean opinion scores (MOS) remain the de-facto target for non-intrusive speech and singing quality assessment. However, MOS is a scalar that collapses heterogeneous user expectations, ignores service-level objectives, and is difficult to compare across deployment graphs...
Wednesday, December 03, 2025
Kohei Yamamoto, Kosuke Okusa · arXiv
Transformer-based audio SSL (self-supervised learning) models often treat spectrograms as images, applying convolutional patchification with heavy temporal downsampling. This lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering removes ta...
Tuesday, December 02, 2025
Mohan Shi, Natarajan Balaji Shankar, Kaiyuan Zhang ... · arXiv
Discrete speech tokens have gained attention for their storage efficiency and integration with Large Language Models (LLMs). They are commonly categorized into acoustic and semantic tokens, with the latter being more advantageous for Automatic Speech Recognition (ASR). Traditiona...
Lixing He, Yunqi Guo, Haozheng Hou ... · arXiv
Earables, such as True Wireless Stereo earphones and VR/AR headsets, are increasingly popular, yet their compact design poses challenges for robust voice-related applications like telecommunication and voice assistant interactions in noisy environments. Existing speech enhancemen...
Hong-Jie You, Jie-Jing Shao, Xiao-Wen Yang ... · arXiv
Existing methods for expressive music performance rendering rely on supervised learning over small labeled datasets, which limits scaling of both data volume and model size, despite the availability of vast unlabeled music, as in vision and language. To address this gap, we intro...
Monday, December 01, 2025
Pengfei Sun, Wenyu Jiang, Paul Devos ... · IEEE Transactions on Audio, Speech and Language Processing, 2025
Advanced deep learning architectures, particularly recurrent neural networks (RNNs), have been widely applied in audio, bioacoustic, and biomedical signal analysis, especially in data-scarce environments. While gated RNNs remain effective, they can be relatively over-parameterise...
Tal Shuster, Eliya Nachmani · arXiv
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geomet...
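Of the quantization techniques this abstract names, Finite Scalar Quantization (FSQ) is the simplest to illustrate: each latent dimension is bounded and snapped to a small fixed set of levels, so the codebook is an implicit grid rather than a learned table. The sketch below is a minimal NumPy illustration of that idea only; the function name and the choice of 8 levels per dimension are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fsq_quantize(z, levels=8):
    """Minimal FSQ sketch: bound each latent dimension to (-1, 1) with tanh,
    then round it to the nearest of `levels` evenly spaced values.
    The implicit codebook size is levels ** dim, with no learned entries."""
    z = np.tanh(z)                    # squash each dimension into (-1, 1)
    half = (levels - 1) / 2.0
    return np.round(z * half) / half  # snap to the nearest grid value

# Example: a 3-dim latent vector collapses onto the 8-level-per-dim grid.
codes = fsq_quantize(np.array([0.3, -1.7, 0.05]))
```

In training, the rounding step is typically bypassed with a straight-through gradient estimator; the sketch shows only the forward quantization that fixes the codebook geometry the abstract alludes to.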