Audio ML Papers

Last 7 Days (November 26 - December 03, 2025)

Subcategories: All (17) | Speech Synthesis (5) | Music Synthesis (1) | Ambient Synthesis (1) | Quality Assessment (0) | Enhancement (1) | ASR (1) | Other (8)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 90)
Arnesh Batra, Dev Sharma, Krish Thukral ... · Transactions on Machine Learning Research
The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing eco...
#2 TOP PAPER (Score: 83)
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq ... · arXiv
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent adva...
#3 TOP PAPER (Score: 83)
Yicheng Zhong, Peiji Yang, Zhisheng Wang · arXiv
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectu...
Monday, December 01, 2025
Pengfei Sun, Wenyu Jiang, Paul Devos ... · IEEE Transactions on Audio, Speech and Language Processing, 2025
Advanced deep learning architectures, particularly recurrent neural networks (RNNs), have been widely applied in audio, bioacoustic, and biomedical signal analysis, especially in data-scarce environments. While gated RNNs remain effective, they can be relatively over-parameterise...
Tal Shuster, Eliya Nachmani · arXiv
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geomet...
Saturday, November 29, 2025
Arnesh Batra, Dev Sharma, Krish Thukral ... · Transactions on Machine Learning Research
The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing eco...
Siyu Wang, Haitao Li · arXiv
Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS,...
S M Asiful Islam Saky, Md Rashidul Islam, Md Saiful Arefin ... · arXiv
Respiratory diseases remain major global health challenges, and traditional auscultation is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning framework for automatic lung-disease detect...
Friday, November 28, 2025
Chen Li, Peiji Yang, Yicheng Zhong ... · AAAI 2026
Recent advances in Speech Large Language Models (Speech LLMs) have led to great progress in speech understanding tasks such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). However, whether these models can achieve human-level auditory perception, parti...
Thursday, November 27, 2025
Juan Ignacio Alvarez-Trejos, Sergio A. Balanya, Daniel Ramos ... · arXiv
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing m...
Jiatong Shi, Haoran Wang, William Chen ... · ASRU 2025
Neural speech codecs have achieved strong performance in low-bitrate compression, but residual vector quantization (RVQ) often suffers from unstable training and ineffective decomposition, limiting reconstruction quality and efficiency. We propose PURE Codec (Progressive Unfoldin...
Juan Ignacio Alvarez-Trejos, Sergio A. Balanya, Daniel Ramos ... · arXiv
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing m...
Wednesday, November 26, 2025
Ivan Kalthoff, Marcel Rey, Raphael Wittkowski · arXiv
Wave-guide-based physical systems provide a promising route toward energy-efficient analog computing beyond traditional electronics. Within this landscape, acoustic neural networks represent a promising approach for achieving low-power computation in environments where electronic...
Bruno Padovese, Fabio Frazao, Michael Dowd ... · arXiv
Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective ...
Jionghao Han, Jiatong Shi, Zhuoyan Tao ... · arXiv
Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which ...
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq ... · arXiv
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent adva...
Yicheng Zhong, Peiji Yang, Zhisheng Wang · arXiv
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectu...
Zhisheng Zheng, Xiaohang Sun, Tuan Dinh ... · arXiv
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech...
Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen ... · arXiv
Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts ...
Kexin Li, Xiao Hu, Ilya Grishchenko ... · arXiv
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is to watermark it so that it can be easily distinguished from genuine audio. As th...