Audio ML Papers

Week of November 23 - November 30, 2025

Subcategories: All (28) | Speech Synthesis (6) | Music Synthesis (5) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (1) | ASR (1) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 90)
Arnesh Batra, Dev Sharma, Krish Thukral ... · Transactions on Machine Learning Research
The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing eco...
#2 TOP PAPER (Score: 85)
Plein Versace · arXiv
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing signals such as images, audio, and 3D scenes. However, existing INR frameworks -- including MLPs with Fourier features, SIREN, and multiresolution hash grids -- implicitly assume a \textit...
#3 TOP PAPER (Score: 83)
Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin ... · Journal of Supercomputing
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localiz...
Saturday, November 29, 2025
Arnesh Batra, Dev Sharma, Krish Thukral ... · Transactions on Machine Learning Research
The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing eco...
Siyu Wang, Haitao Li · arXiv
Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS,...
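The sub-1 kbps regime mentioned here is easy to make concrete: a semantic token stream's bitrate is just token rate times bits per token. A toy calculation with hypothetical numbers (the rates and codebook size below are illustrative assumptions, not figures from the paper):

```python
import math

def bitrate_bps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate of a token stream: token rate times bits needed per token."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token

# Hypothetical numbers: 50 tokens/s drawn from a 1024-entry codebook.
semantic = bitrate_bps(50, 1024)   # 50 * 10 = 500 bps
# A plain PCM stream for comparison: 8000 samples/s at 8 bits each.
pcm_8k = 8000 * 8                  # 64000 bps

print(f"semantic stream: {semantic:.0f} bps (< 1 kbps: {semantic < 1000})")
print(f"8 kHz / 8-bit PCM: {pcm_8k} bps")
```

This is why semantic approaches fit the sub-1 kbps budget where waveform codecs cannot; the trade-off the abstract points to is what gets lost (prosody, speaker identity) on the way down.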
S M Asiful Islam Saky, Md Rashidul Islam, Md Saiful Arefin ... · arXiv
Respiratory diseases remain major global health challenges, and traditional auscultation is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning framework for automatic lung-disease detect...
Friday, November 28, 2025
Chen Li, Peiji Yang, Yicheng Zhong ... · AAAI 2026
Recent advances in Speech Large Language Models (Speech LLMs) have led to great progress in speech understanding tasks such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). However, whether these models can achieve human-level auditory perception, parti...
Thursday, November 27, 2025
Juan Ignacio Alvarez-Trejos, Sergio A. Balanya, Daniel Ramos ... · arXiv
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing m...
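Calibration of frame-level probabilities, which this abstract argues is neglected, can be probed with something as simple as expected calibration error (ECE): bin the predicted probabilities and compare each bin's mean confidence to its empirical accuracy. A minimal sketch on synthetic frame scores (not the paper's evaluation protocol):

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece, total = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)   # mean predicted probability
        acc = sum(y for _, y in b) / len(b)    # empirical frequency of label 1
        ece += (len(b) / total) * abs(conf - acc)
    return ece

# Perfectly calibrated toy scores: P(speech)=0.8 frames are speech 80% of the time.
probs = [0.8] * 10
labels = [1] * 8 + [0] * 2
print(expected_calibration_error(probs, labels))   # ~0 for this toy case
```

A well-calibrated diarizer would score low here; miscalibrated confidences matter exactly when fusing multiple systems, as the abstract notes.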
Jiatong Shi, Haoran Wang, William Chen ... · ASRU 2025
Neural speech codecs have achieved strong performance in low-bitrate compression, but residual vector quantization (RVQ) often suffers from unstable training and ineffective decomposition, limiting reconstruction quality and efficiency. We propose PURE Codec (Progressive Unfoldin...
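Residual vector quantization, which this abstract says is hard to train stably, is conceptually simple: each stage quantizes the residual left by the previous stage, and the reconstruction is the sum of the selected codewords. A toy two-stage scalar version with hand-picked codebooks (illustrative only; real codecs learn vector codebooks end-to-end):

```python
def quantize(x, codebook):
    """Nearest-codeword quantization; returns (index, codeword)."""
    idx = min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))
    return idx, codebook[idx]

def rvq_encode(x, codebooks):
    """Each stage quantizes the residual left by the previous stages."""
    indices, residual = [], x
    for cb in codebooks:
        idx, q = quantize(residual, cb)
        indices.append(idx)
        residual -= q
    return indices

def rvq_decode(indices, codebooks):
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Coarse stage covers the range; fine stage refines within a coarse cell.
coarse = [-1.0, 0.0, 1.0]
fine = [-0.25, 0.0, 0.25]
codes = rvq_encode(0.7, [coarse, fine])
approx = rvq_decode(codes, [coarse, fine])
print(codes, approx)   # the second stage shrinks the error left by the first
```

The instability the abstract targets arises because later stages only ever see what earlier stages leave behind, so a poorly trained early codebook starves the rest.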
Wednesday, November 26, 2025
Ivan Kalthoff, Marcel Rey, Raphael Wittkowski · arXiv
Wave-guide-based physical systems provide a promising route toward energy-efficient analog computing beyond traditional electronics. Within this landscape, acoustic neural networks represent a promising approach for achieving low-power computation in environments where electronic...
Bruno Padovese, Fabio Frazao, Michael Dowd ... · arXiv
Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective ...
Jionghao Han, Jiatong Shi, Zhuoyan Tao ... · arXiv
Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which ...
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq ... · arXiv
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent adva...
Yicheng Zhong, Peiji Yang, Zhisheng Wang · arXiv
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectu...
Zhisheng Zheng, Xiaohang Sun, Tuan Dinh ... · arXiv
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech...
Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen ... · arXiv
Deepfake (DF) audio detectors still struggle to generalize to out of distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts ...
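The high-frequency artifacts this abstract describes can be quantified with a plain DFT: compare the spectral energy above a cutoff bin to the total. A self-contained sketch using only the standard library (the cutoff and signals are illustrative, not the paper's features):

```python
import cmath, math

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def hf_energy_ratio(x, cutoff_bin):
    """Fraction of spectral energy at or above cutoff_bin (up to Nyquist)."""
    spec = dft(x)
    half = spec[: len(x) // 2 + 1]          # non-negative frequencies only
    energy = [abs(c) ** 2 for c in half]
    return sum(energy[cutoff_bin:]) / sum(energy)

n = 64
low = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]           # bin 2
noisy = [s + 0.5 * math.sin(2 * math.pi * 28 * t / n) for t, s in enumerate(low)]

print(hf_energy_ratio(low, 16))    # ~0: no content above bin 16
print(hf_energy_ratio(noisy, 16))  # clearly nonzero: HF component at bin 28
```

Spectral bias means a detector trained the usual way may underweight exactly the band where this ratio separates real from generated audio.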
Kexin Li, Xiao Hu, Ilya Grishchenko ... · arXiv
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is by watermarking it, so that it can be easily distinguished from genuine audio. As th...
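As a minimal illustration of the watermarking idea (not the scheme studied in the paper), one can hide payload bits in the least significant bit of integer PCM samples; the payload is inaudible but machine-readable:

```python
def embed_lsb(samples, bits):
    """Overwrite the LSB of each leading sample with one payload bit."""
    return [(s & ~1) | b for s, b in zip(samples, bits)] + samples[len(bits):]

def extract_lsb(samples, n_bits):
    return [s & 1 for s in samples[:n_bits]]

pcm = [1000, -2313, 512, 77, -5, 0, 31, 1024]   # toy 16-bit samples
payload = [1, 0, 1, 1]
marked = embed_lsb(pcm, payload)

print(extract_lsb(marked, 4))                        # [1, 0, 1, 1]
print(max(abs(a - b) for a, b in zip(pcm, marked)))  # distortion of at most 1 LSB
```

Real audio watermarks must survive compression, resampling, and adversarial removal, which this LSB toy does not; robustness under such transformations is precisely what work in this area evaluates.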
Tuesday, November 25, 2025
Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra ... · 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 2025, pp. 1-8
Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach, the source overlap and correlation in m...
Sungjae Kim, Kihyun Na, Jinyoung Choi ... · arXiv
Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that o...
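The "simple pitch estimation algorithms" this abstract alludes to are often autocorrelation-based: the lag at which the signal is most similar to itself gives the period. A bare-bones estimator on a synthetic tone (illustrative; not the paper's method):

```python
import math

def autocorr_pitch(x, sr, f_min=80.0, f_max=1000.0):
    """Estimate F0 as sr / lag of the autocorrelation peak within [f_min, f_max]."""
    lag_min, lag_max = int(sr / f_max), int(sr / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(x[t] * x[t + lag] for t in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sr / best_lag

sr = 8000
tone = [math.sin(2 * math.pi * 200.0 * t / sr) for t in range(1024)]
print(autocorr_pitch(tone, sr))   # close to 200 Hz
```

On real vocals, octave errors and noise make this naive version fail in exactly the ways the abstract criticizes, which motivates stronger estimators inside APC systems.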
Rui Lin, Zhiyue Wu, Jiahe Le ... · arXiv
Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction wi...
Ilias Ibnyahya, Joshua D. Reiss · arXiv
We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ), enabling...
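A parametric-EQ section of the kind this abstract describes is typically a single biquad; the Audio EQ Cookbook peaking-filter formulas give its coefficients from centre frequency, Q, and gain. A sketch of coefficient computation plus a frequency-response check (the specific parameters are illustrative, not the paper's design):

```python
import cmath, math

def peaking_eq(fs, f0, q, gain_db):
    """RBJ Audio EQ Cookbook peaking filter -> normalized biquad (b, a)."""
    big_a = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b = [1 + alpha * big_a, -2 * math.cos(w0), 1 - alpha * big_a]
    a = [1 + alpha / big_a, -2 * math.cos(w0), 1 - alpha / big_a]
    return [v / a[0] for v in b], [v / a[0] for v in a]

def magnitude(b, a, w):
    """|H(e^jw)| of a biquad at normalized angular frequency w."""
    num = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(b))
    den = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
    return abs(num / den)

# A -6 dB cut at 1 kHz, Q = 2, 48 kHz: one SOS of a larger attenuation PEQ.
fs, f0 = 48000, 1000.0
b, a = peaking_eq(fs, f0, 2.0, -6.0)
w0 = 2 * math.pi * f0 / fs
print(20 * math.log10(magnitude(b, a, w0)))  # -6 dB at the centre frequency
print(magnitude(b, a, 0.0))                  # ~1 at DC: peaking EQ is flat there
```

Cascading several such SOS sections, each targeting a band, is the standard way to approximate an arbitrary attenuation curve inside an FDN feedback loop.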
Monday, November 24, 2025
Ellie L. Zhang, Duoduo Liao, Callie C. Liao · IEEE Big Data 2025
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dy...
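The frequency-modulated chirps this abstract mentions are straightforward to synthesize: integrate a time-varying instantaneous frequency into a running phase and take its sine. A minimal linear-chirp generator (the parameters are illustrative, not taken from the paper):

```python
import math

def chirp(f_start, f_end, duration, sr):
    """Linear FM chirp: phase is the running integral of instantaneous frequency."""
    n = int(duration * sr)
    samples, phase = [], 0.0
    for t in range(n):
        f_inst = f_start + (f_end - f_start) * t / n   # sweep linearly in frequency
        phase += 2 * math.pi * f_inst / sr
        samples.append(math.sin(phase))
    return samples

sr = 22050
upsweep = chirp(2000.0, 6000.0, 0.05, sr)   # a 50 ms rising chirp
print(len(upsweep))
```

Layering many such chirps with varied sweeps, amplitude envelopes, and onset times is the raw material of a synthetic multi-species soundscape; the hard part the paper addresses is making the result dynamic and ecologically plausible.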
Aman Verma, Keshav Samdani, Mohd. Samiuddin Shafi · arXiv
This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initia...
Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin ... · Journal of Supercomputing
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localiz...
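The beamforming half of such a system can be illustrated by its simplest form, delay-and-sum: compensate each microphone's propagation delay for a hypothesized direction and average, which reinforces a source arriving from that direction. A toy two-microphone version with integer-sample delays (illustrative; the paper's embedded pipeline is more sophisticated):

```python
import math

def delay_and_sum(channels, delays):
    """Align each channel by its integer-sample steering delay, then average."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[d + t] for ch, d in zip(channels, delays)) / len(channels)
            for t in range(n)]

sr = 8000
src = [math.sin(2 * math.pi * 500 * t / sr) for t in range(256)]
# The source reaches mic 1 three samples later than mic 0.
mic0 = src
mic1 = [0.0] * 3 + src[:-3]

aligned = delay_and_sum([mic0, mic1], [0, 3])   # steer toward the true direction
wrong = delay_and_sum([mic0, mic1], [0, 0])     # steer elsewhere

power = lambda x: sum(v * v for v in x) / len(x)
print(power(aligned) > power(wrong))   # correct steering reinforces the source
```

Coupling the steering delays to a visual tracker's direction estimate is what turns this into the tracking-plus-beamforming localization loop the abstract describes.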
Huadai Liu, Kaicheng Luo, Wen Wang ... · arXiv
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single...
Congren Dai, Yue Yang, Krinos Li ... · arXiv
Understanding complete musical scores requires reasoning over symbolic structures such as pitch, rhythm, harmony, and form. Despite the rapid progress of Large Language Models (LLMs) and Vision-Language Models (VLMs) in natural language and multimodal tasks, their ability to comp...
Sunday, November 23, 2025
Plein Versace · arXiv
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing signals such as images, audio, and 3D scenes. However, existing INR frameworks -- including MLPs with Fourier features, SIREN, and multiresolution hash grids -- implicitly assume a \textit...
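The "MLPs with Fourier features" this abstract lists encode a coordinate x into cos/sin pairs at several frequencies before the MLP, so the network can represent high-frequency signal content. A minimal scalar encoding (the frequency bands are illustrative; real INRs sample or learn them):

```python
import math

def fourier_features(x, freqs):
    """Map a scalar coordinate to interleaved cos/sin features."""
    feats = []
    for b in freqs:
        feats.append(math.cos(2 * math.pi * b * x))
        feats.append(math.sin(2 * math.pi * b * x))
    return feats

freqs = [1.0, 2.0, 4.0, 8.0]         # octave-spaced frequency bands
enc = fourier_features(0.25, freqs)
print(len(enc))   # 8: two features per frequency
```

The fixed frequency grid is exactly the kind of built-in assumption the abstract is pushing back on: the encoding bakes a particular notion of scale into the representation before any learning happens.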
Chunyu Qiang, Kang Yin, Xiaopeng Wang ... · arXiv
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constr...
Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul ... · arXiv
Audio classifiers frequently face domain shift, when models trained on one dataset lose accuracy on data recorded in acoustically different conditions. Previous Test-Time Adaptation (TTA) research in speech and sound analysis often evaluates models under fixed or mismatched noise...