Audio ML Papers

Week of November 23 - November 30, 2025

Subcategories: All (20) | Speech Synthesis (4) | Music Synthesis (4) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (0) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (11)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 85)
Plein Versace · arXiv
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing signals such as images, audio, and 3D scenes. However, existing INR frameworks -- including MLPs with Fourier features, SIREN, and multiresolution hash grids -- implicitly assume a ...
#2 TOP PAPER (Score: 83)
Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin ... · Journal of Supercomputing
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localiz...
#3 TOP PAPER (Score: 83)
Ellie L. Zhang, Duoduo Liao, Callie C. Liao · IEEE Big Data 2025
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dy...
Wednesday, November 26, 2025
Ivan Kalthoff, Marcel Rey, Raphael Wittkowski · arXiv
Wave-guide-based physical systems provide a promising route toward energy-efficient analog computing beyond traditional electronics. Within this landscape, acoustic neural networks represent a promising approach for achieving low-power computation in environments where electronic...
Jionghao Han, Jiatong Shi, Zhuoyan Tao ... · arXiv
Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which ...
Benoît Giniès, Xiaoyu Bie, Olivier Fercoq ... · arXiv
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent adva...
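To make the task definition above concrete, here is a minimal sketch of how a narrowband input and its wideband target relate. It does not implement the paper's method. The DFT-masking low-pass, the helper names (`low_pass`, `band_energy`), and the 16 kHz / 4 kHz settings are all assumptions chosen for illustration.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform, fine for short toy signals."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def low_pass(x, sample_rate, cutoff_hz):
    """Zero every DFT bin above cutoff_hz (hypothetical brick-wall filter)."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bins k and n - k carry the same physical frequency.
        freq = min(k, n - k) * sample_rate / n
        if freq > cutoff_hz:
            spectrum[k] = 0
    return idft(spectrum)


def band_energy(x, sample_rate, lo_hz, hi_hz):
    """Energy of the one-sided spectrum between lo_hz and hi_hz inclusive."""
    n = len(x)
    spectrum = dft(x)
    return sum(abs(spectrum[k]) ** 2
               for k in range(n // 2 + 1)
               if lo_hz <= k * sample_rate / n <= hi_hz) / n


# Toy wideband signal: a 1 kHz tone plus a weaker 6 kHz tone (assumed values).
SAMPLE_RATE = 16000
wideband = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
            + 0.5 * math.sin(2 * math.pi * 6000 * t / SAMPLE_RATE)
            for t in range(64)]

# The low-pass counterpart is what a bandwidth-extension model would receive.
narrowband = low_pass(wideband, SAMPLE_RATE, cutoff_hz=4000)
```

Comparing `band_energy(narrowband, SAMPLE_RATE, 4001, 8000)` with the same call on `wideband` shows the content a bandwidth-extension system would have to reconstruct.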
Yicheng Zhong, Peiji Yang, Zhisheng Wang · arXiv
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectu...
Zhisheng Zheng, Xiaohang Sun, Tuan Dinh ... · arXiv
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech...
Ido Nitzan Hidekel, Gal Lifshitz, Khen Cohen ... · arXiv
Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts ...
Kexin Li, Xiao Hu, Ilya Grishchenko ... · arXiv
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is watermarking it, so that it can be easily distinguished from genuine audio. As th...
Tuesday, November 25, 2025
Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra ... · 2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy, 2025, pp. 1-8
Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach, the source overlap and correlation in m...
Sungjae Kim, Kihyun Na, Jinyoung Choi ... · arXiv
Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that o...
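As a point of reference for what "aligning pitch deviations with the intended musical notes" means in the simplest case, here is a sketch of a naive baseline that snaps a frequency to the nearest 12-tone equal-tempered note. This is not the method of the paper above. The function names and the A4 = 440 Hz reference are assumptions for illustration.

```python
import math

A4_HZ = 440.0  # assumed tuning reference


def snap_to_semitone(freq_hz, reference_hz=A4_HZ):
    """Return the equal-tempered note frequency closest to freq_hz."""
    semitones = 12.0 * math.log2(freq_hz / reference_hz)
    return reference_hz * 2.0 ** (round(semitones) / 12.0)


def deviation_cents(freq_hz, reference_hz=A4_HZ):
    """Signed distance in cents from freq_hz to its nearest note."""
    target = snap_to_semitone(freq_hz, reference_hz)
    return 1200.0 * math.log2(freq_hz / target)


# A slightly sharp A4 is pulled back to 440 Hz.
corrected = snap_to_semitone(450.0)
```

Nearest-note snapping needs no reference pitch, but it assumes the singer intended the closest semitone, which is not always true for large deviations.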
Rui Lin, Zhiyue Wu, Jiahe Le ... · arXiv
Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction wi...
Ilias Ibnyahya, Joshua D. Reiss · arXiv
We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ), enabling...
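For readers unfamiliar with the building block named above, here is a minimal sketch of one parametric-EQ second-order section and the gain of a cascade of them. It uses the widely known peaking-filter formulas from the Audio EQ Cookbook and is not the design procedure proposed in the paper. The function names and the example parameters are assumptions.

```python
import cmath
import math


def peaking_sos(center_hz, gain_db, q, sample_rate):
    """Return normalized (b, a) coefficients of one peaking biquad section."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * amp, -2.0 * math.cos(w0), 1.0 - alpha * amp]
    a = [1.0 + alpha / amp, -2.0 * math.cos(w0), 1.0 - alpha / amp]
    # Normalize so that a[0] == 1, the usual SOS convention.
    return [c / a[0] for c in b], [c / a[0] for c in a]


def cascade_gain_db(sections, freq_hz, sample_rate):
    """Magnitude in dB of cascaded sections at a single frequency."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    response = 1.0 + 0j
    for b, a in sections:
        numerator = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
        denominator = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
        response *= numerator / denominator
    return 20.0 * math.log10(abs(response))


# One section cutting 6 dB around 4 kHz at a 48 kHz sample rate (assumed values).
section = peaking_sos(center_hz=4000.0, gain_db=-6.0, q=1.0, sample_rate=48000.0)
gain_at_center = cascade_gain_db([section], 4000.0, 48000.0)
```

A peaking section reaches its full gain at the center frequency and returns to 0 dB at DC, so several sections can shape attenuation in different bands while largely leaving the others alone.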
Monday, November 24, 2025
Ellie L. Zhang, Duoduo Liao, Callie C. Liao · IEEE Big Data 2025
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dy...
Aman Verma, Keshav Samdani, Mohd. Samiuddin Shafi · arXiv
This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initia...
Jorge Ortigoso-Narro, Jose A. Belloch, Adrian Amor-Martin ... · Journal of Supercomputing
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localiz...
Huadai Liu, Kaicheng Luo, Wen Wang ... · arXiv
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single...
Congren Dai, Yue Yang, Krinos Li ... · arXiv
Understanding complete musical scores requires reasoning over symbolic structures such as pitch, rhythm, harmony, and form. Despite the rapid progress of Large Language Models (LLMs) and Vision-Language Models (VLMs) in natural language and multimodal tasks, their ability to comp...
Sunday, November 23, 2025
Plein Versace · arXiv
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing signals such as images, audio, and 3D scenes. However, existing INR frameworks -- including MLPs with Fourier features, SIREN, and multiresolution hash grids -- implicitly assume a ...
Chunyu Qiang, Kang Yin, Xiaopeng Wang ... · arXiv
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constr...
Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul ... · arXiv
Audio classifiers frequently face domain shift, when models trained on one dataset lose accuracy on data recorded in acoustically different conditions. Previous Test-Time Adaptation (TTA) research in speech and sound analysis often evaluates models under fixed or mismatched noise...