Audio ML Papers

Week of October 26 - November 02, 2025

Subcategories: All (30) | Speech Synthesis (6) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Assessment (1) | Enhancement (3) | ASR (1) | Other (15)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Kang Zhang, Trung X. Pham, Suyeon Lee ... · NeurIPS 2025
We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the gene...
#2 TOP PAPER (Score: 92)
Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen ... · arXiv
Everyday speech conveys far more than words: it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the r...
#3 TOP PAPER (Score: 84)
Amine Razig, Youssef Soulaymani, Loubna Benabbou ... · arXiv
Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls span low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduce a multi-step, attention-guided framewo...
Saturday, November 01, 2025
Kazuya Yokota, Ryosuke Harakawa, Masaaki Baba ... · arXiv
The analysis of speech production based on physical models of the vocal folds and vocal tract is essential for studies on vocal-fold behavior and linguistic research. This paper proposes a speech production analysis method using physics-informed neural networks (PINNs). The netwo...
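For readers unfamiliar with PINNs, here is a minimal sketch of the training objective they use: a data-fit term plus a physics residual enforced at collocation points via automatic differentiation. The 1-D wave equation below is a stand-in; the paper's actual vocal-fold and vocal-tract equations are not shown in this excerpt.

```python
# Minimal PINN sketch (assumption: generic 1-D wave equation as a stand-in
# for the paper's vocal-tract acoustics, which the excerpt does not specify).
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)

def physics_residual(xt, c=340.0):
    """Residual of u_tt - c^2 * u_xx = 0 at collocation points xt = (x, t)."""
    xt = xt.requires_grad_(True)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0:1], grads[:, 1:2]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0:1]
    u_tt = torch.autograd.grad(u_t.sum(), xt, create_graph=True)[0][:, 1:2]
    return u_tt - (c ** 2) * u_xx

xt_colloc = torch.rand(256, 2)    # collocation points in (x, t)
xt_data = torch.rand(64, 2)       # points with observed values
u_obs = torch.zeros(64, 1)        # placeholder observations
loss = (physics_residual(xt_colloc) ** 2).mean() \
     + ((net(xt_data) - u_obs) ** 2).mean()
loss.backward()
```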
Swapnil Bhosale, Cosmin Frateanu, Camilla Clark ... · arXiv
Deploying accurate event detection on resource-constrained devices is challenged by the trade-off between performance and computational cost. While Early-Exit (EE) networks offer a solution through adaptive computation, they often fail to enforce a coherent hierarchical structure...
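For context, a minimal sketch of the generic Early-Exit inference pattern (batch size 1; the paper's hierarchical exit structure is not shown in this excerpt):

```python
# Generic Early-Exit inference: easy inputs stop at a shallow exit head,
# hard inputs use the full network. Threshold and structure are illustrative.
import torch

def early_exit_forward(x, blocks, exit_heads, threshold=0.9):
    """Run blocks in sequence; return at the first exit head whose
    max softmax probability clears the confidence threshold."""
    h = x
    for block, head in zip(blocks, exit_heads):
        h = block(h)
        probs = torch.softmax(head(h), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:   # assumes batch size 1
            return pred, conf          # exit early and save compute
    return pred, conf                  # fall through to the final classifier
```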
Friday, October 31, 2025
Zongyang Du, Shreeram Suresh Chandra, Ismail Rasim Ulgen ... · arXiv
Everyday speech conveys far more than words: it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the r...
Thursday, October 30, 2025
Weifei Jin, Yuxin Cao, Junjie Su ... · NeurIPS 2025
Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically ...
Hitomi Jin Ling Tee, Chaoren Wang, Zijie Zhang ... · arXiv
The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoke...
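For reference, the WER metric the authors argue against is plain word-level edit distance, so a benign insertion is penalized exactly as heavily as a meaning-changing substitution:

```python
# Word Error Rate: WER = (substitutions + deletions + insertions) / reference length.
def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ~ 0.33
```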
Wednesday, October 29, 2025
Amine Razig, Youssef Soulaymani, Loubna Benabbou ... · arXiv
Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls span low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduce a multi-step, attention-guided framewo...
Jiarong Du, Zhan Jin, Peijun Yang ... · arXiv
Audio-visual speech enhancement (AVSE) uses visual auxiliary information to extract a target speaker's speech from mixed audio. Real-world scenarios often involve complex acoustic environments with various interfering sounds and reverberation. Mo...
Diego Torres, Axel Roebel, Nicolas Obin · arXiv
We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vect...
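A minimal sketch of the perturbation as described (flatten the F0 contour, then shift it randomly, while the true F0 conditions the model); the function name and the ±4-semitone shift range are illustrative assumptions:

```python
# Hedged sketch of the described F0 perturbation; not the paper's code.
import numpy as np

def perturb_f0(f0: np.ndarray, max_shift_semitones: float = 4.0) -> np.ndarray:
    voiced = f0 > 0                                        # treat 0 as unvoiced
    flat = np.where(voiced, f0[voiced].mean(), 0.0)        # flatten the contour
    shift = np.random.uniform(-max_shift_semitones, max_shift_semitones)
    return np.where(voiced, flat * 2.0 ** (shift / 12.0), 0.0)  # random shift

f0_true = np.array([0.0, 110.0, 112.0, 115.0, 0.0])
f0_input = perturb_f0(f0_true)   # fed to the codec; f0_true is the conditioning
```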
Xiaoyu Yang, Yifan Yang, Zengrui Jin ... · arXiv
Self-Supervised Learning (SSL) excels at learning generic representations of acoustic signals, yet prevailing methods remain domain-specific, tailored to either speech or general audio, hindering the development of a unified representation model with a comprehensive capability ov...
Christodoulos Benetatos, Yongyi Zang, Randal Leistikow · arXiv
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunki...
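Back-of-the-envelope arithmetic behind the quadratic-cost claim (frame count and chunk size are illustrative):

```python
# Full self-attention over T frames touches T*T pairs; chunked attention
# with chunk size C touches only (T//C) * C * C = T * C pairs.
T, C = 16_384, 256
full_cost = T * T                  # 268,435,456 pairwise interactions
chunked_cost = (T // C) * C * C    #   4,194,304 pairwise interactions
print(full_cost / chunked_cost)    # 64.0x fewer
```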
Tuesday, October 28, 2025
Kang Zhang, Trung X. Pham, Suyeon Lee ... · NeurIPS 2025
We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the gene...
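For context, the standard classifier-free guidance rule that MGAudio departs from, written for a flow model's velocity field (standard formulation, not notation taken from the paper):

```latex
% Conditional and unconditional predictions blended with guidance weight w:
\[
  \hat{v}(x, c) = v_\theta(x, \varnothing)
                + w \left( v_\theta(x, c) - v_\theta(x, \varnothing) \right)
\]
```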
Yassine El Kheir, Fabian Ritter-Guttierez, Arnab Das ... · arXiv
Recent synthetic speech detection models typically adapt a pre-trained SSL model via finetuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative. However, existing methods lack the specific inductive biases required to model the mu...
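To make the PEFT idea concrete, here is a minimal sketch of one widely used scheme, LoRA, which trains only a low-rank update to a frozen weight. This illustrates PEFT generally, not the paper's own adapter design:

```python
# LoRA sketch: y = W x + (alpha / r) * B A x, with W frozen.
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # frozen pre-trained weight
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # B is zero-initialized, so training starts from the base model's output
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(torch.nn.Linear(768, 768))  # only A and B are trained
```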
Ziyang Zhang, Yifan Gao, Xuenan Xu ... · arXiv
Codec-based text-to-speech (TTS) models have recently gained traction for their efficiency and strong performance in voice cloning. However, codec-based TTS faces limitations due to the challenges of pretraining robust speech codecs and the quality degradation introduced by quant...
Harshavardhana T. Gowda, Lee M. Miller · arXiv
We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the elec...
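A hedged sketch of how such a linear relationship can be tested: fit a closed-form ridge probe from one feature stream to the other and inspect per-dimension R². Shapes and feature dimensions are illustrative assumptions, not taken from the paper:

```python
# Linear probe from EMG features to SSL speech features via ridge regression.
import numpy as np

def ridge_fit(X: np.ndarray, Y: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Closed-form ridge: W = (X^T X + lam I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

emg = np.random.randn(1000, 64)      # frames x EMG feature dim (placeholder)
ssl = np.random.randn(1000, 768)     # frames x SSL feature dim (placeholder)
W = ridge_fit(emg, ssl)
pred = emg @ W
# Per-dimension R^2 indicates how linear the EMG -> SSL mapping is.
r2 = 1 - ((ssl - pred) ** 2).sum(0) / ((ssl - ssl.mean(0)) ** 2).sum(0)
```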
Matteo Calafà, Yuanxin Xia, Cheol-Ho Jeong · arXiv
We present a novel neural network architecture for the efficient prediction of sound fields in two and three dimensions. The network is designed to automatically satisfy the Helmholtz equation, ensuring that the outputs are physically valid. Therefore, the method can effectively ...
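The constraint being baked into the architecture is the homogeneous Helmholtz equation for acoustic pressure p at wavenumber k:

```latex
\[
  \nabla^2 p(\mathbf{x}) + k^2 \, p(\mathbf{x}) = 0,
  \qquad k = \frac{\omega}{c}
\]
```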
Baizhou Lin, Yuetong Fang, Renjing Xu ... · arXiv
The Tsetlin Machine (TM) has recently attracted attention as a low-power alternative to neural networks due to its simple and interpretable inference mechanisms. However, its performance on speech-related tasks remains limited. This paper proposes TsetlinKWS, the first algorithm-...
Jonas Hein, Lazaros Vlachopoulos, Maurits Geert Laurent Olthof ... · arXiv
Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene represen...
Zihan Liu, Zhikang Niu, Qiuyang Xiao ... · arXiv
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that i...
Monday, October 27, 2025
Jing-Xuan Zhang, Genshun Wan, Jin Li ... · arXiv
Unified speech recognition aims to perform auditory, visual, and audiovisual speech recognition within a single model framework. While speech foundation models (SFMs) have demonstrated remarkable performance in auditory tasks, their adaptation to multimodal scenarios remains unde...
Bohan Li, Wenbin Huang, Yuhang Qiu ... · arXiv
Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sens...
Yuepeng Jiang, Huakang Chen, Ziqian Ning ... · arXiv
Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often...
Hanke Xie, Haopeng Lin, Wenxiao Cao ... · arXiv
Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical ...
Sarabeth S. Mullins, Georg Götz, Eric Bezzam ... · arXiv
Accurate far-field speech datasets are critical for tasks such as automatic speech recognition (ASR), dereverberation, speech enhancement, and source separation. However, current datasets are limited by the trade-off between acoustic realism and scalability. Measured corpora prov...
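For context on the realism-vs-scalability trade-off, simulated far-field corpora are typically built by convolving clean speech with a room impulse response and adding noise at a target SNR. A minimal sketch (helper name and SNR handling are assumptions, not from the paper):

```python
# Simulated far-field speech: reverberant = clean * RIR, plus scaled noise.
import numpy as np
from scipy.signal import fftconvolve

def simulate_far_field(clean: np.ndarray, rir: np.ndarray,
                       noise: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    reverberant = fftconvolve(clean, rir)[: len(clean)]
    noise = noise[: len(clean)]          # assumes noise is at least as long
    gain = np.sqrt((reverberant ** 2).mean() /
                   ((noise ** 2).mean() * 10 ** (snr_db / 10)))
    return reverberant + gain * noise
```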
Máté Gedeon, Péter Mihajlik · arXiv
We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly r...
Jiyoung Hong, Yoonseo Chung, Seungyeon Oh ... · arXiv
Audio deepfakes pose a growing threat, already exploited in fraud and misinformation. A key challenge is ensuring detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current system...
Sunday, October 26, 2025
Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer ... · NeurIPS 2025
Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edite...
Yuval Bar Ilan, Boaz Rafaely, Vladimir Tourbabin · arXiv
Speech enhancement is a fundamental challenge in signal processing, particularly when robustness is required across diverse acoustic conditions and microphone setups. Deep learning methods have been successful for speech enhancement, but often assume fixed array geometries, limit...
Wenming Tu, Guanrou Yang, Ruiqi Yan ... · arXiv
Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we intr...