Audio ML Papers

Week of October 12 - October 19, 2025

Subcategories: All (34) | Speech Synthesis (7) | Music Synthesis (6) | Ambient Synthesis (0) | Quality Assessment (3) | Enhancement (1) | ASR (3) | Other (14)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee · arXiv
Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysi...
#2 TOP PAPER (Score: 83)
KiHyun Nam, Jongmin Choi, Hyeongkeun Lee ... · arXiv
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that...
#3 TOP PAPER (Score: 83)
Xuyao Deng, Yanjie Sun, Yong Dou ... · arXiv
Scaling laws have profoundly shaped our understanding of model performance in computer vision and natural language processing, yet their application to general audio representation learning remains underexplored. A key challenge lies in the multifactorial nature of general audio ...
Saturday, October 18, 2025
Danielle Yaffe, Ferdinand Campe, Prachi Sharma ... · arXiv
Audio-visual speech enhancement (AVSE) has been found to be particularly useful at low signal-to-noise ratios (SNRs) due to the immunity of the visual features to acoustic noise. However, a significant gap exists in AVSE methods tailored to enhance spatial audio under low-SNR cond...
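For readers unfamiliar with the SNR conditions this entry refers to, the following is a minimal, generic sketch (unrelated to the paper's actual pipeline) of mixing speech with noise at a target signal-to-noise ratio; the synthetic signals are stand-ins for real audio.

    import numpy as np

    def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
        """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then mix."""
        # Loop/trim the noise to match the speech length (illustrative only).
        if len(noise) < len(speech):
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        noise = noise[:len(speech)]

        speech_power = np.mean(speech ** 2) + 1e-12
        noise_power = np.mean(noise ** 2) + 1e-12
        # Noise power needed to reach the target SNR (in dB).
        target_noise_power = speech_power / (10 ** (snr_db / 10))
        scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
        return speech + scaled_noise

    # Toy usage with synthetic signals (stand-ins for real recordings).
    rng = np.random.default_rng(0)
    speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
    noise = rng.normal(size=16000)
    noisy = mix_at_snr(speech, noise, snr_db=-5.0)  # -5 dB: a "low SNR" condition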
Friday, October 17, 2025
Yueqian Lin, Zhengmian Hu, Jayakumar Subramanian ... · IEEE ASRU 2025 Demo Track
Effective human-AI collaboration on complex reasoning tasks requires that users understand and interact with the model's process, not just receive an output. However, the monolithic text from methods like Chain-of-Thought (CoT) prevents this, as current interfaces lack real-time ...
Chitralekha Gupta, Soundarya Ramesh, Praveen Sasikumar ... · NeurIPS 2025 (Datasets and Benchmarks Track)
Unmanned Aerial Vehicles (UAVs), or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods, which are prone to failure under low visibility or occlusion. Drone-based audio perception offers promise...
Azalea Gui, Woosung Choi, Junghyun Koo ... · arXiv
The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and extent of contamination are typically un...
Thursday, October 16, 2025
Hui Wang, Jinghua Zhao, Cheng Liu ... · arXiv
Text-to-audio (TTA) is rapidly advancing, with broad potential in virtual reality, accessibility, and creative media. However, evaluating TTA quality remains difficult: human ratings are costly and limited, while existing objective metrics capture only partial aspects of perceptu...
Hui Wang, Jinghua Zhao, Yifan Yang ... · arXiv
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and langua...
Jiangyu Han, Ruoyu Wang, Yoshiki Masuyama ... · arXiv
Self-supervised models such as WavLM have demonstrated strong performance for neural speaker diarization. However, these models are typically pre-trained on single-channel recordings, limiting their effectiveness in multi-channel scenarios. Existing diarization systems built on t...
Wednesday, October 15, 2025
Xuanchen Wang, Heng Wang, Weidong Cai · arXiv
Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture rhythmic and structural cues that drive move...
Ruitao Feng, Bixi Zhang, Sheng Liang ... · arXiv
Aligning pretrained audio encoders and Large Language Models (LLMs) offers a promising, parameter-efficient path to building powerful multimodal agents. However, existing methods often require costly full-model finetuning or rely on static adapters that may lack expressive power....
Zhenyu Liu, Yunxin Li, Xuanyu Zhang ... · arXiv
Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. Th...
Sungnyun Kim, Kangwook Jang, Sungwoo Cho ... · arXiv
This paper introduces a new paradigm for generative error correction (GER) framework in audio-visual speech recognition (AVSR) that reasons over modality-specific evidence directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose in...
Tuesday, October 14, 2025
Junnuo Wang · Journal of Artificial Intelligence Research (JAIR)
Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the ...
Yakun Song, Xiaobin Zhuang, Jiawei Chen ... · arXiv
Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot tex...
Xinlu He, Swayambhu Nath Ray, Harish Mallidi ... · arXiv
Unified architectures in multimodal large language models (MLLM) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations, which disregard the inherently continuo...
Guanxin Jiang, Andreas Brendel, Pablo M. Delgado ... · arXiv
This paper presents the Deep learning-based Perceptual Audio Quality metric (DeePAQ) for evaluating general audio quality. Our approach leverages metric learning together with the music foundation model MERT, guided by surrogate labels, to construct an embedding space that captur...
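The DeePAQ entry above builds an embedding space via metric learning over foundation-model features guided by surrogate labels. As a rough sketch of that general recipe only (the paper's actual losses, labels, and MERT integration are not reproduced here), the snippet below trains a small projection head with a triplet margin loss; the embeddings and surrogate labels are random stand-ins.

    import torch
    import torch.nn as nn

    # Hypothetical stand-ins: 512-dim frozen foundation-model embeddings and
    # surrogate quality labels in [0, 1]; real inputs would come from an audio encoder.
    torch.manual_seed(0)
    embeddings = torch.randn(256, 512)
    labels = torch.rand(256)

    projection = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 64))
    criterion = nn.TripletMarginLoss(margin=0.2)
    optimizer = torch.optim.Adam(projection.parameters(), lr=1e-3)

    for step in range(100):
        idx = torch.randint(0, 256, (32,))
        anchor = embeddings[idx]
        # Positive: the sample whose surrogate label is closest to the anchor's;
        # negative: the one whose label is farthest (a crude mining heuristic).
        dists = (labels[idx].unsqueeze(1) - labels.unsqueeze(0)).abs()
        dists.scatter_(1, idx.unsqueeze(1), float("inf"))  # exclude the anchor itself
        pos = embeddings[dists.argmin(dim=1)]
        neg = embeddings[(labels[idx].unsqueeze(1) - labels.unsqueeze(0)).abs().argmax(dim=1)]

        loss = criterion(projection(anchor), projection(pos), projection(neg))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()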
Wanying Ge, Xin Wang, Junichi Yamagishi · arXiv
Deepfake speech attribution remains challenging for existing solutions. Classifier-based solutions often fail to generalize to domain-shifted samples, and watermarking-based solutions are easily compromised by distortions like codec compression or malicious removal attacks. To ad...
Monday, October 13, 2025
Kuan-Yi Lee, Tsung-En Lin, Hung-Yi Lee · arXiv
Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysi...
KiHyun Nam, Jongmin Choi, Hyeongkeun Lee ... · arXiv
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that...
Téo Guichoux, Théodor Lemerle, Shivam Mehta ... · arXiv
Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesize...
Jinchuan Tian, Sang-gil Lee, Zhifeng Kong ... · arXiv
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces Unified Audio Lan...
Xuyao Deng, Yanjie Sun, Yong Dou ... · arXiv
Scaling laws have profoundly shaped our understanding of model performance in computer vision and natural language processing, yet their application to general audio representation learning remains underexplored. A key challenge lies in the multifactorial nature of general audio ...
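Since this entry studies scaling laws for audio representation learning, a generic illustration of the standard methodology may help (it is not the paper's formulation or data): fitting a saturating power law L(N) = a * N^(-alpha) + c to loss-versus-model-size measurements and extrapolating from the fit.

    import numpy as np
    from scipy.optimize import curve_fit

    def power_law(n, a, alpha, c):
        """Saturating power law common in scaling-law studies: L(N) = a * N^(-alpha) + c."""
        return a * np.power(n, -alpha) + c

    # Synthetic (model size, validation loss) points standing in for real measurements.
    model_sizes = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])
    losses = power_law(model_sizes, a=50.0, alpha=0.3, c=1.2)
    losses += np.random.default_rng(0).normal(0, 0.01, size=losses.shape)

    params, _ = curve_fit(power_law, model_sizes, losses, p0=[10.0, 0.5, 1.0], maxfev=10000)
    a_hat, alpha_hat, c_hat = params
    print(f"fitted exponent alpha ~ {alpha_hat:.3f}, irreducible loss c ~ {c_hat:.3f}")

    # Extrapolate to a larger model size under the fitted law.
    print(f"predicted loss at 1e9 params ~ {power_law(1e9, *params):.3f}")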
Yi Wang, Yinfeng Yu, Fuchun Sun ... · International Conference on Virtual Reality and Visualization 2025 (ICVRV 2025)
Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rat...
Jingyuan Xing, Mingru Yang, Zhipeng Li ... · arXiv
Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) ...
Cheng Gong, Chunyu Qiang, Tianrui Wang ... · arXiv
Cross-lingual emotional text-to-speech (TTS) aims to produce speech in one language that captures the emotion of a speaker from another language while maintaining the target voice's timbre. This process of cross-lingual emotional speech synthesis presents a complex challenge, nec...
Jiliang Hu, Wenfu Wang, Zuchao Li ... · arXiv
Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multi...
Sunday, October 12, 2025
Wenxiang Guo, Changhao Pan, Zhiyuan Zhu ... · arXiv
Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide...
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery · arXiv
The Persian language, despite being spoken by over 100 million people worldwide, remains severely underrepresented in high-quality speech corpora, particularly for text-to-speech (TTS) synthesis applications. Existing Persian speech datasets are typically smaller than their English c...
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery · arXiv
Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-spe...
Ling Sun, Charlotte Zhu, Shuju Shi · arXiv
General-purpose ASR underperforms for atypical speakers, such as L2 learners, reinforcing bias and limiting use in education and accessibility. Using the CEFR-graded Speak and Improve corpus, we show that naive fine-tuning of Whisper reduces average WER but simultaneously widens ...
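This entry reports changes in word error rate (WER) after fine-tuning. As a self-contained reminder of the standard WER definition (independent of the paper's setup), here is a minimal word-level edit-distance implementation.

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / number of reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # Dynamic-programming edit distance between the two word sequences.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2/6 ~ 0.333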
Zihan Zhang, Xize Cheng, Zhennan Jiang ... · arXiv
Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MA...
Ummy Maria Muna, Md Mehedi Hasan Shawon, Md Jobayer ... · arXiv
The automated analysis of phonocardiograms is vital for the early diagnosis of cardiovascular disease, yet supervised deep learning is often constrained by the scarcity of expert-annotated data. In this paper, we propose the Self-Supervised Dual-Path Prototypical Network (SS-DPPN...