Audio ML Papers

Week of October 19 - October 26, 2025

Subcategories: All (40) | Speech Synthesis (6) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Assessment (4) | Enhancement (3) | ASR (3) | Other (20)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 87)
Xusheng Yang, Long Zhou, Wenfu Wang ... · arXiv
We propose U-Codec, an Ultra low frame-rate neural speech Codec that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame rate of 5 Hz (5 frames per second). Extreme compression at 5 Hz typically leads to severe intel...
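For a sense of scale, here is a hedged back-of-envelope sketch of what 5 Hz buys an autoregressive decoder; the RVQ depth and codebook size are illustrative assumptions, not U-Codec's published configuration:

```python
frame_rate_hz = 5      # frames per second (from the abstract)
rvq_depth = 8          # assumed residual quantizer layers (hypothetical)
codebook_bits = 10     # assumed 1024-entry codebooks (hypothetical)

tokens_per_second = frame_rate_hz * rvq_depth
bitrate_bps = tokens_per_second * codebook_bits
print(f"{tokens_per_second} tokens/s, {bitrate_bps} bit/s")
# -> 40 tokens/s, 400 bit/s: 10x fewer decoding steps per second
#    than a 50 Hz codec with the same quantizer depth.
```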
#2 TOP PAPER (Score: 84)
Massa Baali, Rita Singh, Bhiksha Raj · arXiv
Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised fo...
#3 TOP PAPER (Score: 84)
Yu-Wen Chen, William Ho, Sasha M. Vergez ... · The Second Workshop on GenAI for Health at NeurIPS 2025
The growing demand for home healthcare calls for tools that can support care delivery. In this study, we explore automatic health assessment from voice using real-world home care visit data, leveraging the diverse patient information it contains. First, we utilize Large Language ...
Saturday, October 25, 2025
Heejoon Koo, Miika Toikkanen, Yoon Tae Kim ... · arXiv
Multimodal respiratory sound classification offers promise for early pulmonary disease detection by integrating bioacoustic signals with patient metadata. Nevertheless, current approaches remain vulnerable to spurious correlations from attributes such as age, sex, or acquisition ...
Sapir Goldring, Zamir Ben Hur, David Lou Alon ... · arXiv
This paper investigates the performance of Binaural Signal Matching (BSM) methods for near-field sound reproduction using a wearable glasses-mounted microphone array. BSM is a flexible, signal-independent approach for binaural rendering with arbitrary arrays, but its conventional...
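As background, BSM filters are classically obtained as a regularized least-squares fit of the array output to left/right HRTF targets. A minimal per-frequency sketch follows; shapes and the regularization value are illustrative, and the paper's near-field refinements are not reproduced:

```python
import numpy as np

def bsm_filters(A, h_left, h_right, lam=1e-3):
    """A: (n_mics, n_dirs) array transfer functions at one frequency bin.
    h_left, h_right: (n_dirs,) HRTFs for the same directions.
    Solves min_w ||A^H w - h||^2 + lam * ||w||^2 for each ear."""
    G = A @ A.conj().T + lam * np.eye(A.shape[0])  # (n_mics, n_mics)
    w_left = np.linalg.solve(G, A @ h_left)
    w_right = np.linalg.solve(G, A @ h_right)
    return w_left, w_right
```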
Ali Vosoughi, Yongyi Zang, Qihui Yang ... · arXiv
Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically ...
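The paper's generative model is not reproduced here, but a standard way to sanity-check any generated RIR is the Schroeder backward integral; a minimal sketch, assuming a clean, full-length response:

```python
import numpy as np

def rt60_schroeder(rir, fs):
    """Estimate RT60 from a room impulse response via T20 extrapolation."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]          # backward integration
    edc_db = 10 * np.log10(energy / energy[0] + 1e-12)
    # Fit the -5 dB .. -25 dB decay range and extrapolate to -60 dB.
    i5 = np.argmax(edc_db <= -5)
    i25 = np.argmax(edc_db <= -25)
    t = np.arange(len(rir)) / fs
    slope, _ = np.polyfit(t[i5:i25], edc_db[i5:i25], 1)  # dB per second
    return -60.0 / slope
```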
Krishna Gurugubelli · arXiv
Automated dysarthria detection and severity assessment from speech have attracted significant research attention due to their potential clinical impact. Despite rapid progress in acoustic modeling and deep learning, models still fall short of human expert performance. This manusc...
Friday, October 24, 2025
Arshdeep Singh, Vinayak Abrol, Mark D. Plumbley · arXiv
Conventional Convolutional Neural Networks (CNNs) in the real domain have been widely used for audio classification. However, their convolution operations process multi-channel inputs independently, limiting the ability to capture correlations among channels. This can lead to sub...
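For readers unfamiliar with the approach: a complex-valued convolution mixes real and imaginary channels via (a+ib)(c+id) instead of treating them as independent real planes. A minimal PyTorch sketch, with placeholder layer sizes rather than the paper's architecture:

```python
import torch
import torch.nn as nn

class ComplexConv1d(nn.Module):
    """Complex convolution built from two real convolutions."""
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_r = nn.Conv1d(in_ch, out_ch, kernel_size, **kw)
        self.conv_i = nn.Conv1d(in_ch, out_ch, kernel_size, **kw)

    def forward(self, x_r, x_i):
        # (a + ib) * (c + id) = (ac - bd) + i(ad + bc)
        y_r = self.conv_r(x_r) - self.conv_i(x_i)
        y_i = self.conv_i(x_r) + self.conv_r(x_i)
        return y_r, y_i
```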
Shivam Saini, Jürgen Peissig · arXiv
We introduce HiFi-HARP, a large-scale dataset of 7th-order Higher-Order Ambisonic Room Impulse Responses (HOA-RIRs) consisting of more than 100,000 RIRs generated via a hybrid acoustic simulation in realistic indoor scenes. HiFi-HARP combines geometrically complex, furnished room...
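A quick reminder of why 7th-order material is bulky: an order-N Ambisonic signal carries (N+1)^2 channels per source-receiver pair.

```python
for order in (1, 3, 7):
    print(f"order {order}: {(order + 1) ** 2} channels")
# order 1: 4 channels, order 3: 16 channels, order 7: 64 channels per RIR
```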
Yongyi Zang, Chris Manchester, David Young ... · arXiv
Vocal recordings on consumer devices commonly suffer from multiple concurrent degradations: noise, reverberation, band-limiting, and clipping. We present Smule Renaissance Small (SRS), a compact single-stage model that performs end-to-end vocal restoration directly in the complex...
Qihui Yang, Randal Leistikow, Yongyi Zang · arXiv
Virtual instrument generation requires maintaining consistent timbre across different pitches and velocities, a challenge that existing note-level models struggle to address. We present FlowSynth, which combines distributional flow matching (DFM) with test-time optimization for h...
Zixiang Wan, Haoran Zhao, Guochang Zhang ... · arXiv
This paper presents PhoenixCodec, a comprehensive neural speech coding and decoding framework designed for extremely low-resource conditions. The proposed system integrates an optimized asymmetric frequency-time architecture, a Cyclical Calibration and Refinement (CCR) training s...
Jingyue Huang, Qihui Yang, Fei Yueh Chen ... · arXiv
Existing pitch curve generators face two main challenges: first, they often neglect singer-specific expressiveness, reducing their ability to capture individual singing styles; second, they are typically developed as auxiliary modules for specific tasks such as pitch correction, singing voi...
Zixiang Wan, Guochang Zhang, Yifeng He ... · Interspeech 2025
Neural Audio Codecs (NACs) have gained growing attention in recent years as technologies for audio compression and audio representation in speech language models. While mainstream NACs typically require G-level computation and M-level parameters, the performance of lightweight an...
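On the "M-level parameters" axis, parameter counts at least are easy to check; a one-liner for any PyTorch module (FLOP/MAC counts additionally require a profiler, which is not sketched here):

```python
import torch.nn as nn

def param_millions(model: nn.Module) -> float:
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```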
Yihan Wu, Georgios Milis, Ruibo Chen ... · arXiv
The rapid advancement of next-token-prediction models has led to widespread adoption across modalities, enabling the creation of realistic synthetic media. In the audio domain, while autoregressive speech models have propelled conversational interactions forward, the potential fo...
Thursday, October 23, 2025
Zhiyu Lin, Jingwen Yang, Jiale Zhao ... · arXiv
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition are costl...
Ari Frummer, Helin Wang, Tianyu Cao ... · arXiv
Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, the evaluation metrics for speech separation rely on the matched reference audios and corresponding transcriptions to assess audio qu...
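For reference, the matched-reference audio metric the abstract alludes to is typically SI-SDR; a minimal NumPy sketch over hypothetical 1-D signals:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio, in dB."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference (scale-invariant target).
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10(np.dot(s_target, s_target)
                         / (np.dot(e_noise, e_noise) + eps))
```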
Zitong Lan, Yiduo Hao, Mingmin Zhao · arXiv
Achieving immersive auditory experiences in virtual environments requires flexible sound modeling that supports dynamic source positions. In this paper, we introduce a task called resounding, which aims to estimate room impulse responses at arbitrary emitter location from a spars...
Junjie Zheng, Gongyu Chen, Chaofan Ding ... · arXiv
In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference ...
Xin Zhang, Lin Li, Xiangni Lu ... · arXiv
Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. ...
Wednesday, October 22, 2025
Hyungjun Yoon, Seungjoo Lee, Yu Yvonne Wu ... · arXiv
Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i) insufficient data diversity, as most ExG recordings are collected in controll...
Tong Zhang, Yihuan Huang, Yanzhen Ren · arXiv
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail whe...
Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul ... · arXiv
Recent foundational models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio...
Vishaal Udandarao, Zhiyun Lu, Xuankai Chang ... · arXiv
Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablatio...
Tuesday, October 21, 2025
Bunlong Lay, Rostislav Makarov, Simon Welker ... · arXiv
Online Speech Enhancement has mainly been reserved for predictive models. A key advantage of these models is that for an incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative Speech Enhancement models often require multiple...
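A minimal sketch of that one-call-per-frame streaming pattern, with a placeholder identity model and illustrative frame/hop sizes (exact COLA normalization is glossed over):

```python
import numpy as np

frame, hop = 512, 256
window = np.hanning(frame)

def enhance_frame(x):                      # placeholder for a causal model
    return x

x = np.random.randn(16000)                 # stand-in input stream
buffer = np.zeros(frame)
ola = np.zeros(frame)
out = []
for i in range(0, len(x) - hop + 1, hop):
    buffer = np.concatenate([buffer[hop:], x[i:i + hop]])
    y = enhance_frame(window * buffer)     # exactly one model call per frame
    ola += y * window                      # overlap-add with synthesis window
    out.append(ola[:hop].copy())           # emit with one-hop latency
    ola = np.concatenate([ola[hop:], np.zeros(hop)])
y_hat = np.concatenate(out)
```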
Zhanhong He, Hanyu Meng, David Huang ... · arXiv
Estimating piano dynamic from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation. Thes...
Haowei Lou, Hye-Young Paik, Wen Hu ... · arXiv
Controlling speaking style in text-to-speech (TTS) systems has become a growing focus in both academia and industry. While many existing approaches rely on reference audio to guide style generation, such methods are often impractical due to privacy concerns and limited accessibil...
Qianheng Xu · arXiv
Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recogn...
Hanyu Meng, Vidhyasaharan Sethu, Eliathamby Ambikairajah ... · arXiv
In audio signal processing, learnable front-ends have shown strong performance across diverse tasks by optimizing task-specific representation. However, their parameters remain fixed once trained, lacking flexibility during inference and limiting robustness under dynamic complex ...
Bin Gu, Lipeng Dai, Huipeng Du ... · arXiv
Robust speaker verification under noisy conditions remains an open challenge. Conventional deep learning methods learn a robust unified speaker representation space against diverse background noise and achieve significant improvement. In contrast, this paper presents a noise-cond...
Monday, October 20, 2025
Massa Baali, Rita Singh, Bhiksha Raj · arXiv
Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised fo...
Yu-Wen Chen, William Ho, Sasha M. Vergez ... · The Second Workshop on GenAI for Health at NeurIPS 2025
The growing demand for home healthcare calls for tools that can support care delivery. In this study, we explore automatic health assessment from voice using real-world home care visit data, leveraging the diverse patient information it contains. First, we utilize Large Language ...
Kyung Yun Lee, Nils Meyer-Kahlen, Karolina Prawda ... · arXiv
We address the problem of estimating room impulse responses (RIRs) in noisy, uncontrolled environments where non-stationary sounds such as speech or footsteps corrupt conventional deconvolution. We propose AnyRIR, a non-intrusive method that uses music as the excitation signal in...
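AnyRIR itself is not reproduced here; for orientation, the classical frequency-domain baseline it departs from is regularized (Wiener-style) deconvolution of the recording by the known excitation:

```python
import numpy as np

def estimate_rir(y, x, n_fft, reg=1e-3):
    """Estimate an RIR from recording y and known excitation x.
    n_fft should be at least len(x) + expected RIR length."""
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(y, n_fft)
    H = Y * X.conj() / (np.abs(X) ** 2 + reg)  # regularized spectral division
    return np.fft.irfft(H, n_fft)
```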
Peihong Zhang, Yuxuan Liu, Rui Sang ... · arXiv
Acoustic scene classification (ASC) suffers from device-induced domain shift, especially when labels are limited. Prior work focuses on curriculum-based training schedules that structure data presentation by ordering or reweighting training examples from easy-to-hard to facilitat...
Stavros Mitsis, Ermos Hadjikyriakos, Humaid Ibrahim ... · arXiv
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where c...
Peihong Zhang, Zhixin Li, Yuxuan Liu ... · arXiv
Deep learning approaches for heart-sound (PCG) segmentation built on time-frequency features can be accurate but often rely on large expert-labeled datasets, limiting robustness and deployment. We present TopSeg, a topological representation-centric framework that encodes PCG dy...
Kosta Pavlović, Lazar Stanarević, Petar Nedić ... · arXiv
Prevailing practice in learning-based audio watermarking is to pursue robustness by expanding the set of simulated distortions during training. However, such surrogates are narrow and prone to overfitting. This paper presents AWARE (Audio Watermarking with Adversarial Resistance ...
Weilin Lin, Jianze Li, Hui Xiong ... · arXiv
Large Audio-Language Models (LALMs) are becoming essential as a powerful multimodal backbone for real-world applications. However, recent studies show that audio inputs can more easily elicit harmful responses than text, exposing new risks toward deployment. While safety alignmen...
Sunday, October 19, 2025
Xusheng Yang, Long Zhou, Wenfu Wang ... · arXiv
We propose U-Codec, an Ultra low frame-rate neural speech Codec that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame rate of 5 Hz (5 frames per second). Extreme compression at 5 Hz typically leads to severe intel...
Bo-Han Feng, Chien-Feng Liu, Yu-Hsuan Li Liang ... · arXiv
Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation ...
Wenxi Chen, Xinsheng Wang, Ruiqi Yan ... · arXiv
Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models (SLMs). However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both...
Tsun-An Hsieh, Sebastian Braun · arXiv
Generative models have shown robust performance on speech enhancement and restoration tasks, but most prior approaches operate offline with high latency, making them unsuitable for streaming applications. In this work, we investigate the feasibility of a low-latency, real-time ge...