Audio ML Papers

Week of September 14 - September 21, 2025

Subcategories: All (56) | Speech Synthesis (10) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Assessment (3) | Enhancement (6) | ASR (6) | Other (27)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 84)
Dhruuv Agarwal, Harry Zhang, Yang Yu ... · arXiv
Personalizing Automatic Speech Recognition (ASR) for dysarthric speech is crucial but challenging due to the cost of training and storing individual user adapters. We propose a hybrid meta-training method for a single model, excelling in zero-shot and few-shot on-the-fly personalization v...
#2 TOP PAPER (Score: 83)
Vishnu Raja, Adithya V Ganesan, Anand Syamkumar ... · Will appear in EMNLP 2025 Main Proceedings
State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling str...
#3 TOP PAPER (Score: 83)
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze · arXiv
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does ...
Saturday, September 20, 2025
Vishnu Raja, Adithya V Ganesan, Anand Syamkumar ... · Will appear in EMNLP 2025 Main Proceedings
State-of-the-art automatic speech recognition (ASR) models like Whisper perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling str...
Maurício do V. M. da Costa, Eloi Moliner · accepted at IEEE International Symposium on the Internet of Sounds
This paper introduces MR-CQTdiff, a novel neural-network architecture for diffusion-based audio generation that leverages a multi-resolution Constant-Q Transform (CQT). The proposed architecture employs an efficient, invertible CQT framework that adjusts the time-frequency re...
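As a rough illustration of a multi-resolution CQT front end, the same signal can be analyzed at several time-frequency trade-offs, sketched below with librosa; the specific resolutions, test signal, and settings are arbitrary choices for illustration, not the paper's invertible CQT framework.

```python
import numpy as np
import librosa

# Rough illustration of a multi-resolution CQT front end (not the paper's invertible
# framework): the same signal analyzed at several time-frequency trade-offs.
sr = 22050
y = librosa.chirp(fmin=65.0, fmax=8000.0, sr=sr, duration=3.0)   # synthetic test sweep
resolutions = [(12, 512), (24, 256), (36, 128)]                  # (bins_per_octave, hop_length)
for bpo, hop in resolutions:
    cqt = np.abs(librosa.cqt(y, sr=sr, fmin=librosa.note_to_hz("C2"),
                             hop_length=hop, bins_per_octave=bpo, n_bins=6 * bpo))
    print(f"bins/octave={bpo:2d} hop={hop:3d} -> CQT shape {cqt.shape}")
```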
Tse-Yang Chen, Yuh-Jzer Joung · arXiv
Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware me...
Friday, September 19, 2025
Dhruuv Agarwal, Harry Zhang, Yang Yu ... · arXiv
Personalizing Automatic Speech Recognition (ASR) for dysarthric speech is crucial but challenging due to the cost of training and storing individual user adapters. We propose a hybrid meta-training method for a single model, excelling in zero-shot and few-shot on-the-fly personalization v...
Kaspar Müller, Markus Buck, Simon Doclo ... · Accepted for publication in IEEE Transactions on Audio, Speech and Language Processing
The steered response power (SRP) method is one of the most popular approaches for acoustic source localization with microphone arrays. It is often based on simplifying acoustic assumptions, such as an omnidirectional sound source in the far field of the microphone array(s), free ...
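For context, the conventional far-field SRP-PHAT scan that those simplifying assumptions refer to can be sketched in a few lines of NumPy; the microphone geometry, candidate-direction grid, and sign conventions here are illustrative assumptions rather than the paper's generalized formulation.

```python
import numpy as np

def gcc_phat(x, y, n_fft):
    """GCC-PHAT cross-correlation between two microphone signals."""
    X, Y = np.fft.rfft(x, n_fft), np.fft.rfft(y, n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12                    # PHAT weighting: keep phase only
    return np.fft.irfft(cross, n_fft)

def srp_phat(frames, mic_pos, directions, fs, c=343.0):
    """Textbook far-field SRP-PHAT over candidate directions (a simplified sketch).
    frames: (n_mics, n_samples); mic_pos: (n_mics, 3); directions: (n_dirs, 3) unit vectors."""
    n_mics, n_samples = frames.shape
    n_fft = 2 * n_samples
    power = np.zeros(len(directions))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(frames[i], frames[j], n_fft)
            # far-field (plane-wave) TDOA for this mic pair under each candidate direction
            tdoa = (mic_pos[j] - mic_pos[i]) @ directions.T / c
            lags = np.round(tdoa * fs).astype(int) % n_fft
            power += cc[lags]                         # steered response accumulated over pairs
    return directions[np.argmax(power)], power
```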
Ziqi Dai, Yiting Chen, Jiacheng Xu ... · arXiv
The pipeline for multi-participant audiobook production primarily consists of three stages: script analysis, character voice timbre selection, and speech synthesis. Among these, script analysis can be automated with high accuracy using NLP models, whereas character voice timbre s...
Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi ... · arXiv
Deep learning-based Sound Event Localization and Detection (SELD) systems degrade significantly on real-world, long-tailed datasets. Standard regression losses bias learning toward frequent classes, causing rare events to be systematically under-recognized. To address this challe...
Qiaolin Wang, Xilin Jiang, Linyang He ... · arXiv
While large audio-language models (LALMs) have demonstrated state-of-the-art audio understanding, their reasoning capability in complex soundscapes still falls behind large vision-language models (LVLMs). Compared to the visual domain, one bottleneck is the lack of large-scale ch...
Yongsheng Feng, Yuetonghui Xu, Jiehui Luo ... · arXiv
Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and ...
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze · arXiv
We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does ...
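Schematically, a streaming loop with monotonic alignment and a bounded look-ahead can be written as below; the generate_tokens/decode interface is a hypothetical stand-in, not VoXtream's actual API.

```python
def stream_tts(phonemes, model, max_lookahead=2):
    """Schematic streaming loop with monotonic alignment and a bounded look-ahead
    (the generate_tokens/decode interface is hypothetical): audio for phoneme i is
    produced from at most i + max_lookahead phonemes, so playback can start at the
    first word instead of waiting for the full utterance."""
    for i in range(len(phonemes)):
        visible = phonemes[: min(i + 1 + max_lookahead, len(phonemes))]  # dynamic look-ahead
        tokens = model.generate_tokens(visible, position=i)              # hypothetical call
        yield model.decode(tokens)                                       # emit audio immediately
```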
Dohwan Kim, Jung-Woo Choi · arXiv
In speech enhancement, knowledge distillation (KD) compresses models by transferring a high-capacity teacher's knowledge to a compact student. However, conventional KD methods train the student to mimic the teacher's output entirely, which forces the student to imitate the region...
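For reference, the conventional output-mimicking KD objective that this work departs from can be sketched as follows (PyTorch); the L1 losses and the alpha weighting are illustrative assumptions, not the paper's formulation.

```python
import torch.nn.functional as F

def conventional_kd_loss(student_out, teacher_out, clean, alpha=0.5):
    """Conventional output-mimicking KD for speech enhancement (the baseline this
    entry argues against, sketched with assumed L1 losses and weighting): the student
    matches the clean target and imitates the teacher's output over all regions."""
    supervised = F.l1_loss(student_out, clean)           # fit the ground-truth clean speech
    distillation = F.l1_loss(student_out, teacher_out)   # mimic the teacher everywhere
    return (1.0 - alpha) * supervised + alpha * distillation
```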
Qi Wang, Shituo Ma, Guoxin Yu ... · arXiv
Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high c...
Luca Della Libera, Cem Subakan, Mirco Ravanelli · arXiv
Neural audio codecs are a fundamental component of modern generative audio pipelines. Although recent codecs achieve strong low-bitrate reconstruction and provide powerful representations for downstream tasks, most are non-streamable, limiting their use in real-time applications....
Younghoo Kwon, Jung-Woo Choi · arXiv
The spatial semantic segmentation task focuses on separating and classifying sound objects from multichannel signals. To achieve these two goals, conventional methods fine-tune a large classification model cascaded with the separation model and inject classified labels as sep...
Pengcheng Li, Botao Zhao, Zuheng Kang ... · Accepted by the Findings of 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings 2025)
Although Large Audio-Language Models (LALMs) have exhibited outstanding performance in auditory understanding, their performance in affective computing scenarios, particularly in emotion recognition, reasoning, and subtle sentiment differentiation, remains suboptimal. Recent adva...
Yiru Zhang, Hang Su, Lichun Fan ... · arXiv
Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail party scenarios. Recent advances in Large Audio-Language Models (LALMs) have already brought some new insights to TS-ASR. How...
Mohd Mujtaba Akhtar, Girish, Orchid Chetia Phukan ... · Accepted to APSIPA-ASC 2025
In this work, we investigate multimodal foundation models (MFMs) for EmoFake detection (EFD) and hypothesize that they will outperform audio foundation models (AFMs). MFMs, due to their cross-modal pre-training, learn emotional patterns from multiple modalities, while AFMs rely o...
Gang Yang, Yue Lei, Wenxin Tai ... · arXiv
Diffusion and flow matching (FM) models have achieved remarkable progress in speech enhancement (SE), yet their dependence on multi-step generation is computationally expensive and vulnerable to discretization errors. Recent advances in one-step generative modeling, particularly ...
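As background for the multi-step baseline being discussed, a minimal conditional flow-matching training objective for speech enhancement might look like the sketch below; the model interface and the straight-line interpolation path are assumptions for illustration, not this paper's method.

```python
import torch

def flow_matching_loss(model, clean, noisy):
    """Generic conditional flow-matching objective (a sketch with an assumed model
    interface): regress the instantaneous velocity along a straight path from noise
    to clean speech, conditioned on the noisy observation."""
    b = clean.shape[0]
    t = torch.rand(b, *([1] * (clean.dim() - 1)), device=clean.device)  # random time in [0, 1]
    noise = torch.randn_like(clean)
    x_t = (1.0 - t) * noise + t * clean          # linear interpolation path
    target_velocity = clean - noise              # d x_t / d t along that path
    pred = model(x_t, t.view(b), noisy)          # hypothetical: model(state, time, condition)
    return torch.mean((pred - target_velocity) ** 2)
```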
Thursday, September 18, 2025
Michael Tatarjitzky, Boaz Rafaely · arXiv
Multichannel speech enhancement leverages spatial cues to improve intelligibility and quality, but most learning-based methods rely on specific microphone array geometry, unable to account for geometry changes. To mitigate this limitation, current array-agnostic approaches employ...
Kangdi Wang, Zhiyue Wu, Dinghao Zhou ... · arXiv
Variational Autoencoders (VAEs) are essential for large-scale audio tasks like diffusion-based generation. However, existing open-source models often neglect auditory perceptual aspects during training, leading to weaknesses in phase accuracy and stereophonic spatial representati...
Ye-Xin Lu, Yu Gu, Kun Wei ... · arXiv
This paper presents DAIEN-TTS, a zero-shot text-to-speech (TTS) framework that enables ENvironment-aware synthesis through Disentangled Audio Infilling. By leveraging separate speaker and environment prompts, DAIEN-TTS allows independent control over the timbre and the background...
Kentaro Seki, Yuki Okamoto, Kouei Yamaoka ... · arXiv
Contrastive language-audio pretraining (CLAP) has achieved remarkable success as an audio-text embedding framework, but existing approaches are limited to monaural or single-source conditions and cannot fully capture spatial information. The central challenge in modeling spatia...
Ryan Collette, Ross Greenwood, Serena Nicoll · arXiv
While existing speech audio codecs designed for compression exploit limited forms of temporal redundancy and allow for multi-scale representations, they tend to represent all features of audio in the same way. In contrast, generative voice models designed for text-to-speech and v...
Daniyal Kabir Dar, Qiben Yan, Li Xiao ... · arXiv
Adversarial perturbations in speech pose a serious threat to automatic speech recognition (ASR) and speaker verification by introducing subtle waveform modifications that remain imperceptible to humans but can significantly alter system outputs. While targeted attacks on end-to-e...
Duojia Li, Shenghui Lu, Hongchen Pan ... · arXiv
Multistep inference is a bottleneck for real-time generative speech enhancement because flow- and diffusion-based systems learn an instantaneous velocity field and therefore rely on iterative ordinary differential equation (ODE) solvers. We introduce MeanFlowSE, a conditional gen...
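The multistep cost mentioned above comes from integrating the learned instantaneous velocity field with an ODE solver at inference time; the minimal Euler sampler below, written against a hypothetical model interface, makes that cost explicit as one network call per step.

```python
import torch

@torch.no_grad()
def euler_enhance(model, noisy, steps=16):
    """Multistep Euler integration of a learned instantaneous velocity field, i.e. the
    iterative ODE-solver inference that MeanFlowSE seeks to avoid (the model interface
    is a hypothetical stand-in)."""
    x = torch.randn_like(noisy)                  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):                       # one network call per step
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * model(x, t, noisy)
    return x
```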
Keyu An, Zhiyu Zhang, Changfeng Gao ... · arXiv
This work introduces MELA-TTS, a novel joint transformer-diffusion framework for end-to-end text-to-speech synthesis. By autoregressively generating continuous mel-spectrogram frames from linguistic and speaker conditions, our architecture eliminates the need for speech tokenizat...
Simon Welker, Tal Peer, Timo Gerkmann · arXiv
The task of Mel vocoding, i.e., the inversion of a Mel magnitude spectrogram to an audio waveform, is still a key component in many text-to-speech (TTS) systems today. Based on generative flow matching, our prior work on generative STFT phase retrieval (DiffPhase), and the pseudo...
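For reference, the classical non-generative route to Mel inversion (approximate Mel-to-STFT mapping followed by Griffin-Lim phase recovery) can be sketched with librosa; the synthetic test tone and STFT settings are assumptions for illustration, and this baseline is what generative vocoders like the one above aim to surpass.

```python
import numpy as np
import librosa

# Classical non-generative Mel-inversion baseline (not the paper's flow-matching vocoder):
# map the Mel spectrogram back to approximate STFT magnitudes, then Griffin-Lim for phase.
sr = 22050
t = np.arange(0, 2.0, 1.0 / sr)
y = 0.3 * np.sin(2 * np.pi * 220 * t) + 0.2 * np.sin(2 * np.pi * 440 * t)   # synthetic tone
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
```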
Mingchen Shao, Bingshen Mu, Chengyou Wang ... · arXiv
Speech large language models (SLLMs) built on speech encoders, adapters, and LLMs demonstrate remarkable multitask understanding performance in high-resource languages such as English and Chinese. However, their effectiveness substantially degrades in low-resource languages such ...
Yuanjian Chen, Yang Xiao, Jinjie Huang · arXiv
Multimodal acoustic event classification plays a key role in audio-visual systems. Although combining audio and visual signals improves recognition, it is still difficult to align them over time and to reduce the effect of noise across modalities. Existing methods often treat aud...
Kartik Hegde, Rehana Mahfuz, Yinyi Guo ... · arXiv
Current audio captioning systems rely heavily on supervised learning with paired audio-caption datasets, which are expensive to curate and may not reflect human preferences in real-world scenarios. To address this limitation, we propose a preference-aligned audio captioning frame...
Théo Charlot, Tarek Kunze, Maxime Poli ... · arXiv
Child-centered long-form recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, the first self-supervised speech representation mo...
Samuel J. Broughton, Lahiru Samarakoon · In Proc. Interspeech 2025 (pp. 5218-5222)
In this paper, we present state-of-the-art diarization error rates (DERs) on multiple publicly available datasets, including AliMeeting-far, AliMeeting-near, AMI-Mix, AMI-SDM, DIHARD III, and MagicData RAMC. Leveraging EEND-TA, a single unified non-autoregressive model for end-to...
Anton Selitskiy, Akib Shahriyar, Jishnuraj Prakasan · arXiv
In this paper, we show that discrete optimal transport (DOT) is an effective black-box adversarial attack against modern audio anti-spoofing countermeasures (CMs). Our attack operates as a post-processing, distribution-alignment step: frame-level WavLM embeddings of generated spe...
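The distribution-alignment step can be pictured with the simplified discrete optimal transport sketch below; SciPy's assignment solver stands in for the full DOT machinery, and the function name and strength parameter are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def dot_align(spoof_emb, bonafide_emb, strength=1.0):
    """Simplified discrete-optimal-transport alignment of frame embeddings (an
    illustration of the general idea, not the paper's exact attack): pair each
    spoofed frame with a bonafide frame by optimal assignment, then shift it
    toward its match. Assumes len(spoof_emb) <= len(bonafide_emb)."""
    cost = cdist(spoof_emb, bonafide_emb)        # pairwise Euclidean transport costs
    rows, cols = linear_sum_assignment(cost)     # one-to-one transport plan
    aligned = spoof_emb.copy()
    aligned[rows] += strength * (bonafide_emb[cols] - spoof_emb[rows])
    return aligned
```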
Xiaolei Xu, Chaoyue Niu, Guy J. Brown ... · arXiv
Obstructive sleep apnoea (OSA) is a prevalent condition with significant health consequences, yet many patients remain undiagnosed due to the complexity and cost of over-night polysomnography. Acoustic-based screening provides a scalable alternative, yet performance is limited by...
Wednesday, September 17, 2025
Hui-Peng Du, Yang Ai, Zhen-Hua Ling · Accepted by APSIPA ASC 2025
The majority of mainstream neural vocoders primarily focus on speech quality and generation speed, while overlooking latency, which is a critical factor in real-time applications. Excessive latency leads to noticeable delays in user interaction, severely degrading the user experi...
Junan Zhang, Yunjia Zhang, Xueyao Zhang ... · arXiv
Singing Accompaniment Generation (SAG) is the process of generating instrumental music for a given clean vocal input. However, existing SAG techniques use source-separated vocals as input and overfit to separation artifacts. This creates a critical train-test mismatch, leading to...
Eric Zhang, Li Wei, Sarah Chen ... · arXiv
Stuttered and dysfluent speech detection systems have traditionally suffered from the trade-off between accuracy and clinical interpretability. While end-to-end deep learning models achieve high performance, their black-box nature limits clinical adoption. This paper looks at the...
Seungmin Seo, Oleg Aulov, P. Jonathon Phillips · arXiv
We use the term re-identification to refer to the process of recovering the original speaker's identity from anonymized speech outputs. Speaker de-identification systems aim to reduce the risk of re-identification, but most evaluations focus only on individual-level measures and ...
Fei Liu, Yang Ai, Zhen-Hua Ling · Accepted by APSIPA2025
This paper proposes APSS, a novel neural speech separation model with parallel amplitude and phase spectrum estimation. Unlike most existing speech separation methods, APSS explicitly estimates the phase spectrum for more complete and accurate separat...
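Once amplitude and phase spectra have been estimated in parallel, reconstruction amounts to recombining them into a complex spectrogram and inverting it; the short sketch below shows that final step only, with the estimation networks omitted and the hop length assumed.

```python
import numpy as np
import librosa

def combine_amp_phase(amplitude, phase, hop_length=256):
    """Recombine separately estimated amplitude and phase spectra into a waveform
    (a generic reconstruction step; the parallel estimation networks and the assumed
    hop length are illustrative, not the paper's exact setup)."""
    complex_stft = amplitude * np.exp(1j * phase)   # complex spectrogram from both branches
    return librosa.istft(complex_stft, hop_length=hop_length)
```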
Younghoo Kwon, Dongheon Lee, Dohwan Kim ... · 5 pages, 2 figures, submitted to DCASE workshop 2025
This paper introduces a multi-stage self-directed framework designed to address the spatial semantic segmentation of sound scene (S5) task in the DCASE 2025 Task 4 challenge. This framework integrates models focused on three distinct tasks: Universal Sound Separation (USS), Singl...
Justin Lovelace, Rithesh Kumar, Jiaqi Su ... · arXiv
While generative Text-to-Speech (TTS) systems leverage vast "in-the-wild" data to achieve remarkable success, speech-to-speech processing tasks like enhancement face data limitations, which lead data-hungry generative approaches to distort speech content and speaker identity. To...
Xikun Lu, Fang Liu, Weizhi Shi ... · arXiv
High-fidelity binaural audio synthesis is crucial for immersive listening, but existing methods require extensive computational resources, limiting their edge-device application. To address this, we propose the Lightweight Implicit Neural Network (LINN), a novel two-stage framewo...
Shun Huang, Zhihua Fang, Liang He · Accepted ICASSP 2025
Unsupervised anomalous sound detection aims to detect unknown anomalous sounds by training a model using only normal audio data. Despite advancements in self-supervised methods, the issue of frequent false alarms when handling samples of the same type from different machines rema...
Jungwoo Heo, Hyun-seo Shin, Chan-yeong Lim ... · 8 pages, 5 figures, accepted at IEEE ASRU 2025
Self-supervised learning (SSL) has pushed speaker verification accuracy close to state-of-the-art levels, but the Transformer backbones used in most SSL encoders hinder on-device and real-time deployment. Prior compression work trims layer depth or width yet still inherits the qu...
Janne Laakkonen, Ivan Kukanov, Ville Hautamäki · arXiv
Foundation models such as Wav2Vec2 excel at representation learning in speech tasks, including audio deepfake detection. However, after being fine-tuned on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in ...
Tuesday, September 16, 2025
Jingyu Li, Guangyan Zhang, Zhen Ye ... · arXiv
Audio codecs are a critical component of modern speech generation systems. This paper introduces a low-bitrate, multi-scale residual codec that encodes speech into four distinct streams: semantic, timbre, prosody, and residual. This architecture achieves high-fidelity speech reco...
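One way to picture a multi-stream residual codec is as a cascade in which each stream quantizes whatever the previous streams left unexplained; the toy sketch below illustrates that residual structure, with nearest-neighbour codebook lookup as an assumption rather than the paper's exact quantizer.

```python
import torch

def residual_multi_stream_encode(latent, codebooks):
    """Residual coding across ordered streams (a conceptual sketch, not this codec's
    exact design): each stream quantizes the residual left by the previous streams.
    latent: (frames, dim); codebooks: list of (codebook_size, dim) tensors,
    e.g. semantic, timbre, prosody, residual."""
    residual = latent
    codes = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)          # (frames, codebook_size) distances
        idx = dists.argmin(dim=-1)                 # nearest-neighbour code per frame
        codes.append(idx)
        residual = residual - cb[idx]              # pass the leftover to the next stream
    return codes
```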
Yudong Yang, Xiaokang Liu, Shaofeng Zhao ... · arXiv
Speech therapy plays a critical role in training speech disorders caused by neurological impairments such as stroke. However, traditional manual and computer-assisted systems are limited in real-time accessibility and articulatory motion feedback, constraining their practical uti...
Arnab Kumar Roy, Hemant Kumar Kathania, Paban Sapkota · arXiv
Dysarthric speech severity classification is crucial for objective clinical assessment and progress monitoring in individuals with motor speech disorders. Although prior methods have addressed this task, achieving robust generalization in speaker-independent (SID) scenarios remai...
Yujie Guo, Jiaming Zhou, Yuhang Jia ... · arXiv
End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech, especially under high-overlap conditions. To address these challenges, we propose Global-Local Aware Dynamic (GLAD) Mixture-of-Experts, which ...
Han Yin, Jung-Woo Choi · arXiv
Recently, Large Audio Language Models (LALMs) have progressed rapidly, demonstrating their strong efficacy in universal audio understanding through cross-modal integration. To evaluate LALMs' audio understanding performance, researchers have proposed different benchmarks. However...
Zhan Jin, Bang Zeng, Peijun Yang ... · arXiv
Target Speaker Extraction (TSE) is a critical challenge in cocktail party scenarios. While leveraging multiple modalities, such as voice, lip, face, and expression embeddings, can enhance performance, real-world applications often suffer from intermittent modality dropout. This p...
Monday, September 15, 2025
Wen-Yung Wu, Pei-Chin Hsieh, Tai-Shih Chi · arXiv
Voice activity detection (VAD) is essential in speech-based systems, but traditional methods detect only speech presence without identifying speakers. Target-speaker VAD (TS-VAD) extends this by detecting the speech of a known speaker using a short enrollment utterance, but this ...
Milan Marocchi, Matthew Fynn, Kayapanda Mandana ... · arXiv
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, accounting for approximately 17.9 million deaths each year. Early detection is critical, creating a demand for accurate and inexpensive pre-screening methods. Deep learning has recently been applied to class...
Adhiraj Banerjee, Vipul Arora · arXiv
Text-guided sound separation supports flexible audio editing across media and assistive applications, but existing models like AudioSep are too compute-heavy for edge deployment. Neural audio codec (NAC) models such as CodecFormer and SDCodec are compute-efficient but limited to ...
Sunday, September 14, 2025
Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin ... · arXiv
Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic repre...
Emmanouil Karystinaios · Accepted at Large Language Models for Music & Audio Workshop (LLM4MA) 2025
Agentic AI has been standardized in industry as a practical paradigm for coordinating specialized models and tools to solve complex multimodal tasks. In this work, we present WeaveMuse, a multi-agent system for music understanding, symbolic composition, and audio synthesis. Each ...