Audio ML Papers

Week of January 25 - February 01, 2026

Subcategories: All (65) | Speech Synthesis (10) | Music Synthesis (3) | Ambient Synthesis (3) | Quality Assessment (3) | Enhancement (9) | ASR (13) | Other (24)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Artem Dementyev, Wazeer Zulfikar, Sinan Hersek ... · arXiv
Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder,...
#2 TOP PAPER (Score: 84)
Jingyao Wu, Grace Lin, Yinuo Song ... · ICASSP 2026
Emotion recognition is inherently ambiguous, with uncertainty arising both from rater disagreement and from discrepancies across modalities such as speech and text. There is growing interest in modeling rater ambiguity using label distributions. However, modality ambiguity remain...
#3 TOP PAPER (Score: 84)
Yiwen Shao, Yong Xu, Sanjeev Khudanpur ... · arXiv
Spatial information is a critical clue for multi-channel multi-speaker target speech recognition. Most state-of-the-art multi-channel Automatic Speech Recognition (ASR) systems extract spatial features only during the speech separation stage, followed by standard single-channel A...
Saturday, January 31, 2026
Ilyass Moummad, Marius Miron, Lukas Rauch ... · arXiv
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient ap...
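A common way to learn aligned audio-image embeddings from limited paired data is a CLIP-style symmetric contrastive objective. The sketch below is a generic PyTorch illustration of that idea under assumed encoder outputs and dimensions, not the authors' specific method.

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(audio_emb, image_emb, temperature=0.07):
        # audio_emb, image_emb: (batch, dim) outputs of separate audio/image encoders
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(image_emb, dim=-1)
        logits = a @ v.t() / temperature                      # pairwise cosine similarities
        targets = torch.arange(a.size(0), device=a.device)    # i-th audio pairs with i-th image
        # symmetric InfoNCE: match each audio clip to its image and each image to its audio clip
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # toy usage with random stand-in embeddings
    loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))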
Ke Xue, Rongfei Fan, Kai Li ... · arXiv
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in ine...
Yong Ren, Jiangyan Yi, Jianhua Tao ... · arXiv
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglem...
Hao Ma, Ruihao Jing, Shansong Liu ... · arXiv
High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform r...
Xinting Liao, Ruinan Jin, Hanlin Yu ... · arXiv
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter nois...
Junmin Gong, Yulin Song, Wenxiao Zhao ... · arXiv
We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- ...
Friday, January 30, 2026
Kai Li, Jintao Cheng, Chang Zeng ... · arXiv
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation...
Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda ... · IEEE ICASSP 2026
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning...
Genshun Wan, Wenhui Zhang, Jing-Xuan Zhang ... · ICASSP 2026
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that inte...
Xiaoxuan Guo, Yuankun Xie, Haonan Cheng ... · arXiv
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantic...
Li Zhou, Hao Jiang, Junjie Li ... · ICASSP 2026
Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion em...
Yong Ren, Jingbei Li, Haiyang Sun ... · arXiv
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with ...
Seungu Han, Sungho Lee, Kyogu Lee · ICASSP 2026
Recent speech enhancement (SE) models increasingly leverage self-supervised learning (SSL) representations for their rich semantic information. Typically, intermediate features are aggregated into a single representation via a lightweight adaptation module. However, most SSL mode...
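Aggregating intermediate SSL features into a single representation is commonly done with a learned softmax-weighted sum over layers. The following minimal PyTorch sketch shows that generic pattern only; it is not necessarily the adaptation module used in this paper.

    import torch
    import torch.nn as nn

    class WeightedLayerSum(nn.Module):
        # collapses per-layer SSL features (layers, batch, time, dim) into one representation
        def __init__(self, num_layers):
            super().__init__()
            self.layer_logits = nn.Parameter(torch.zeros(num_layers))

        def forward(self, layer_feats):
            w = torch.softmax(self.layer_logits, dim=0)        # one learned weight per layer
            return torch.einsum("l,lbtd->btd", w, layer_feats)

    # toy usage: 13 transformer layers, batch 2, 50 frames, 768-dim features
    agg = WeightedLayerSum(13)
    out = agg(torch.randn(13, 2, 50, 768))                     # -> (2, 50, 768)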
Jiaming Zhou, Xuxin Cheng, Shiwan Zhao ... · arXiv
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion ...
Mikko Heikkinen, Archontis Politis, Konstantinos Drossos ... · ICASSP 2026
We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous met...
Binh Thien Nguyen, Masahiro Yasuda, Daiki Takeuchi ... · ICASSP 2026
To advance immersive communication, the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge recently introduced Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). An S5 system takes a multi-channel audio mixture as input and outputs single...
Thursday, January 29, 2026
Yuchen Mao, Wen Huang, Yanmin Qian · arXiv
Localizing partial deepfake audio, where only segments of speech are manipulated, remains challenging due to the subtle and scattered nature of these modifications. Existing approaches typically rely on frame-level predictions to identify spoofed segments, and some recent methods...
Bing Han, Chushu Zhou, Yifan Yang ... · arXiv
Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex a...
Zheqi Dai, Guangyan Zhang, Haolin He ... · arXiv
In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder...
Cheol Jun Cho, Nicholas Lee, Alan W Black ... · arXiv
Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. ...
Aref Farhadipour, Jan Marquenie, Srikanth Madikeri ... · arXiv
The performance of speaker verification systems degrades significantly under language mismatch, a critical challenge exacerbated by the field's reliance on English-centric data. To address this, we propose the TidyVoice Challenge for cross-lingual speaker verification. The challe...
Xiuwen Zheng, Sixun Dong, Bornali Phukon ... · ICASSP 2026
While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real-world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause se...
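The WER-versus-meaning mismatch is easy to see with a toy example: a hypothesis can have a nonzero WER while still conveying the speaker's intent. Below is a small pure-Python sketch using the standard edit-distance definition of WER; it is only an illustration of the general point, not the semantic metric proposed in the paper.

    def wer(ref, hyp):
        # word error rate via Levenshtein distance over word sequences
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[len(r)][len(h)] / max(len(r), 1)

    ref = "i want to book a doctor appointment"
    hyp = "i wanna book a doctors appointment"
    print(wer(ref, hyp))  # nonzero WER even though the request's meaning is intact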
Jun Xue, Yi Chai, Yanzhen Ren ... · arXiv
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefo...
Tom Gajecki, Jonas Althoff, Waldo Nogueira · arXiv
We propose a brain-informed speech separation method for cochlear implants (CIs) that uses electroencephalography (EEG)-derived attention cues to guide enhancement toward the attended speaker. An attention-guided network fuses audio mixtures with EEG features through a lightweigh...
Yihui Fu, Tim Fingscheidt · IEEE ICASSP 2026
Diffusion speech enhancement on discrete audio codec features has gained immense attention due to its improved speech component reconstruction capability. However, such methods usually suffer from high inference computational complexity due to multiple reverse process iterations. Furthermore...
Wednesday, January 28, 2026
Artem Dementyev, Wazeer Zulfikar, Sinan Hersek ... · arXiv
Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder,...
Annie Chu, Hugo Flores García, Oriol Nieto ... · ICASSP 2026
We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By fine-tuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integr...
Yigitcan Özer, Wanying Ge, Zhe Zhang ... · IEICE, SP/SLP 2026
Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding cap...
Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis ... · ICASSP 2026
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech a...
Robin Singh, Aditya Yogesh Nair, Fabio Palumbo ... · arXiv
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models--Dia2, Maya1, and MeloTTS--represent...
Myungjin Lee, Eunji Shin, Jiyoung Lee · arXiv
Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious crime risks, as they can synthesize voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upo...
Xiangbo Wang, Wenbin Jiang, Jin Wang ... · ICASSP 2026
Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex...
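Residual vector quantization applies a cascade of codebooks to successive residuals, so a fixed number of codebooks per frame spends the same bit budget on simple and complex content alike. The numpy sketch below shows plain RVQ with random codebooks, purely as an illustration of the mechanism the paper builds on, not its adaptive scheme.

    import numpy as np

    def rvq_encode(frame, codebooks):
        # quantize one feature frame with a cascade of codebooks; return indices and reconstruction
        recon = np.zeros_like(frame)
        residual = frame.copy()
        indices = []
        for cb in codebooks:                                   # cb: (codebook_size, dim)
            idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
            indices.append(idx)
            recon += cb[idx]
            residual = frame - recon                           # quantize what is still missing
        return indices, recon

    rng = np.random.default_rng(0)
    codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]  # 4 codebooks per frame
    idx, rec = rvq_encode(rng.normal(size=64), codebooks)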
Ryan Whetten, Titouan Parcollet, Marco Dinarelli ... · ICASSP 2026
Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine h...
Tuesday, January 27, 2026
Haohan Shi, Xiyu Shi, Safak Dogan ... · ICASSP 2026
This paper focuses on audio deepfake detection under real-world communication degradations, with an emphasis on ultra-short inputs (0.5-2.0s), targeting the capability to detect synthetic speech at a conversation opening, e.g., when a scammer says "Hi." We propose Short-MGAA (S-M...
Zexu Pan, Xinyuan Qian, Shengkui Zhao ... · arXiv
Most audio-visual speaker extraction methods rely on synchronized lip recording to isolate the speech of a target speaker from a multi-talker mixture. However, in natural human communication, co-speech gestures are also temporally aligned with speech, often emphasizing specific w...
Zhen Liao, Gaole Dai, Mengqiao Chen ... · ICASSP 2026
Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns. Conformer's self-attention incurs high memory overhead for long speech sequences and ma...
Zexu Pan, Shengkui Zhao, Yukun Ma ... · arXiv
Most universal sound extraction algorithms focus on isolating a target sound event from single-channel audio mixtures. However, the real world is three-dimensional, and binaural audio, which mimics human hearing, can capture richer spatial information, including sound source loca...
Congyi Fan, Jian Guan, Youtian Lin ... · arXiv
Spatial audio is essential for immersive experiences, yet novel-view acoustic synthesis (NVAS) remains challenging due to complex physical phenomena such as reflection, diffraction, and material absorption. Existing methods based on single-view or panoramic inputs improve spatial...
Helin Wang, Bowen Shi, Andros Tjandra ... · arXiv
Performance evaluation remains a complex challenge in audio separation: existing metrics are often misaligned with human perception, coarse-grained, and reliant on ground-truth signals. On the other hand, subjective listening tests remain the gold standard for real...
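To make the reliance on ground-truth signals concrete, a standard reference-based separation metric such as SI-SDR can only be computed when the clean source is available. The numpy sketch below shows that textbook formula; it is included only to illustrate the limitation the paper critiques, not its proposed evaluation approach.

    import numpy as np

    def si_sdr(estimate, reference, eps=1e-8):
        # scale-invariant SDR in dB; note it requires the clean reference signal
        reference = reference - reference.mean()
        estimate = estimate - estimate.mean()
        scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
        target = scale * reference
        noise = estimate - target
        return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

    t = np.linspace(0, 1, 16000)
    clean = np.sin(2 * np.pi * 220 * t)
    print(si_sdr(clean + 0.1 * np.random.randn(t.size), clean))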
Alexander Polok, Dominik Klement, Samuele Cornell ... · ICASSP 2026
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarizat...
Bharath Krishnamurthy, Ajita Rattani · IEEE ICASSP 2026
Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in ...
Zhihua Fang, Liang He · ICASSP 2026
Speaker embedding learning based on Euclidean space has achieved significant progress, but it is still insufficient in modeling hierarchical information within speaker features. Hyperbolic space, with its negative curvature geometric properties, can efficiently represent hierarch...
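Hyperbolic embeddings are typically placed in the Poincaré ball, where distances grow rapidly toward the boundary and therefore capture hierarchy more naturally than Euclidean distances. The short numpy sketch below computes the standard Poincaré distance as a general illustration; it is not the paper's specific model.

    import numpy as np

    def poincare_distance(u, v, eps=1e-7):
        # geodesic distance between two points inside the unit Poincare ball
        uu = np.clip(np.dot(u, u), 0, 1 - eps)
        vv = np.clip(np.dot(v, v), 0, 1 - eps)
        diff = np.dot(u - v, u - v)
        return np.arccosh(1 + 2 * diff / ((1 - uu) * (1 - vv)))

    a = np.array([0.1, 0.2])
    b = np.array([0.85, 0.3])   # nearer the boundary, so distances blow up
    print(poincare_distance(a, b))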
Haibin Wu, Bach Viet Do, Naveen Suda ... · ICASSP 2026
Neural audio codecs provide promising acoustic features for speech synthesis, with representative streaming codecs like Mimi providing high-quality acoustic features for real-time Text-to-Speech (TTS) applications. However, Mimi's decoder, which employs a hybrid transformer and c...
Yuxiang Wang, Hongyu Liu, Dekun Chen ... · arXiv
As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could re...
Tianhua Li, Chenda Li, Wei Wang ... · arXiv
Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can negatively affect the performanc...
Yinghao Liu, Chengwei Liu, Xiaotao Liang ... · ICASSP 2026
Universal speech enhancement aims at handling inputs with various speech distortions and recording conditions. In this work, we propose a novel hybrid architecture that synergizes the signal fidelity of discriminative modeling with the reconstruction capabilities of generative mo...
Monday, January 26, 2026
Alexander Buck, Georgina Cosma, Iain Phillips ... · arXiv
Explainable AI (XAI) is commonly applied to anomalous sound detection (ASD) models to identify which time-frequency regions of an audio signal contribute to an anomaly decision. However, most audio explanations rely on qualitative inspection of saliency maps, leaving open the que...
Steven Vander Eeckt, Hugo Van hamme · IEEE Transactions on Audio, Speech, and Language Processing
Continual Learning (CL) in Automatic Speech Recognition (ASR) suffers from catastrophic forgetting when adapting to new tasks, domains, or speakers. A common strategy to mitigate this is to store a subset of past data in memory for rehearsal. However, rehearsal-based methods face...
Kohei Asai, Wataru Nakata, Yuki Saito ... · ICASSP 2025 workshop
Real-world audio recordings often contain multiple speakers and various degradations, which limit both the quantity and quality of speech data available for building state-of-the-art speech processing models. Although end-to-end approaches that concatenate speech enhancement (SE)...
Parampreet Singh, Somya Kumar, Chaitanya Shailendra Nitawe ... · NCC 2026 conference
Raga identification in Indian Art Music (IAM) remains challenging due to the presence of numerous rarely performed Ragas that are not represented in available training datasets. Traditional classification models struggle in this setting, as they assume a closed set of known categ...
Bingshen Mu, Xian Shi, Xiong Wang ... · arXiv
Forced alignment (FA) predicts start and end timestamps for words or characters in speech, but existing methods are language-specific and prone to cumulative temporal shifts. The multilingual speech understanding and long-sequence processing abilities of speech large language mod...
Zhengyang Li, Thomas Graave, Björn Möller ... · ICASSP 2026
In audiovisual automatic speech recognition (AV-ASR) systems, fusing visual features into a pre-trained ASR has proven to be a promising way to improve noise robustness. In this work, based on the prominent Whisper ASR, we first propose a simple and effective v...
Junli Chen, Changli Tang, Yixuan Li ... · arXiv
Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with a dual encoder that leverages visual information to improve speech recognition performance in both English and Chinese. D...
Wei Wang, Wangyou Zhang, Chenda Li ... · arXiv
Automatic speech quality assessment has become increasingly important as modern speech generation systems continue to advance, while human listening tests remain costly, time-consuming, and difficult to scale. Most existing learning-based assessment models rely primarily on scarc...
Mengcheng Huang, Xue Zhou, Chen Xu ... · arXiv
Underwater acoustic target recognition (UATR) plays a vital role in marine applications but remains challenging due to limited labeled data and the complexity of ocean environments. This paper explores a central question: can speech large models (SLMs), trained on massive human s...
Wenhao Zou, Yuwei Miao, Zhanyu Ma ... · arXiv
Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation where listeners often start thinking before the speaker finishe...
Zhichao Wang, Tao Li, Wenshuo Ge ... · arXiv
Recent progress in voice conversion (VC) has achieved a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-...
Jai Dhiman · arXiv
Automated piano performance evaluation traditionally relies on symbolic (MIDI) representations, which capture note-level information but miss the acoustic nuances that characterize expressive playing. I propose using pre-trained audio foundation models, specifically MuQ and MERT,...
Zhiliang Peng, Jianwei Yu, Yaoyao Chang ... · arXiv
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent ...
Sunday, January 25, 2026
Jingyao Wu, Grace Lin, Yinuo Song ... · ICASSP 2026
Emotion recognition is inherently ambiguous, with uncertainty arising both from rater disagreement and from discrepancies across modalities such as speech and text. There is growing interest in modeling rater ambiguity using label distributions. However, modality ambiguity remain...
Yiwen Shao, Yong Xu, Sanjeev Khudanpur ... · arXiv
Spatial information is a critical clue for multi-channel multi-speaker target speech recognition. Most state-of-the-art multi-channel Automatic Speech Recognition (ASR) systems extract spatial features only during the speech separation stage, followed by standard single-channel A...
Wenjie Tian, Bingshen Mu, Guobin Ma ... · arXiv
Automatic speech recognition (ASR) systems based on large language models (LLMs) achieve superior performance by leveraging pretrained LLMs as decoders, but their token-by-token generation mechanism leads to inference latency that grows linearly with sequence length. Meanwhile, d...
Anfeng Xu, Tiantian Feng, Somer Bishop ... · arXiv
Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization ...
Md Sazzadul Islam Ridoy, Mubaswira Ibnat Zidney, Sumi Akter ... · 28th International Conference on Computer and Information Technology (ICCIT)
Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition (ASR) research, particularly under noisy and speaker-diverse conditions. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on...