Audio ML Papers

Last 7 Days (January 28 - February 04, 2026)

Subcategories: All (40) | Speech Synthesis (8) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Assessment (3) | Enhancement (7) | ASR (6) | Other (12)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Artem Dementyev, Wazeer Zulfikar, Sinan Hersek ... · arXiv
Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder,...
#2 TOP PAPER (Score: 85)
Qingran Yang, Botao Zhao, Zuheng Kang ... · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods remain underexplored in dist...
#3 TOP PAPER (Score: 84)
Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda ... · IEEE ICASSP 2026
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning...
Monday, February 02, 2026
Qingran Yang, Botao Zhao, Zuheng Kang ... · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods remain underexplored in dist...
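For readers unfamiliar with the distillation setup this abstract builds on, the sketch below shows the standard temperature-scaled knowledge-distillation objective (soft KL against teacher logits plus hard-label cross-entropy). It is a generic illustration only; the class count, temperature, and weighting are placeholder assumptions, not the paper's recipe.

```python
# Minimal sketch of the standard KD objective (not the paper's method).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Soft-target KL (teacher -> student) combined with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: 8 utterances, 4 emotion classes (placeholder sizes).
student_logits = torch.randn(8, 4, requires_grad=True)
teacher_logits = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```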
Chenxu Guo, Jiachen Lian, Yisi Liu ... · arXiv
We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetics evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and ...
Jaejun Lee, Yoori Oh, Kyogu Lee · ICASSP 2026
Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, providing valuable applications in situations where audio signals are unavailable or degraded. While recent diffusion-based models such...
Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang ... · arXiv
Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-...
Fei Liu, Yang Ai · ICASSP 2026
Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech e...
Sunday, February 01, 2026
Chengyuan Ma, Peng Jia, Hongyue Guo ... · ICASSP 2026
Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this cha...
Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang ... · Proceedings of The Web Conference 2026 (WWW'26), short track
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing d...
Yang Xiao, Eun-Jung Holden, Ting Dang · arXiv
Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensi...
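As one concrete example of the parameter-efficient adaptation this abstract motivates, the sketch below implements a LoRA-style low-rank update on a frozen linear layer. LoRA is an assumed stand-in here, not necessarily the adapter the paper proposes, and the rank and layer dimensions are placeholders.

```python
# LoRA-style adapter sketch: freeze the pretrained projection, train a low-rank residual.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                            # keep pretrained weights frozen
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(2, 100, 512))                          # (batch, frames, features)
```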
Saturday, January 31, 2026
Ilyass Moummad, Marius Miron, Lukas Rauch ... · arXiv
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient ap...
Ke Xue, Rongfei Fan, Kai Li ... · arXiv
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in ine...
Yong Ren, Jiangyan Yi, Jianhua Tao ... · arXiv
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglem...
Hao Ma, Ruihao Jing, Shansong Liu ... · arXiv
High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform r...
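As background for "ultra-low bitrates": the bitrate of a discrete codec follows directly from its frame rate, number of codebooks, and codebook size. The numbers below are illustrative placeholders, not the paper's configuration.

```python
# Bitrate of an RVQ-style codec = frames/s * codebooks * bits per code.
import math

def codec_bitrate(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

# Illustrative numbers only: 25 frames/s with one 16384-entry codebook -> 350 bps.
print(codec_bitrate(25, 1, 16384))   # 350.0
```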
Xinting Liao, Ruinan Jin, Hanlin Yu ... · arXiv
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter nois...
Junmin Gong, Yulin Song, Wenxiao Zhao ... · arXiv
We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- ...
Friday, January 30, 2026
Kai Li, Jintao Cheng, Chang Zeng ... · arXiv
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation...
Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda ... · IEEE ICASSP 2026
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning...
Genshun Wan, Wenhui Zhang, Jing-Xuan Zhang ... · ICASSP 2026
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that inte...
Xiaoxuan Guo, Yuankun Xie, Haonan Cheng ... · arXiv
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantic...
Li Zhou, Hao Jiang, Junjie Li ... · ICASSP 2026
Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion em...
Yong Ren, Jingbei Li, Haiyang Sun ... · arXiv
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with ...
Jiaming Zhou, Xuxin Cheng, Shiwan Zhao ... · arXiv
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion ...
Mikko Heikkinen, Archontis Politis, Konstantinos Drossos ... · ICASSP 2026
We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous met...
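For context on what encoding into Ambisonics means, the sketch below applies the classical closed-form first-order (FOA, ACN/SN3D) encoding of a mono plane wave from a known direction; the paper's learned encoder replaces this kind of fixed rule for real microphone arrays, so the convention and signal here are illustrative assumptions.

```python
# Classical first-order Ambisonics encoding of a plane wave (ACN order: W, Y, Z, X; SN3D gains).
import numpy as np

def foa_encode(signal: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    gains = np.array([
        1.0,                                      # W: omnidirectional
        np.sin(azimuth) * np.cos(elevation),      # Y
        np.sin(elevation),                        # Z
        np.cos(azimuth) * np.cos(elevation),      # X
    ])
    return gains[:, None] * signal[None, :]       # (4, num_samples)

mono = np.random.randn(16000)                     # 1 s of audio at 16 kHz (placeholder)
foa = foa_encode(mono, azimuth=np.pi / 4, elevation=0.0)
print(foa.shape)                                  # (4, 16000)
```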
Binh Thien Nguyen, Masahiro Yasuda, Daiki Takeuchi ... · ICASSP 2026
To advance immersive communication, the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge recently introduced Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). An S5 system takes a multi-channel audio mixture as input and outputs single...
Thursday, January 29, 2026
Bing Han, Chushu Zhou, Yifan Yang ... · arXiv
Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex a...
Zheqi Dai, Guangyan Zhang, Haolin He ... · arXiv
In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder...
Cheol Jun Cho, Nicholas Lee, Alan W Black ... · arXiv
Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. ...
Aref Farhadipour, Jan Marquenie, Srikanth Madikeri ... · arXiv
The performance of speaker verification systems degrades significantly under language mismatch, a critical challenge exacerbated by the field's reliance on English-centric data. To address this, we propose the TidyVoice Challenge for cross-lingual speaker verification. The challe...
Xiuwen Zheng, Sixun Dong, Bornali Phukon ... · ICASSP 2026
While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real-world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause se...
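For reference, the WER this abstract treats as the conventional benchmark is a word-level edit distance normalized by reference length. The toy sentences below are hypothetical and simply show how two errors with identical WER can differ sharply in semantic impact.

```python
# Standard WER: word-level Levenshtein distance divided by the reference length.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Same WER, very different meaning: one substitution each.
print(word_error_rate("turn the heating on", "turn the heating off"))  # 0.25
print(word_error_rate("turn the heating on", "turn the heating in"))   # 0.25
```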
Jun Xue, Yi Chai, Yanzhen Ren ... · arXiv
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefo...
Tom Gajecki, Jonas Althoff, Waldo Nogueira · arXiv
We propose a brain-informed speech separation method for cochlear implants (CIs) that uses electroencephalography (EEG)-derived attention cues to guide enhancement toward the attended speaker. An attention-guided network fuses audio mixtures with EEG features through a lightweigh...
Yihui Fu, Tim Fingscheidt · IEEE ICASSP 2026
Diffusion speech enhancement on discrete audio codec features has gained immense attention due to its improved speech component reconstruction capability. However, such methods usually suffer from high inference computational complexity due to multiple reverse process iterations. Furthermore...
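The inference cost this abstract points at comes from the iterative reverse process. The sketch below is a generic DDPM-style ancestral sampling loop, with a dummy noise predictor standing in for the enhancement network, whose cost grows linearly with the number of steps; the schedule, step count, and shapes are illustrative assumptions, not the paper's.

```python
# Generic DDPM ancestral sampling: one network call per reverse step.
import torch

def ddpm_reverse(eps_model, shape, num_steps=200):
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for t in reversed(range(num_steps)):
        eps = eps_model(x, t)                                   # network call at every step
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # add noise except at t = 0
    return x

# Placeholder noise predictor; a real system would use the trained enhancement model.
features = ddpm_reverse(lambda x, t: torch.zeros_like(x), shape=(1, 80, 100))
```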
Wednesday, January 28, 2026
Artem Dementyev, Wazeer Zulfikar, Sinan Hersek ... · arXiv
Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder,...
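As a concrete example of spatial information that a mono downmix discards, the sketch below computes per-bin inter-channel phase differences from a two-channel recording. This is a generic illustration, not PhaseCoder's actual representation, and the STFT parameters are assumed.

```python
# Inter-channel phase difference (IPD) between two channels of a stereo recording.
import numpy as np

def interchannel_phase_difference(left, right, n_fft=512, hop=256):
    def stft(x):
        frames = [x[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(x) - n_fft, hop)]
        return np.fft.rfft(np.array(frames), axis=-1)
    L, R = stft(left), stft(right)
    return np.angle(L * np.conj(R))                 # (num_frames, n_fft // 2 + 1)

left = np.random.randn(16000)
right = np.roll(left, 8)                            # simulate a small inter-channel delay
ipd = interchannel_phase_difference(left, right)
```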
Annie Chu, Hugo Flores García, Oriol Nieto ... · ICASSP 2026
We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By finetuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integr...
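A minimal sketch of what a "noisy surrogate mix at a higher diffusion timestep" could look like, using the standard closed-form forward-noising step q(x_t | x_0) applied to an averaged mix of two sounds; the schedule, timestep, and mixing rule are illustrative assumptions rather than Mix2Morph's actual training recipe.

```python
# Forward-noise an averaged mix of two sources at timestep t (standard DDPM q(x_t | x_0)).
import torch

def noisy_mix_at_t(sound_a, sound_b, t, num_steps=1000):
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    mix = 0.5 * (sound_a + sound_b)                       # simple surrogate mix of two sounds
    noise = torch.randn_like(mix)
    return torch.sqrt(alpha_bar) * mix + torch.sqrt(1.0 - alpha_bar) * noise

a, b = torch.randn(1, 80, 400), torch.randn(1, 80, 400)   # e.g. two mel spectrograms
x_t = noisy_mix_at_t(a, b, t=700)                          # higher t -> noisier surrogate
```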
Yigitcan Özer, Wanying Ge, Zhe Zhang ... · IEICE, SP/SLP 2026
Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding cap...
Sergio Burdisso, Esaú Villatoro-Tello, Andrés Carofilis ... · ICASSP 2026
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech a...
Robin Singh, Aditya Yogesh Nair, Fabio Palumbo ... · arXiv
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models--Dia2, Maya1, and MeloTTS--represent...
Myungjin Lee, Eunji Shin, Jiyoung Lee · arXiv
Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious crime risks, as they can synthesize voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upo...
Xiangbo Wang, Wenbin Jiang, Jin Wang ... · ICASSP 2026
Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex...
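For readers unfamiliar with the mechanism this abstract refers to, the sketch below is a minimal residual vector quantizer in which the number of stages applied per frame directly sets the bitrate/fidelity trade-off; the dimensions and codebook sizes are placeholders, and the adaptive per-frame allocation is the paper's contribution, not shown here.

```python
# Minimal residual vector quantization: each stage quantizes what earlier stages missed.
import torch

def residual_vq(frames, codebooks):
    residual = frames
    quantized = torch.zeros_like(frames)
    indices = []
    for codebook in codebooks:                         # each codebook: (num_codes, dim)
        dists = torch.cdist(residual, codebook)        # (num_frames, num_codes)
        idx = dists.argmin(dim=-1)
        chosen = codebook[idx]
        quantized = quantized + chosen
        residual = residual - chosen
        indices.append(idx)
    return quantized, indices

frames = torch.randn(100, 64)                          # 100 frames of 64-dim latents (placeholder)
books = [torch.randn(256, 64) for _ in range(4)]       # 4 stages of 256 codes each
recon, codes = residual_vq(frames, books)
```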
Ryan Whetten, Titouan Parcollet, Marco Dinarelli ... · IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2026)
Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine h...