Audio ML Papers

Last 7 Days (January 30 - February 06, 2026)

Subcategories: All (46) | Speech Synthesis (12) | Music Synthesis (4) | Ambient Synthesis (4) | Quality Assessment (2) | Enhancement (7) | ASR (4) | Other (13)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 91)
Chang Li, Kanglei Zhou, Liyuan Wang · ICLR 2026
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present t...
#2 TOP PAPER (Score: 91)
Xuenan Xu, Yiming Ren, Liwei Liu ... · arXiv
Recent advances in speech synthesis and editing have made speech spoofing increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semant...
#3 TOP PAPER (Score: 85)
Qingran Yang, Botao Zhao, Zuheng Kang ... · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods remain underexplored in dist...
Wednesday, February 04, 2026
Xuenan Xu, Yiming Ren, Liwei Liu ... · arXiv
Recent advances in speech synthesis and editing have made speech spoofing increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semant...
Haina Zhu, Yao Xiao, Xiquan Li ... · arXiv
We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for ...
Tuan Dat Phuong, Duc-Tuan Truong, Long-Vu Hoang ... · ICASSP 2026
Transformer-based models have shown strong performance in speech deepfake detection, largely due to the effectiveness of the multi-head self-attention (MHSA) mechanism. MHSA provides frame-level attention scores, which are particularly valuable because deepfake artifacts often oc...
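Illustrative sketch (not from the paper): frame-level attention scores can be read directly out of a standard multi-head self-attention layer, e.g. in PyTorch; the layer sizes, averaging over heads, and the simple per-frame saliency heuristic below are assumptions for illustration, not the authors' method.

```python
import torch
import torch.nn as nn

# Minimal sketch: read per-frame attention scores out of a standard MHSA layer.
# The encoder, layer choice, and pooling are illustrative assumptions only.
d_model, n_heads, n_frames = 256, 4, 200
mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, n_frames, d_model)            # frame-level features (B, T, D)
out, attn = mhsa(x, x, x, need_weights=True,
                 average_attn_weights=True)      # attn: (B, T, T), head-averaged

# A simple frame "saliency" score: how much attention each frame receives
# from all query frames; localized artifacts would ideally show up as peaks.
frame_scores = attn.mean(dim=1).squeeze(0)       # (T,)
print(frame_scores.topk(5).indices)
```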
Amir Ivry, Shinji Watanabe · arXiv
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for their socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LAL...
Dongchao Yang, Yuanyuan Wang, Dading Chong ... · arXiv
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot sett...
Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang ... · IEEE Transactions on Audio, Speech and Language Processing
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, part...
Vikentii Pankov, Artem Gribul, Oktai Tatanov ... · ICASSP 2026
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining ...
Tuesday, February 03, 2026
Chang Li, Kanglei Zhou, Liyuan Wang · ICLR 2026
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present t...
Shunxi Xu, Thushara Abhayapala, Craig T. Jin · ICASSP 2026
We propose a data-driven sparse recovery framework for hybrid spherical linear microphone arrays using singular value decomposition (SVD) of the transfer operator. The SVD yields orthogonal microphone and field modes, reducing to spherical harmonics (SH) in the SMA-only case, whi...
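Illustrative sketch (not from the paper): the SVD-of-a-transfer-operator idea can be outlined in a few lines of NumPy; the transfer matrix here is a random placeholder rather than a modeled hybrid spherical-linear array, and the modal truncation and regularization are assumptions.

```python
import numpy as np

# Decompose a (hypothetical) transfer operator H mapping field coefficients to
# microphone signals into orthogonal microphone modes (U) and field modes (V),
# then invert in the retained modal basis. H is a random placeholder.
rng = np.random.default_rng(0)
n_mics, n_field = 32, 64
H = rng.standard_normal((n_mics, n_field))

U, s, Vt = np.linalg.svd(H, full_matrices=False)
k = int(np.sum(s > 1e-2 * s[0]))                 # keep well-conditioned modes
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k]

p = rng.standard_normal(n_mics)                  # microphone pressure vector
field_coeffs = Vt_k.T @ ((U_k.T @ p) / (s_k + 1e-6))   # regularized inversion
print(field_coeffs.shape)
```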
Siyi Wang, Shihong Tan, Siyi Liu ... · arXiv
Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing af...
Hugo Malard, Gael Le Lan, Daniel Wong ... · arXiv
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misa...
Michael Küttner, Valeria Zitz, Supraja Ramesh ... · arXiv
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of p...
Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden · arXiv
Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments wi...
Seohyun Joo, Yoori Oh · arXiv
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully le...
Xi Xuan, Davide Carbone, Ruchi Pandey ... · IEEE Signal Processing Letters
Designing front-ends for speech deepfake detectors primarily focuses on two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL f...
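Illustrative sketch (not from the paper): a typical hand-crafted filterbank front-end of the kind contrasted with SSL features is a log-mel spectrogram; the torchaudio parameters below are common defaults, not the configuration used in the letter.

```python
import torch
import torchaudio

# Log-mel filterbank front-end with common 16 kHz defaults (placeholder audio).
wav = torch.randn(1, 16000)                      # 1 s of audio at 16 kHz
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
logmel = torch.log(melspec(wav) + 1e-6)          # (1, 80, frames)
print(logmel.shape)
```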
Monday, February 02, 2026
Qingran Yang, Botao Zhao, Zuheng Kang ... · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods remain underexplored in dist...
Chenxu Guo, Jiachen Lian, Yisi Liu ... · arXiv
We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetic evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and ...
Jaejun Lee, Yoori Oh, Kyogu Lee · ICASSP 2026
Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, providing valuable applications in situations where audio signals are unavailable or degraded. While recent diffusion-based models such...
Rajalaxmi Rajagopalan, Ritwik Giri, Zhiqiang Tang ... · arXiv
Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-...
Fei Liu, Yang Ai · ICASSP 2026
Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech e...
Abdoulaye Diack, Perry Nelson, Kwaku Agbesi ... · arXiv
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 langua...
Xiaosha Li, Chun Liu, Ziyu Wang · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from "garbage music". Curiously, we observe that the standard cross-entropy loss -- a core training metric -- often decreases ...
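Illustrative sketch (not from the paper): the metric in question is the average per-token cross-entropy of a token sequence under a causal language model; the random logits below are a stand-in for a real music LLM.

```python
import torch
import torch.nn.functional as F

# Average per-token cross-entropy of a tokenized sequence under a causal LM.
# Logits here are random placeholders; a real evaluation would use the music
# LLM's own predictions.
vocab_size, seq_len = 1024, 128
tokens = torch.randint(0, vocab_size, (1, seq_len))
logits = torch.randn(1, seq_len, vocab_size)

# Shift so that position t predicts token t+1, as in standard LM training.
ce = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(f"avg per-token cross-entropy: {ce.item():.3f}")
```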
Sunday, February 01, 2026
Chengyuan Ma, Peng Jia, Hongyue Guo ... · ICASSP 2026
Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this cha...
Chengyuan Ma, Jiawei Jin, Ruijie Xiong ... · ICASSP 2026
We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of dat...
Zhili Nicholas Liang, Soyeon Caren Han, Qizhou Wang ... · Proceedings of The Web Conference 2026 (WWW'26), short track
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing d...
Yochai Yemini, Yoav Ellinson, Rami Ben-Ari ... · arXiv
This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion p...
Yang Xiao, Eun-Jung Holden, Ting Dang · arXiv
Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensi...
Hong Jia, Weibin Li, Jingyao Wu ... · arXiv
Emotion recognition from human speech is a critical enabler for socially aware conversational AI. However, while most prior work frames emotion recognition as a categorical classification problem, real-world affective states are often ambiguous, overlapping, and context-dependent...
Saturday, January 31, 2026
Ilyass Moummad, Marius Miron, Lukas Rauch ... · arXiv
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient ap...
Ke Xue, Rongfei Fan, Kai Li ... · arXiv
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in ine...
Yong Ren, Jiangyan Yi, Jianhua Tao ... · arXiv
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglem...
Hao Ma, Ruihao Jing, Shansong Liu ... · arXiv
High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform r...
Xinting Liao, Ruinan Jin, Hanlin Yu ... · arXiv
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter nois...
Junmin Gong, Yulin Song, Wenxiao Zhao ... · arXiv
We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- ...
Friday, January 30, 2026
Kai Li, Jintao Cheng, Chang Zeng ... · arXiv
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation...
Muhammad Shakeel, Yosuke Fukumoto, Chikara Maeda ... · IEEE ICASSP 2026
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning...
Genshun Wan, Wenhui Zhang, Jing-Xuan Zhang ... · ICASSP 2026
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that inte...
Xiaoxuan Guo, Yuankun Xie, Haonan Cheng ... · arXiv
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantic...
Li Zhou, Hao Jiang, Junjie Li ... · ICASSP 2026
Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion em...
Yong Ren, Jingbei Li, Haiyang Sun ... · arXiv
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with ...
Seungu Han, Sungho Lee, Kyogu Lee · ICASSP 2026
Recent speech enhancement (SE) models increasingly leverage self-supervised learning (SSL) representations for their rich semantic information. Typically, intermediate features are aggregated into a single representation via a lightweight adaptation module. However, most SSL mode...
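Illustrative sketch (not from the paper): a common form of such a lightweight adaptation module is a learnable softmax-weighted sum over SSL layer outputs; the tensor shapes below are placeholders standing in for, e.g., WavLM hidden states.

```python
import torch
import torch.nn as nn

# Learnable weighted sum over the hidden states of an SSL model's layers.
# The random features below are placeholders for real SSL layer outputs.
n_layers, T, D = 13, 200, 768
layer_feats = torch.randn(n_layers, 1, T, D)        # (layers, batch, time, dim)

class WeightedLayerSum(nn.Module):
    def __init__(self, n_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)       # one scalar weight per layer
        return (w.view(-1, 1, 1, 1) * feats).sum(dim=0)

fused = WeightedLayerSum(n_layers)(layer_feats)      # (batch, time, dim)
print(fused.shape)
```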
Jiaming Zhou, Xuxin Cheng, Shiwan Zhao ... · arXiv
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion ...
Mikko Heikkinen, Archontis Politis, Konstantinos Drossos ... · ICASSP 2026
We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous met...
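Illustrative sketch (not from the paper): the conventional linear baseline that such a network generalizes beyond is a regularized least-squares Ambisonics encoder designed from the array's plane-wave responses; the steering matrix and SH targets below are random placeholders rather than measured, frequency-dependent responses.

```python
import numpy as np

# Regularized least-squares encoder E mapping mic signals to first-order
# Ambisonics, designed from (placeholder) plane-wave responses A and desired
# spherical-harmonic responses Y.
rng = np.random.default_rng(0)
n_mics, n_dirs, n_sh = 8, 240, 4             # 4 SH channels = first order

A = rng.standard_normal((n_mics, n_dirs))    # array response to plane waves
Y = rng.standard_normal((n_sh, n_dirs))      # target SH response per direction

lam = 1e-3                                   # Tikhonov regularization
E = Y @ A.T @ np.linalg.inv(A @ A.T + lam * np.eye(n_mics))   # (n_sh, n_mics)

p = rng.standard_normal((n_mics, 1024))      # microphone signals (mics x samples)
ambi = E @ p                                 # first-order Ambisonics signals
print(ambi.shape)
```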
Binh Thien Nguyen, Masahiro Yasuda, Daiki Takeuchi ... · ICASSP 2026
To advance immersive communication, the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge recently introduced Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). An S5 system takes a multi-channel audio mixture as input and outputs single...