Audio ML Papers

Last 7 Days (February 03 - February 10, 2026)

Subcategories: All (27) | Speech Synthesis (4) | Music Synthesis (1) | Ambient Synthesis (5) | Quality Assessment (0) | Enhancement (3) | ASR (1) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Chang Li, Kanglei Zhou, Liyuan Wang · ICLR 2026
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present t...
#2 TOP PAPER (Score: 91)
Xuenan Xu, Yiming Ren, Liwei Liu ... · arXiv
Recent advances in speech synthesis and editing have made speech spoofing increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semant...
#3 TOP PAPER (Score: 83)
Michael Küttner, Valeria Zitz, Supraja Ramesh ... · arXiv
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of p...
Friday, February 06, 2026
Yuancheng Wang, Zhenyu Tang, Yun Wang ... · arXiv
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose...
Ziyu Luo, Lin Chen, Qiang Qu ... · arXiv
Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the d...
Hugo Seuté, Pranai Vasudev, Etienne Richan ... · Temporary pre-print, will be updated. In review at a conference
Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses o...
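For context, once an impulse response has been precomputed and stored, real-time auralization reduces to convolving the dry source signal with that response. A minimal sketch, assuming NumPy/SciPy and a hand-made toy impulse response (not the paper's wave-coding pipeline):

```python
import numpy as np
from scipy.signal import fftconvolve

sr = 48_000
t = np.arange(sr) / sr
dry = np.sin(2 * np.pi * 440.0 * t)           # 1 s dry test tone

ir = np.zeros(sr // 2)                         # toy 0.5 s impulse response
ir[0] = 1.0                                    # direct path
ir[int(0.03 * sr)] = 0.4                       # early reflection at 30 ms
ir[int(0.12 * sr)] = 0.15                      # later reflection at 120 ms

wet = fftconvolve(dry, ir)[: len(dry)]         # auralized signal via convolution
```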
Thursday, February 05, 2026
Chunyat Wu, Jiajun Deng, Zhengxi Liu ... · ICASSP 2026
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of...
Kaiyuan Zhang, Mohan Shi, Eray Eren ... · arXiv
Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation...
Qing Wen, Haohao Li, Zhongjie Ba ... · arXiv
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise rel...
Haoqin Sun, Chenyang Lyu, Shiwan Zhao ... · arXiv
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprints req...
Wednesday, February 04, 2026
Xuenan Xu, Yiming Ren, Liwei Liu ... · arXiv
Recent advances in speech synthesis and editing have made speech spoofing increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semant...
Haina Zhu, Yao Xiao, Xiquan Li ... · arXiv
We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for ...
Georgii Aparin, Tasnima Sadekova, Alexey Rukhovich ... · EACL 2026, main track
Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their...
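For readers unfamiliar with sparse autoencoders, the basic probe is a wide hidden layer with an L1 sparsity penalty, trained to reconstruct frozen encoder activations. A minimal PyTorch sketch; the dimensions and penalty weight are placeholders rather than the paper's settings:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE over frozen encoder activations (illustrative placeholder sizes)."""
    def __init__(self, d_model=768, d_hidden=8192, l1_coeff=1e-3):
        super().__init__()
        self.encode = nn.Linear(d_model, d_hidden)
        self.decode = nn.Linear(d_hidden, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts):
        # acts: (batch, frames, d_model) activations from one Whisper/HuBERT layer
        codes = torch.relu(self.encode(acts))    # sparse feature activations
        recon = self.decode(codes)               # reconstruction of the activations
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * codes.abs().mean()
        return codes, recon, loss
```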
Tuan Dat Phuong, Duc-Tuan Truong, Long-Vu Hoang ... · ICASSP 2026
Transformer-based models have shown strong performance in speech deepfake detection, largely due to the effectiveness of the multi-head self-attention (MHSA) mechanism. MHSA provides frame-level attention scores, which are particularly valuable because deepfake artifacts often oc...
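To make the frame-level idea concrete, a standard self-attention layer already exposes per-frame attention weights that can be inspected. A minimal PyTorch sketch with placeholder dimensions, not the authors' detector:

```python
import torch
import torch.nn as nn

mhsa = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
frames = torch.randn(1, 200, 256)              # (batch, frames, features) from a front-end
out, attn = mhsa(frames, frames, frames,       # self-attention over frames
                 need_weights=True, average_attn_weights=True)
# attn: (batch, frames, frames) head-averaged attention scores; inspecting how much
# attention each frame receives is one way to surface candidate artifact regions.
frame_saliency = attn.mean(dim=1)              # rough per-frame saliency (illustrative)
```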
Amir Ivry, Shinji Watanabe · arXiv
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LAL...
Dongchao Yang, Yuanyuan Wang, Dading Chong ... · arXiv
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot sett...
Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang ... · IEEE Transactions on Audio, Speech and Language Processing
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, part...
Vikentii Pankov, Artem Gribul, Oktai Tatanov ... · ICASSP 2026
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining ...
Tuesday, February 03, 2026
Chang Li, Kanglei Zhou, Liyuan Wang · ICLR 2026
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present t...
Shunxi Xu, Thushara Abhayapala, Craig T. Jin · ICASSP 2026
We propose a data-driven sparse recovery framework for hybrid spherical linear microphone arrays using singular value decomposition (SVD) of the transfer operator. The SVD yields orthogonal microphone and field modes, reducing to spherical harmonics (SH) in the SMA-only case, whi...
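The underlying linear-algebra step is an SVD of the array's transfer operator, keeping only the dominant singular modes. A toy NumPy sketch with a random matrix standing in for the transfer operator; the array size and truncation threshold are illustrative:

```python
import numpy as np

# Placeholder transfer operator: rows = microphone channels, columns = field points.
rng = np.random.default_rng(0)
T = rng.standard_normal((32, 256)) + 1j * rng.standard_normal((32, 256))

# SVD factors the operator into orthogonal microphone modes (U), singular values (s),
# and orthogonal field modes (Vh).
U, s, Vh = np.linalg.svd(T, full_matrices=False)

# Keep only dominant modes, analogous to truncating a spherical-harmonic expansion.
k = int(np.sum(s > 0.05 * s[0]))
T_approx = (U[:, :k] * s[:k]) @ Vh[:k, :]
rel_err = np.linalg.norm(T - T_approx) / np.linalg.norm(T)
print(f"kept {k} of {len(s)} modes, relative error {rel_err:.3f}")
```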
Siyi Wang, Shihong Tan, Siyi Liu ... · arXiv
Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing af...
Hugo Malard, Gael Le Lan, Daniel Wong ... · arXiv
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misa...
Michael Küttner, Valeria Zitz, Supraja Ramesh ... · arXiv
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of p...
Goksenin Yuksel, Marcel van Gerven, Kiki van der Heijden · arXiv
Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments wi...
Seohyun Joo, Yoori Oh · arXiv
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully le...
Xi Xuan, Davide Carbone, Ruchi Pandey ... · IEEE Signal Processing Letters
Front-end design for speech deepfake detectors primarily falls into two categories. Hand-crafted filterbank features are transparent but limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL f...