Audio ML Papers

Last 7 Days (February 12 - February 19, 2026)

Subcategories: All (20) | Speech Synthesis (3) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Assessment (1) | Enhancement (1) | ASR (4) | Other (11)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
#2 TOP PAPER (Score: 84)
Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri ... · Speech, Music and Mind (SMM26) workshop at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-...
#3 TOP PAPER (Score: 83)
Xingyu Chen, Hanwen Bi, Fei Ma ... · arXiv
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject ...
Tuesday, February 17, 2026
Sonal Kumar, Prem Seetharaman, Ke Chen ... · arXiv
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions a...
Samir Sadok, Laurent Girin, Xavier Alameda-Pineda · arXiv
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are poorly robust to a global variation of the input signal level in the sense that such varia...
Jonah Casebeer, Ge Zhu, Zhepei Wang ... · arXiv
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent ...
Monday, February 16, 2026
Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar ... · arXiv
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models sh...
Zineb Lahrichi, Gaëtan Hadjeres, Gaël Richard ... · International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2026, Barcelona, Spain
Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing meth...
Sunday, February 15, 2026
Reda Bensaid, Amine Ouasfi, Yassir Bendou ... · arXiv
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through...
Ziyang Ma, Ruiyang Xu, Yinghao Ma ... · arXiv
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) qua...
H. M. Shadman Tabib, Istiak Ahmmed Rifti, Abdullah Muhammed Amimul Ehsan ... · arXiv
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a re...
Dan Zhang, Yishu Lei, Jing Hu ... · arXiv
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrat...
Saturday, February 14, 2026
Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri ... · Speech, Music and Mind (SMM26) workshop at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-...
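As a rough illustration of the recipe the abstract describes, the sketch below pools embeddings from a pretrained self-supervised speech model and fits a small classifier over the four phonation modes. The choice of wav2vec 2.0 base and logistic regression is an assumption for illustration, not the paper's exact setup.

```python
# Generic SSL-embedding + shallow-classifier recipe (illustrative, not the
# paper's configuration).
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

LABELS = ["breathy", "neutral", "flow", "pressed"]

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(waveform_16k: np.ndarray) -> np.ndarray:
    """Mean-pool the last hidden layer into one utterance-level vector."""
    inputs = extractor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = ssl_model(**inputs).last_hidden_state   # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()         # (768,)

def fit_classifier(train_clips, train_labels):
    """train_clips: 16 kHz mono arrays; train_labels: strings from LABELS."""
    X = np.stack([embed(clip) for clip in train_clips])
    y = [LABELS.index(lab) for lab in train_labels]
    return LogisticRegression(max_iter=1000).fit(X, y)
```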
Siqian Tong, Xuan Li, Yiwei Wang ... · arXiv
Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tool...
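The external tools the abstract alludes to can be as simple as deterministic signal-processing routines whose outputs a language model consumes as text. The librosa-based sketch below is illustrative only; the function names and dictionary outputs are assumptions, not the paper's tool interface.

```python
# Hypothetical acoustic-measurement "tools" a tool-augmented LALM could call.
import librosa
import numpy as np

def tempo_tool(path: str) -> dict:
    """Estimate global tempo in beats per minute."""
    y, sr = librosa.load(path, sr=None, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    return {"tempo_bpm": round(float(tempo), 1)}

def pitch_tool(path: str) -> dict:
    """Estimate the median fundamental frequency of the voiced frames."""
    y, sr = librosa.load(path, sr=None, mono=True)
    f0, voiced, _ = librosa.pyin(
        y, sr=sr, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7")
    )
    f0 = f0[voiced & ~np.isnan(f0)]
    return {"median_f0_hz": round(float(np.median(f0)), 1)} if f0.size else {"median_f0_hz": None}
```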
Zhe Ye, Xiangui Kang, Jiayi He ... · arXiv
As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods primarily depend on XLS-R front-end features to improve generalization. Nonetheless, their performance remains limited, partly due ...
Amro Asali, Yehuda Ben-Shimol, Itshak Lapidot · arXiv
Spoofing-robust automatic speaker verification (SASV) seeks to build automatic speaker verification systems that are robust against both zero-effort impostor attacks and sophisticated spoofing techniques such as voice conversion (VC) and text-to-speech (TTS). In this work, we pro...
Maohao Shen, Tejas Jayashankar, Osama Hanna ... · arXiv
Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality ...
Sripathi Sridhar, Prem Seetharaman, Oriol Nieto ... · ICASSP 2026
Sound designers search for sounds in large sound effects libraries using aspects such as sound class or visual context. However, the metadata needed for such search is often missing or incomplete, and requires significant manual effort to add. Existing solutions to automate this ...
Friday, February 13, 2026
Giovanni Bologni, Nicolás Arrieta Larraza, Richard Heusdens ... · arXiv
Deep Neural Networks (DNNs) often struggle to suppress noise at low signal-to-noise ratios (SNRs). This paper addresses speech enhancement in scenarios dominated by harmonic noise and proposes a framework that integrates cyclostationarity-aware preprocessing with lightweight DNN-...
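The paper's cyclostationarity-aware preprocessing is not detailed in this snippet. Purely as a generic illustration of exploiting the harmonic structure of a noise source, the sketch below notches out the first few harmonics of a known or estimated noise fundamental before a DNN enhancer sees the signal; the filter type and parameters are assumptions.

```python
# Generic harmonic-noise preprocessing sketch (not the paper's method):
# suppress the first few harmonics of a noise fundamental f0 with IIR notches.
import numpy as np
from scipy.signal import iirnotch, filtfilt

def notch_harmonics(x: np.ndarray, sr: int, f0: float,
                    n_harmonics: int = 8, q: float = 30.0) -> np.ndarray:
    y = x.astype(np.float64)
    for k in range(1, n_harmonics + 1):
        fk = k * f0
        if fk >= sr / 2:          # stay below the Nyquist frequency
            break
        b, a = iirnotch(w0=fk, Q=q, fs=sr)
        y = filtfilt(b, a, y)     # zero-phase notch at the k-th harmonic
    return y
```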
Jaeyoung Lee, Masato Mimura · ICASSP 2026
We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert...
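A minimal PyTorch sketch of the modality-aware sparse MoE idea is given below: speech and text tokens are routed, top-1, to disjoint expert groups inside one feed-forward block. The dimensions, the top-1 router, and the hard speech/text split are assumptions for illustration, not the paper's exact design.

```python
# Illustrative modality-aware sparse MoE feed-forward block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, experts_per_modality=4):
        super().__init__()
        self.n = experts_per_modality
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(2 * experts_per_modality)          # [0:n] speech, [n:2n] text
        )
        self.router = nn.Linear(d_model, 2 * experts_per_modality)

    def forward(self, x, is_text):
        """x: (tokens, d_model); is_text: (tokens,) bool."""
        logits = self.router(x)                               # (tokens, 2n)
        mask = torch.zeros_like(logits, dtype=torch.bool)
        mask[~is_text, : self.n] = True                       # speech tokens -> speech experts
        mask[is_text, self.n :] = True                        # text tokens   -> text experts
        logits = logits.masked_fill(~mask, float("-inf"))
        probs = F.softmax(logits, dim=-1)
        top_p, top_i = probs.max(dim=-1)                      # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e
            if sel.any():
                out[sel] = top_p[sel, None] * expert(x[sel])
        return out
```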
Thursday, February 12, 2026
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
Xingyu Chen, Hanwen Bi, Fei Ma ... · arXiv
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject ...
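For reference, the kernel-based weighting baseline mentioned in the abstract can be written in a few lines: the HRTF at an unmeasured direction is a normalized, angular-distance-weighted sum of the sparse measurements. The Gaussian kernel and its width below are illustrative choices, not the paper's method.

```python
# Kernel-weighted HRTF interpolation over sparse measurement directions.
import numpy as np

def angular_distance(d1: np.ndarray, d2: np.ndarray) -> np.ndarray:
    """Great-circle angle (radians) between sets of unit direction vectors."""
    return np.arccos(np.clip(d1 @ d2.T, -1.0, 1.0))

def kernel_upsample(measured_dirs, measured_hrtfs, target_dirs, sigma=0.3):
    """measured_dirs: (M, 3) unit vectors; measured_hrtfs: (M, F) spectra;
    target_dirs: (N, 3) unit vectors. Returns (N, F) interpolated spectra."""
    theta = angular_distance(np.asarray(target_dirs), np.asarray(measured_dirs))  # (N, M)
    w = np.exp(-0.5 * (theta / sigma) ** 2)      # Gaussian kernel on angular distance
    w /= w.sum(axis=1, keepdims=True)            # normalize weights per target direction
    return w @ np.asarray(measured_hrtfs)
```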
Yifan Liang, Andong Li, Kang Yang ... · arXiv
Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion ...