Audio ML Papers

Last 7 Days (February 11 - February 18, 2026)

Subcategories: All (20) | Speech Synthesis (5) | Music Synthesis (0) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (1) | ASR (4) | Other (10)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
#2 TOP PAPER (Score: 84)
Jingru Lin, Chen Zhang, Tianrui Wang ... · Audio-AAAI
Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus onl...
#3 TOP PAPER (Score: 84)
Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri ... · Speech, Music and Mind (SMM26) workshop at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-...
Monday, February 16, 2026
Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar ... · arXiv
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models sh...
Sunday, February 15, 2026
Reda Bensaid, Amine Ouasfi, Yassir Bendou ... · arXiv
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through...
Ziyang Ma, Ruiyang Xu, Yinghao Ma ... · arXiv
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) qua...
H. M. Shadman Tabib, Istiak Ahmmed Rifti, Abdullah Muhammed Amimul Ehsan ... · arXiv
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a re...
Dan Zhang, Yishu Lei, Jing Hu ... · arXiv
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrat...
Saturday, February 14, 2026
Aju Ani Justus, Ruchit Agrawal, Sudarsana Reddy Kadiri ... · Speech, Music and Mind (SMM26) workshop at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
We present voice2mode, a method for classification of four singing phonation modes (breathy, neutral (modal), flow, and pressed) using embeddings extracted from large self-supervised speech models. Prior work on singing phonation has relied on handcrafted signal features or task-...
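The recipe this abstract describes (frozen self-supervised embeddings pooled into an utterance vector, plus a small classifier on top) is straightforward to prototype. Below is a minimal sketch under that reading, not the authors' released code; the wav2vec2 checkpoint and the linear head are illustrative assumptions.

```python
# Hedged sketch: pooled SSL embeddings + linear head for phonation modes.
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

MODES = ["breathy", "neutral", "flow", "pressed"]
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
head = torch.nn.Linear(encoder.config.hidden_size, len(MODES))  # assumed head

def embed(waveform, sr=16000):
    """Mean-pool frame-level encoder states into one utterance embedding."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1)

# Training would fit `head` on labeled singing clips with cross-entropy,
# keeping the encoder frozen.
```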
Siqian Tong, Xuan Li, Yiwei Wang ... · arXiv
Large Audio Language Models (LALMs) excel at perception but struggle with complex reasoning requiring precise acoustic measurements. While external tools can extract fine-grained features like exact tempo or pitch, effective integration remains challenging: naively using all tool...
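For context on what "external tools" means here, the following is a hedged sketch of the kind of fine-grained measurement tool the abstract alludes to: off-the-shelf librosa estimators for tempo and pitch whose textual report could be handed to an LALM as tool output. The function name and report format are assumptions for illustration, not the paper's interface.

```python
# Hedged sketch of an acoustic measurement tool for LALM tool use.
import librosa
import numpy as np

def acoustic_tool_report(path: str) -> str:
    y, sr = librosa.load(path, sr=None, mono=True)
    # Global tempo estimate in BPM.
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    # Frame-wise fundamental frequency via probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    median_f0 = float(np.nanmedian(f0)) if np.any(voiced_flag) else float("nan")
    # Textual report that could be appended to a model prompt as tool output.
    return f"tempo_bpm={float(tempo):.1f}, median_f0_hz={median_f0:.1f}"
```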
Zhe Ye, Xiangui Kang, Jiayi He ... · arXiv
As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods primarily depend on XLS-R front-end features to improve generalization. Nonetheless, their performance remains limited, partly due ...
Amro Asali, Yehuda Ben-Shimol, Itshak Lapidot · arXiv
Spoofing-robust automatic speaker verification (SASV) seeks to build automatic speaker verification systems that are robust against both zero-effort impostor attacks and sophisticated spoofing techniques such as voice conversion (VC) and text-to-speech (TTS). In this work, we pro...
Maohao Shen, Tejas Jayashankar, Osama Hanna ... · arXiv
Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality ...
Friday, February 13, 2026
Giovanni Bologni, Nicolás Arrieta Larraza, Richard Heusdens ... · arXiv
Deep Neural Networks (DNNs) often struggle to suppress noise at low signal-to-noise ratios (SNRs). This paper addresses speech enhancement in scenarios dominated by harmonic noise and proposes a framework that integrates cyclostationarity-aware preprocessing with lightweight DNN-...
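Cyclostationarity-aware preprocessing can take many forms; one textbook instance, shown below as a toy sketch (an assumption about the general idea, not the paper's pipeline), estimates the period of the harmonic noise by autocorrelation and subtracts the period-synchronous average, which cancels components repeating at that period.

```python
# Toy sketch: cancel a periodic (harmonic) noise component by
# period-synchronous averaging. Not the paper's method.
import numpy as np

def remove_periodic_component(x: np.ndarray, sr: int,
                              min_f0: float = 50.0, max_f0: float = 400.0):
    # Autocorrelation over plausible period lags.
    lags = np.arange(int(sr / max_f0), int(sr / min_f0))
    ac = np.array([np.dot(x[:-l], x[l:]) for l in lags])
    period = int(lags[np.argmax(ac)])
    # Period-synchronous average: mean over consecutive one-period frames.
    n = (len(x) // period) * period
    frames = x[:n].reshape(-1, period)
    template = frames.mean(axis=0)
    cleaned = x[:n] - np.tile(template, len(frames))
    return cleaned, sr / period  # enhanced signal, estimated noise F0 (Hz)
```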
Jaeyoung Lee, Masato Mimura · ICASSP 2026
We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLMs). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert...
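To make the modality-aware routing concrete, here is a schematic PyTorch sketch of one plausible reading of the abstract (not the released model): speech and text tokens are routed only within disjoint expert subsets, so the two modalities never share feed-forward weights.

```python
# Schematic sketch of a modality-aware sparse MoE layer (top-1 routing).
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    def __init__(self, d_model=512, experts_per_modality=4, d_ff=2048):
        super().__init__()
        self.n = experts_per_modality
        # Experts 0..n-1 serve speech tokens, n..2n-1 serve text tokens.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(2 * experts_per_modality))
        self.router = nn.Linear(d_model, 2 * experts_per_modality)

    def forward(self, x, is_text):  # x: (tokens, d_model); is_text: (tokens,) bool
        logits = self.router(x)
        # Mask out the other modality's experts before top-1 selection.
        offset = is_text.long() * self.n
        mask = torch.full_like(logits, float("-inf"))
        allowed = offset.unsqueeze(1) + torch.arange(self.n, device=x.device)
        mask.scatter_(1, allowed, 0.0)
        choice = (logits + mask).argmax(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = choice == e
            if sel.any():
                out[sel] = expert(x[sel])
        return out
```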
Thursday, February 12, 2026
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
Xingyu Chen, Hanwen Bi, Fei Ma ... · arXiv
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject ...
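For readers unfamiliar with the kernel-based baseline the abstract contrasts against, a minimal sketch follows: the HRTF at an unmeasured direction is a normalized kernel-weighted average of the sparse measurements, with weights decaying in great-circle distance. The Gaussian kernel and its width are illustrative choices, not the paper's method.

```python
# Minimal sketch of kernel-weighted HRTF interpolation (baseline idea).
import numpy as np

def interpolate_hrtf(dirs, hrtfs, query, beta=8.0):
    """dirs: (N, 3) unit vectors of measured directions;
    hrtfs: (N, F) complex HRTFs; query: (3,) unit vector."""
    cos_angle = np.clip(dirs @ query, -1.0, 1.0)
    ang = np.arccos(cos_angle)        # great-circle distance to each sample
    w = np.exp(-beta * ang ** 2)      # Gaussian kernel in angle (assumed)
    w /= w.sum()                      # normalize weights
    return w @ hrtfs                  # (F,) interpolated response
```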
Yifan Liang, Andong Li, Kang Yang ... · arXiv
Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion ...
Wednesday, February 11, 2026
Jingru Lin, Chen Zhang, Tianrui Wang ... · Audio-AAAI
Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus onl...
Liyang Chen, Hongkai Chen, Yujun Cai ... · arXiv
Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize percep...
Yitian Gong, Kuangwei Chen, Zhaoye Fei ... · arXiv
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures....
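Most discrete audio tokenizers build on residual vector quantization (RVQ), so a compact generic sketch is useful context; the details below (codebook size, nearest-neighbor assignment) are standard RVQ, not this paper's architecture.

```python
# Generic sketch of residual vector quantization for audio tokenization.
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, dim=128, codebook_size=1024, n_quantizers=8):
        super().__init__()
        self.codebooks = nn.ParameterList(
            nn.Parameter(torch.randn(codebook_size, dim) * 0.02)
            for _ in range(n_quantizers))

    def forward(self, z):  # z: (batch, frames, dim), e.g. encoder output
        b, t, d = z.shape
        residual, quantized, codes = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            # Nearest codeword (L2) for each frame's current residual.
            dist = torch.cdist(residual.reshape(b * t, d), cb)  # (b*t, K)
            idx = dist.argmin(dim=-1).view(b, t)
            q = cb[idx]                       # (b, t, dim) selected codewords
            quantized = quantized + q
            residual = residual - q
            codes.append(idx)
        # One integer token stream per quantizer stage.
        return quantized, torch.stack(codes, dim=-1)  # (b, t, n_quantizers)
```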
Yuanxin Xia, Xinyan Li, Matteo Calafà ... · arXiv
Standardized laboratory characterizations for absorbing materials rely on idealized sound field assumptions, which deviate largely from real-life conditions. Consequently, in-situ acoustic characterization has become essential for accurate diagnosis and virtual prototyping...