Audio ML Papers

Last 7 Days (February 10 - February 17, 2026)


🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
#2 TOP PAPER (Score: 84)
Jingru Lin, Chen Zhang, Tianrui Wang ... · Audio-AAAI
Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus onl...
#3 TOP PAPER (Score: 83)
Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres ... · International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, h...
Friday, February 13, 2026
Giovanni Bologni, Nicolás Arrieta Larraza, Richard Heusdens ... · arXiv
Deep Neural Networks (DNNs) often struggle to suppress noise at low signal-to-noise ratios (SNRs). This paper addresses speech enhancement in scenarios dominated by harmonic noise and proposes a framework that integrates cyclostationarity-aware preprocessing with lightweight DNN-...
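The snippet above doesn't spell out the paper's preprocessing, but one classical cyclostationarity-aware idea it may build on is synchronous averaging: if the noise statistics repeat with a known period, averaging whole cycles isolates the harmonic component so it can be subtracted before the DNN stage. A minimal NumPy sketch under that assumption (the period estimate and cycle count are illustrative inputs, not from the paper):

```python
import numpy as np

def synchronous_average(x, period, n_cycles):
    """Estimate the periodic (harmonic) component of a signal whose
    noise statistics repeat every `period` samples, by averaging
    `n_cycles` consecutive cycles (synchronous averaging)."""
    usable = period * n_cycles
    frames = x[:usable].reshape(n_cycles, period)
    return frames.mean(axis=0)  # one averaged cycle

def suppress_harmonic_noise(x, period, n_cycles=32):
    """Subtract a tiled estimate of the cyclic mean from the signal,
    leaving the non-periodic (e.g. speech) residual."""
    cycle = synchronous_average(x, period, n_cycles)
    template = np.tile(cycle, len(x) // period + 1)[: len(x)]
    return x - template

# Example: at fs = 16 kHz, a 100 Hz machinery hum has a 160-sample cycle.
# enhanced = suppress_harmonic_noise(noisy, period=160)
```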
Jaeyoung Lee, Masato Mimura · ICASSP 2026
We present a decoder-only Conformer for automatic speech recognition (ASR) that processes speech and text in a single stack without external speech encoders or pretrained large language models (LLM). The model uses a modality-aware sparse mixture of experts (MoE): disjoint expert...
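For readers unfamiliar with modality-aware routing: the sketch below is a guess at the general mechanism the abstract names (disjoint per-modality expert subsets with sparse top-1 routing), not the paper's implementation; sizes and routing details are illustrative.

```python
import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Toy modality-aware sparse MoE layer: speech tokens and text
    tokens are routed only within disjoint expert subsets."""
    def __init__(self, d_model=256, experts_per_modality=4):
        super().__init__()
        self.n = experts_per_modality
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(2 * experts_per_modality))  # [speech | text]
        self.router = nn.Linear(d_model, 2 * experts_per_modality)

    def forward(self, x, is_text):
        # x: (tokens, d_model); is_text: (tokens,) bool
        logits = self.router(x)                   # (tokens, 2n)
        # Mask the other modality's experts before top-1 routing.
        mask = torch.zeros_like(logits)
        mask[~is_text, self.n:] = float('-inf')   # speech -> experts [0, n)
        mask[is_text, :self.n] = float('-inf')    # text   -> experts [n, 2n)
        probs = (logits + mask).softmax(dim=-1)
        top_p, top_i = probs.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top_i == e
            if sel.any():
                out[sel] = top_p[sel].unsqueeze(1) * expert(x[sel])
        return out
```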
Thursday, February 12, 2026
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
Xingyu Chen, Hanwen Bi, Fei Ma ... · arXiv
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject ...
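As background on the "kernel-based weighting" baseline the abstract mentions, a minimal single-subject sketch: interpolate the HRTF at an unmeasured direction as a normalized Gaussian-kernel average over the sparse measurements. The kernel width `beta` is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def gaussian_kernel_hrtf(target_dir, measured_dirs, measured_hrtfs, beta=8.0):
    """Kernel-weighted HRTF interpolation from sparse measurements.

    target_dir:     (3,) unit vector for the query direction
    measured_dirs:  (M, 3) unit vectors of measured directions
    measured_hrtfs: (M, F) complex frequency responses
    """
    cos_angle = np.clip(measured_dirs @ target_dir, -1.0, 1.0)
    dist = np.arccos(cos_angle)        # great-circle distance to each measurement
    w = np.exp(-beta * dist ** 2)      # Gaussian kernel weights
    w /= w.sum()                       # normalize to a weighted average
    return w @ measured_hrtfs          # (F,) interpolated HRTF
```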
Yifan Liang, Andong Li, Kang Yang ... · arXiv
Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion ...
Wednesday, February 11, 2026
Jingru Lin, Chen Zhang, Tianrui Wang ... · Audio-AAAI
Due to recent advancements in Large Audio-Language Models (LALMs) that demonstrate remarkable performance across a range of sound-, speech- and music-related tasks, there is a growing interest in proposing benchmarks to assess these models. Existing benchmarks generally focus onl...
Liyang Chen, Hongkai Chen, Yujun Cai ... · arXiv
Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine grained auditory perception remains unreliable, and existing approaches largely rely on data intensive training to internalize percep...
Yitian Gong, Kuangwei Chen, Zhaoye Fei ... · arXiv
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures....
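For context on what a discrete audio tokenizer produces, here is the generic residual vector quantization (RVQ) recipe used by many such tokenizers; it is a hedged illustration of the task, not this paper's architecture.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage, yielding one token index per stage.

    frame:     (D,) embedding of one audio frame
    codebooks: list of (K, D) arrays, one per quantizer stage
    """
    residual, tokens = frame.astype(float), []
    for cb in codebooks:
        idx = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]   # next stage refines the remainder
    return tokens
```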
Yuanxin Xia, Xinyan Li, Matteo Calafà ... · arXiv
Standardized laboratory characterizations for absorbing materials rely on idealized sound field assumptions, which deviate largely from real-life conditions. Consequently, in-situ acoustic characterization has become essential for accurate diagnosis and virtual prototyping...
Tuesday, February 10, 2026
Heitor R. Guimarães, Abhishek Tiwari, Mahsa Abdollahi ... · arXiv
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio enco...
Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres ... · International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2026
Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, h...
Wenfu Wang, Chenxing Li, Liqiang Zhang ... · arXiv
In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-t...
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah ... · arXiv
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global emb...
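To make the "static global embedding" mismatch concrete, a minimal sketch of the conventional conditioning pattern the abstract critiques: frame-level content features modulated by one fixed speaker vector (FiLM-style), so identity cannot vary over time. Names and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StaticSpeakerConditioning(nn.Module):
    """Conventional design: time-varying content, static speaker identity."""
    def __init__(self, d_content=256, d_speaker=192):
        super().__init__()
        self.film = nn.Linear(d_speaker, 2 * d_content)

    def forward(self, content, speaker):
        # content: (T, d_content) time-varying; speaker: (d_speaker,) static
        scale, shift = self.film(speaker).chunk(2, dim=-1)
        return content * (1 + scale) + shift  # identical modulation at every frame
```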