Audio ML Papers

Last 7 Days (February 15 - February 22, 2026)

Subcategories: All (16) | Speech Synthesis (2) | Music Synthesis (1) | Ambient Synthesis (1) | Quality Assessment (1) | Enhancement (0) | ASR (2) | Other (9)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 90)
Pengfei Zhang, Tianxin Xie, Minghao Yang ... · The Fourteenth International Conference on Learning Representations (ICLR 2026)
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe ...
#2 TOP PAPER (Score: 83)
Reda Bensaid, Amine Ouasfi, Yassir Bendou ... · arXiv
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through...
#3 TOP PAPER (Score: 83)
Ziyang Ma, Ruiyang Xu, Yinghao Ma ... · arXiv
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) qua...
Thursday, February 19, 2026
William Chen, Prem Seetharaman, Rithesh Kumar ... · arXiv
Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio process...
Yuma Shirahata, Ryuichi Yamamoto · ICASSP 2026
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which en...
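The chunk-by-chunk idea is straightforward to illustrate. Below is a minimal sketch, not the paper's model: the GRU stands in for the Conformer blocks, and the chunk size, vocabulary sizes, and CTC blank index are all assumptions.
```python
# Minimal sketch of chunked streaming G2P with greedy CTC decoding.
# The GRU is a stand-in for Conformer blocks; all sizes are assumptions.
import torch
import torch.nn as nn

BLANK = 0  # assumed CTC blank index

class TinyStreamingG2P(nn.Module):
    def __init__(self, n_graphemes=64, n_phonemes=48, d=128):
        super().__init__()
        self.embed = nn.Embedding(n_graphemes, d)
        self.rnn = nn.GRU(d, d, batch_first=True)  # stand-in for Conformer
        self.head = nn.Linear(d, n_phonemes)       # CTC logits (index 0 = blank)

    def forward_chunk(self, chunk_ids, state=None):
        h, state = self.rnn(self.embed(chunk_ids), state)  # state carries across chunks
        return self.head(h), state

def stream_decode(model, grapheme_ids, chunk=8):
    """Feed grapheme tokens chunk by chunk; collapse CTC output on the fly."""
    state, out, prev = None, [], BLANK
    for i in range(0, len(grapheme_ids), chunk):
        ids = torch.tensor(grapheme_ids[i:i + chunk]).unsqueeze(0)
        logits, state = model.forward_chunk(ids, state)
        for t in logits.argmax(-1).squeeze(0).tolist():
            if t != BLANK and t != prev:
                out.append(t)  # phoneme emitted as soon as its chunk is seen
            prev = t
    return out

model = TinyStreamingG2P().eval()
with torch.no_grad():
    print(stream_decode(model, list(range(1, 20))))
```
Because both the recurrent state and the CTC collapse state persist across chunks, phonemes are emitted per chunk rather than after the full sentence, which is the property a streaming LLM-to-TTS pipeline needs.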
Wednesday, February 18, 2026
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds ... · arXiv
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply nex...
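The truncated sentence appears to describe next-token prediction over audio tokens. A hedged sketch of that generic objective, where the codebook size, model width, and causal-Transformer stand-in are all assumptions rather than the paper's setup:
```python
# Next-token prediction on discrete audio tokens; the tokenizer is abstracted
# away and the waveform is assumed to be already quantized into integer codes.
import torch
import torch.nn as nn

VOCAB, D = 1024, 256                            # assumed codebook size / width
embed = nn.Embedding(VOCAB, D)
lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(D, VOCAB)

codes = torch.randint(0, VOCAB, (8, 128))       # batch of tokenized audio clips
x = embed(codes[:, :-1])                        # inputs: tokens 0..T-1
causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
h = lm(x, mask=causal)                          # each step sees only the past
loss = nn.functional.cross_entropy(head(h).reshape(-1, VOCAB),
                                   codes[:, 1:].reshape(-1))  # targets: 1..T
```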
Houtan Ghaffari, Lukas Rauch, Christoph Scholz ... · arXiv
Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potenti...
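The probing setup the abstract contrasts with fine-tuning is easy to pin down in code. A minimal sketch, with a placeholder encoder and random data standing in for a real SSL model and dataset:
```python
# Linear probing: freeze a pretrained encoder, train only a linear classifier
# on its pooled embeddings. The encoder and data here are toy placeholders.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv1d(1, 32, 9, stride=4), nn.ReLU(),
                        nn.AdaptiveAvgPool1d(1), nn.Flatten())  # stand-in SSL model
for p in encoder.parameters():
    p.requires_grad = False            # frozen: no encoder weights are updated
probe = nn.Linear(32, 10)              # the only trainable parameters

opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):                     # toy training loop on random data
    wav = torch.randn(16, 1, 16000)    # batch of 1-second 16 kHz clips
    labels = torch.randint(0, 10, (16,))
    with torch.no_grad():
        feats = encoder(wav)           # embeddings computed without gradients
    loss = loss_fn(probe(feats), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```
The key property is that the encoder receives no gradient updates, so probe accuracy reflects the quality of the frozen embeddings rather than the optimizer's ability to repair them.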
Prem Seetharaman, Oriol Nieto, Justin Salamon · ICASSP 2026
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a ...
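The masking mechanism mentioned here follows the general latent-inpainting recipe. A rough sketch under assumed shapes, with a dummy denoiser and schedule standing in for the DiT and its sampler:
```python
# Latent inpainting: at every denoising step, positions covered by the mask are
# overwritten with a re-noised copy of the reference latents, so generation
# only fills the unmasked region. All shapes and the schedule are assumptions.
import torch

def inpaint_step(x_t, ref_latents, mask, noise_level, denoise_fn):
    """mask==1 where the reference audio must be kept, 0 where we generate."""
    noised_ref = ref_latents + noise_level * torch.randn_like(ref_latents)
    x_t = mask * noised_ref + (1 - mask) * x_t   # clamp the known region
    return denoise_fn(x_t, noise_level)          # one reverse-diffusion step

# Toy run: extend a 100-frame latent clip by 50 generated frames.
ref = torch.randn(1, 150, 64); ref[:, 100:] = 0   # last 50 frames are unknown
mask = torch.zeros(1, 150, 1); mask[:, :100] = 1
x = torch.randn_like(ref)
for lvl in torch.linspace(1.0, 0.01, 20):
    x = inpaint_step(x, ref, mask, lvl.item(), lambda z, s: z - 0.05 * z)  # dummy denoiser
```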
Tuesday, February 17, 2026
Sonal Kumar, Prem Seetharaman, Ke Chen ... · arXiv
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions a...
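What "temporally grounded" output can look like is worth pinning down: each described event carries start/end times, so a caption can be checked for temporal consistency instead of being one free-form paragraph. The field names below are hypothetical, not the paper's output format:
```python
# Hypothetical timestamped-caption structure and a chronological renderer.
from dataclasses import dataclass

@dataclass
class AudioEvent:
    start: float   # seconds
    end: float
    caption: str

def to_timeline(events):
    """Render events in chronological order as timestamped caption lines."""
    lines = []
    for e in sorted(events, key=lambda e: e.start):
        assert e.end > e.start, "malformed span"
        lines.append(f"[{e.start:6.2f}-{e.end:6.2f}] {e.caption}")
    return "\n".join(lines)

print(to_timeline([AudioEvent(3.1, 7.4, "dog barking"),
                   AudioEvent(0.0, 9.0, "rain on a window")]))
```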
Samir Sadok, Laurent Girin, Xavier Alameda-Pineda · arXiv
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are not robust to global variations of the input signal level, in the sense that such varia...
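The gain/shape split is simple to state in code. A minimal sketch assuming per-frame RMS as the gain (the paper's exact definitions may differ):
```python
# Split each frame into a scalar gain (RMS energy) and a unit-energy shape
# vector, so a global level change only moves the gains, not the shapes.
import numpy as np

def gain_shape_split(frames, eps=1e-8):
    gain = np.sqrt((frames ** 2).mean(axis=-1, keepdims=True))  # per-frame RMS
    shape = frames / (gain + eps)                               # normalized structure
    return gain, shape

frames = np.random.randn(10, 256)           # 10 frames of 256 samples
g1, s1 = gain_shape_split(frames)
g2, s2 = gain_shape_split(6.0 * frames)     # global level change (~15.6 dB)
assert np.allclose(s1, s2)                  # shape is level-invariant
assert np.allclose(g2, 6.0 * g1)            # the level change lives in the gain
```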
Jonah Casebeer, Ge Zhu, Zhepei Wang ... · arXiv
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent ...
Adnan El Assadi, Isaac Chung, Chenghao Xiao ... · arXiv
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks...
Monday, February 16, 2026
Pengfei Zhang, Tianxin Xie, Minghao Yang ... · The Fourteenth International Conference on Learning Representations (ICLR 2026)
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe ...
Naveen Vakada, Kartik Hegde, Arvind Krishna Sridhar ... · arXiv
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models sh...
Zineb Lahrichi, Gaëtan Hadjeres, Gaël Richard ... · International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2026, Barcelona, Spain
Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing meth...
Sunday, February 15, 2026
Reda Bensaid, Amine Ouasfi, Yassir Bendou ... · arXiv
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through...
Ziyang Ma, Ruiyang Xu, Yinghao Ma ... · arXiv
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) qua...
H. M. Shadman Tabib, Istiak Ahmmed Rifti, Abdullah Muhammed Amimul Ehsan ... · arXiv
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a re...
Dan Zhang, Yishu Lei, Jing Hu ... · arXiv
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrat...