Audio ML Papers

Last 7 Days (February 17 - February 24, 2026)

Subcategories: All (14) | Speech Synthesis (2) | Music Synthesis (0) | Ambient Synthesis (1) | Quality Assessment (0) | Enhancement (2) | ASR (0) | Other (9)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Sonal Kumar, Prem Seetharaman, Ke Chen ... · arXiv
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions a...
#2 TOP PAPER (Score: 83)
Samir Sadok, Laurent Girin, Xavier Alameda-Pineda · arXiv
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are not robust to a global variation of the input signal level, in the sense that such varia...
#3 TOP PAPER (Score: 83)
Yuma Shirahata, Ryuichi Yamamoto · ICASSP 2026
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which en...
Friday, February 20, 2026
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka ... · ICASSP 2026
In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a nove...
Dahan Wang, Jun Gao, Tong Lei ... · 40th AAAI Conference on Artificial Intelligence (AAAI-26)
Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this pap...
Jilan Xu, Carl Thomé, Danijela Horak ... · arXiv
Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to under...
Thursday, February 19, 2026
William Chen, Prem Seetharaman, Rithesh Kumar ... · arXiv
Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio process...
Yuma Shirahata, Ryuichi Yamamoto · ICASSP 2026
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which en...
Wednesday, February 18, 2026
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds ... · arXiv
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply nex...
Houtan Ghaffari, Lukas Rauch, Christoph Scholz ... · arXiv
Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potenti...
Prem Seetharaman, Oriol Nieto, Justin Salamon · ICASSP 2026
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a ...
Emilio Picard, Diego Di Carlo, Aditya Arie Nugraha ... · ICASSP, May 2026, Barcelona, Spain
This paper presents virtual upmixing of steering vectors captured by a spherical microphone array with fewer channels. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering t...
Tuesday, February 17, 2026
Sonal Kumar, Prem Seetharaman, Ke Chen ... · arXiv
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions a...
Samir Sadok, Laurent Girin, Xavier Alameda-Pineda · arXiv
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are not robust to a global variation of the input signal level, in the sense that such varia...
Jonah Casebeer, Ge Zhu, Zhepei Wang ... · arXiv
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent ...
Adnan El Assadi, Isaac Chung, Chenghao Xiao ... · arXiv
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks...