Audio ML Papers

Last 7 Days (February 18 - February 25, 2026)

Subcategories: All (20) | Speech Synthesis (3) | Music Synthesis (2) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (3) | ASR (1) | Other (11)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli · arXiv
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited...
#2 TOP PAPER (Score: 84)
Karan Thakkar, Mounya Elhilali · ICASSP 2026
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Pr...
#3 TOP PAPER (Score: 83)
Yuma Shirahata, Ryuichi Yamamoto · ICASSP 2026
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model to connect large language model and text-to-speech in a streaming manner. CC-G2PnP is based on Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which en...
Monday, February 23, 2026
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli · arXiv
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited...
Karan Thakkar, Mounya Elhilali · ICASSP 2026
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Pr...
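For context, the standard baseline in this line of work is a linear "backward" stimulus-reconstruction model mapping time-lagged EEG channels to the speech envelope. A minimal sketch, assuming 64-channel EEG and a closed-form ridge solver; this illustrates the task, not the paper's method:

```python
# Linear backward decoder: lagged EEG -> speech envelope (ridge regression).
import numpy as np

def lagged_features(eeg, n_lags):
    """Stack time-lagged copies of each channel: (T, C) -> (T, C * n_lags)."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return X

def fit_ridge_decoder(eeg, envelope, n_lags=32, alpha=1e2):
    """Closed-form ridge regression from lagged EEG to the audio envelope."""
    X = lagged_features(eeg, n_lags)
    w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ envelope)
    return w

# Toy usage with synthetic data (64-channel EEG, 10 s at 64 Hz).
rng = np.random.default_rng(0)
eeg = rng.standard_normal((640, 64))
envelope = rng.standard_normal(640)
w = fit_ridge_decoder(eeg, envelope)
reconstruction = lagged_features(eeg, 32) @ w
r = np.corrcoef(reconstruction, envelope)[0, 1]  # fidelity metric (Pearson r)
```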
Sifei Li, Yang Li, Zizhou Wang ... · ICLR 2026
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through m...
Yue Pan, Xingyao Wang, Hanyue Zhang ... · arXiv
Remote monitoring of heart failure (HF) via speech signals provides a non-invasive and cost-effective solution for long-term patient management. However, substantial inter-individual heterogeneity in vocal characteristics often limits the accuracy of traditional cross-sectional c...
Yungang Yi · arXiv
Long-context modeling is essential for symbolic music generation, since motif repetition and developmental variation can span thousands of musical events. However, practical composition and performance workflows frequently rely on resource-limited devices (e.g., electronic instru...
Nghia Phan, Rong Jin, Gang Liu ... · arXiv
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we prese...
Sunday, February 22, 2026
Qibing Bai, Shuhao Shi, Shuai Wang ... · ICASSP 2026
Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generat...
Saturday, February 21, 2026
Hao Yen, Pin-Jui Ku, Ante Jukić ... · arXiv
In sequence-to-sequence Transformer ASR, autoregressive (AR) models achieve strong accuracy but suffer from slow decoding, while non-autoregressive (NAR) models enable parallel decoding at the cost of degraded performance. We propose a principled NAR ASR framework based on Masked...
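The NAR family the abstract invokes typically decodes by iterative mask-predict refinement: all positions are predicted in parallel, then low-confidence positions are re-masked and revisited. A toy sketch of that loop, with a stub scorer standing in for the Transformer:

```python
# Mask-predict style non-autoregressive decoding (Ghazvininejad et al., 2019).
import numpy as np

MASK = -1

def mask_predict(score_fn, length, n_iters=4):
    tokens = np.full(length, MASK)
    conf = np.zeros(length)
    for it in range(n_iters):
        probs = score_fn(tokens)                 # (length, vocab), in parallel
        masked = tokens == MASK
        tokens[masked] = probs.argmax(axis=-1)[masked]
        conf[masked] = probs.max(axis=-1)[masked]
        # Linear schedule: re-mask the lowest-confidence positions.
        n_mask = int(length * (1 - (it + 1) / n_iters))
        if n_mask > 0:
            worst = np.argsort(conf)[:n_mask]
            tokens[worst] = MASK
            conf[worst] = 0.0
    return tokens

# Toy scorer: a fixed random distribution per position.
table = np.random.default_rng(1).random((20, 10))
table /= table.sum(axis=-1, keepdims=True)
print(mask_predict(lambda t: table, length=20))
```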
Youjun Chen, Guinan Li, Mengzhe Geng ... · ICASSP 2026
This paper highlights the critical importance of multi-channel speech enhancement (MCSE) for speech emotion recognition (ER) in cocktail party scenarios. A multi-channel speech dereverberation and separation front-end integrating DNN-WPE and mask-based MVDR is used to extract the...
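The mask-based MVDR stage of such a front-end can be written compactly. A sketch using the rank-one (Souden-style) MVDR formulation, with the DNN mask estimator stubbed out and all shapes assumed:

```python
# Mask-based MVDR beamformer; mask estimation (the DNN part) is stubbed.
import numpy as np

def mask_mvdr(stft, speech_mask, ref_ch=0, eps=1e-6):
    """stft: (C, F, T) complex spectrogram; speech_mask: (F, T) in [0, 1]."""
    C, F, T = stft.shape
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]                                    # (C, T)
        m = speech_mask[f]
        phi_s = (m * X) @ X.conj().T / max(m.sum(), eps)     # speech cov
        phi_n = ((1 - m) * X) @ X.conj().T / max((1 - m).sum(), eps)
        phi_n += eps * np.eye(C)                             # regularize
        num = np.linalg.solve(phi_n, phi_s)                  # Phi_n^-1 Phi_s
        w = num[:, ref_ch] / max(np.trace(num).real, eps)    # Souden MVDR
        out[f] = w.conj() @ X
    return out

rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 257, 100)) + 1j * rng.standard_normal((4, 257, 100))
enhanced = mask_mvdr(Y, rng.random((257, 100)))
```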
Kwanghee Choi, Eunjung Yeo, Cheol Jun Cho ... · arXiv
Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular atten...
Friday, February 20, 2026
Arianna Francesconi, Zhixiang Dai, Arthur Stefano Moscheni ... · arXiv
Voice-based digital biomarkers can enable scalable, non-invasive screening and monitoring of Parkinson's disease (PD) and Amyotrophic Lateral Sclerosis (ALS). However, models trained on one cohort or device often fail on new acquisition settings due to cross-device and cross-coho...
Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka ... · ICASSP 2026
In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a nove...
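The speed gap the abstract targets comes from numerical ODE integration at inference time. A toy contrast between many-step Euler sampling and a one-step sampler built on an average velocity; the exact one-step form used by MeanVoiceFlow is not given in the excerpt, so the closed-form field below is purely illustrative:

```python
import numpy as np

def velocity(x, t, target):
    """Toy velocity field pointing from the current state toward `target`."""
    return target - x  # exact solution: x(t) = target + (x0 - target) * e^-t

def euler_sample(x0, target, n_steps=32):
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity(x, i * dt, target)  # one NFE per step: slow
    return x

def one_step_sample(x0, target):
    # Replace the instantaneous velocity with the average velocity over
    # [0, 1]; for this toy field the average is known in closed form.
    mean_v = (1 - np.exp(-1.0)) * (target - x0)
    return x0 + mean_v                            # a single NFE: fast

x0 = np.random.default_rng(0).standard_normal(8)
target = np.ones(8)
print(euler_sample(x0, target)[:3])
print(one_step_sample(x0, target)[:3])
```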
Dahan Wang, Jun Gao, Tong Lei ... · 40th AAAI Conference on Artificial Intelligence (AAAI-26)
Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this pap...
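The shared core of these paradigms is a regression target defined along a path between paired noisy and clean signals. A minimal conditional flow-matching training step, assuming a linear path and a toy MLP in place of the enhancement network:

```python
import torch

vtheta = torch.nn.Sequential(torch.nn.Linear(257 + 1, 256), torch.nn.ReLU(),
                             torch.nn.Linear(256, 257))
opt = torch.optim.Adam(vtheta.parameters(), lr=1e-4)

def cfm_step(clean, noisy):
    """clean, noisy: (batch, 257) paired spectra; linear probability path."""
    t = torch.rand(clean.shape[0], 1)
    x_t = (1 - t) * clean + t * noisy    # point on the straight path
    u = noisy - clean                    # the path's constant velocity
    pred = vtheta(torch.cat([x_t, t], dim=-1))
    loss = torch.mean((pred - u) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

clean = torch.randn(16, 257)
noisy = clean + 0.3 * torch.randn(16, 257)
print(cfm_step(clean, noisy))
```

At inference, one would integrate the learned velocity field from the noisy signal back toward the clean estimate.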
Jilan Xu, Carl Thomé, Danijela Horak ... · arXiv
Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to under...
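For context, the dual-encoder objective behind CLAP-style retrieval is a symmetric InfoNCE loss over matched audio-text pairs; a sketch with the encoders stubbed out as assumptions:

```python
import torch
import torch.nn.functional as F

def clap_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim); row i of each is a matched pair."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # (batch, batch) similarities
    labels = torch.arange(a.shape[0])              # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

loss = clap_loss(torch.randn(8, 512), torch.randn(8, 512))
```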
Thursday, February 19, 2026
William Chen, Prem Seetharaman, Rithesh Kumar ... · arXiv
Despite recent breakthroughs, audio foundation models struggle in processing complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio process...
Yuma Shirahata, Ryuichi Yamamoto · ICASSP 2026
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model to connect large language model and text-to-speech in a streaming manner. CC-G2PnP is based on Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which en...
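The chunk-by-chunk pattern the abstract describes pairs naturally with greedy CTC decoding, where only the last emitted token needs to be carried across chunk boundaries. A toy sketch, with a stub in place of the Conformer encoder:

```python
import numpy as np

BLANK = 0

def stream_ctc(logits_fn, frames, chunk=16):
    """Greedy CTC over chunks; only `prev` crosses the chunk boundary."""
    prev, out = BLANK, []
    for start in range(0, len(frames), chunk):
        logits = logits_fn(frames[start:start + chunk])  # (chunk, vocab)
        for tok in logits.argmax(axis=-1):
            if tok != BLANK and tok != prev:             # CTC collapse rule
                out.append(int(tok))
            prev = tok
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((64, 80))
print(stream_ctc(lambda x: rng.random((len(x), 40)), frames))
```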
Wednesday, February 18, 2026
Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds ... · arXiv
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply nex...
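The "native" recipe under study amounts to plain next-token prediction over discrete audio codec tokens. A minimal sketch, assuming a codebook size of 1024 and a tiny causal Transformer as stand-ins:

```python
import torch

vocab, dim = 1024, 128
model = torch.nn.ModuleDict({
    "emb": torch.nn.Embedding(vocab, dim),
    "backbone": torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
        num_layers=2),
    "head": torch.nn.Linear(dim, vocab),
})

def next_token_loss(codes):
    """codes: (batch, T) integer tokens, e.g. from a neural audio codec."""
    x, y = codes[:, :-1], codes[:, 1:]               # shift for next-token
    causal = torch.nn.Transformer.generate_square_subsequent_mask(x.shape[1])
    h = model["backbone"](model["emb"](x), mask=causal)
    return torch.nn.functional.cross_entropy(
        model["head"](h).reshape(-1, vocab), y.reshape(-1))

codes = torch.randint(0, vocab, (4, 64))
print(next_token_loss(codes).item())
```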
Houtan Ghaffari, Lukas Rauch, Christoph Scholz ... · arXiv
Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potenti...
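The probing protocol in question trains only a linear head on frozen embeddings, so any gain over chance is attributable to the representation itself. A sketch with stand-ins for the SSL model and data:

```python
import torch

def linear_probe(ssl_model, probe, batches, epochs=5, lr=1e-3):
    ssl_model.eval()                                  # frozen, never updated
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for audio, labels in batches:
            with torch.no_grad():                     # embeddings only
                feats = ssl_model(audio).mean(dim=1)  # pool over time
            loss = torch.nn.functional.cross_entropy(probe(feats), labels)
            opt.zero_grad(); loss.backward(); opt.step()
    return probe

# Toy stand-ins: an "SSL model" emitting (batch, T, 768) features, 10 classes.
ssl_model = torch.nn.Sequential(torch.nn.Linear(80, 768))
probe = torch.nn.Linear(768, 10)
batches = [(torch.randn(8, 50, 80), torch.randint(0, 10, (8,)))]
linear_probe(ssl_model, probe, batches)
```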
Prem Seetharaman, Oriol Nieto, Justin Salamon · ICASSP 2026
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a ...
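The masking mechanism the abstract sketches resembles inpainting-style sampling: at each denoising step, latents in the "known" region are replaced with appropriately re-noised reference latents while the masked region is generated. A toy sketch; the one-step denoiser and linear schedule are stand-ins, not the DiT itself:

```python
import torch

def masked_inpaint(denoise_fn, ref_latents, keep_mask, n_steps=50):
    """ref_latents: (T, D) latents of the reference audio;
    keep_mask: (T, 1), 1 where the reference is kept, 0 where we generate."""
    x = torch.randn_like(ref_latents)
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        # Re-noise the reference to the current noise level, then composite.
        noisy_ref = (1 - t) * ref_latents + t * torch.randn_like(ref_latents)
        x = keep_mask * noisy_ref + (1 - keep_mask) * x
        x = denoise_fn(x, t)             # one step of the (stubbed) denoiser
    return x

ref = torch.randn(100, 64)
mask = torch.cat([torch.ones(50, 1), torch.zeros(50, 1)])  # extend 2nd half
out = masked_inpaint(lambda x, t: 0.98 * x, ref, mask)
```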
Emilio Picard, Diego Di Carlo, Aditya Arie Nugraha ... · ICASSP, May 2026, Barcelona, Spain
This paper presents virtual upmixing of steering vectors captured by a fewer-channel spherical microphone array. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering t...
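For orientation, first-order ambisonics represents a sound field with four spherical-harmonic channels, and upmixing amounts to re-rendering them at higher spatial resolution. A sketch of FOA encoding and a simple virtual re-rendering; the ACN/SN3D conventions and the projection-style decoder are assumptions, not the paper's method:

```python
import numpy as np

def foa_steering(azimuth, elevation):
    """B-format (W, Y, Z, X) steering vector, ACN order, SN3D normalization."""
    return np.array([
        1.0,                                   # W: omnidirectional
        np.sin(azimuth) * np.cos(elevation),   # Y
        np.sin(elevation),                     # Z
        np.cos(azimuth) * np.cos(elevation),   # X
    ])

def render_to_directions(foa_signal, directions):
    """Project a (4, T) FOA signal onto virtual cardioid-like beams."""
    D = np.stack([foa_steering(az, el) for az, el in directions])  # (N, 4)
    return D @ foa_signal                                          # (N, T)

# Encode one 440 Hz plane-wave source at 45° azimuth, then re-render it.
src = np.sin(2 * np.pi * 440 * np.arange(0, 0.01, 1 / 16000))
foa = np.outer(foa_steering(np.pi / 4, 0.0), src)
virt = render_to_directions(foa, [(0.0, 0.0), (np.pi / 2, 0.0)])
```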