Audio ML Papers

Week of February 22 - March 1, 2026

Subcategories: All (23) | Speech Synthesis (3) | Music Synthesis (5) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (1) | ASR (3) | Other (9)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli · arXiv
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited...
#2 TOP PAPER (Score: 84)
Karan Thakkar, Mounya Elhilali · ICASSP 2026
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Pr...
#3 TOP PAPER (Score: 83)
Sifei Li, Yang Li, Zizhou Wang ... · ICLR 2026
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through m...
Saturday, February 28, 2026
Seunghyun Oh, Malek Itani, Aseem Gauri ... · arXiv
Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We int...
Yinghao Ma, Haiwen Xia, Hewei Gao ... · arXiv
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under ...
Sen Zhang, Jianguo Wei, Wenhuan Lu ... · ICASSP 2026
The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is pr...
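The excerpt above points at the KV cache as the memory bottleneck: in multi-head attention, the cached keys and values grow linearly with sequence length. A minimal back-of-the-envelope sketch (not the paper's method; the Whisper-large-style dimensions below are illustrative assumptions):

```python
def kv_cache_bytes(n_layers: int, seq_len: int, n_heads: int,
                   head_dim: int, dtype_bytes: int) -> int:
    """Memory for cached keys AND values across all layers.

    Per layer and per token, attention caches one key vector and one
    value vector of size n_heads * head_dim each -- hence the factor 2.
    """
    return 2 * n_layers * seq_len * n_heads * head_dim * dtype_bytes

# Illustrative Whisper-large-like decoder config (assumed, not from the paper):
# 32 layers, 20 heads, head_dim 64, fp16 (2 bytes), 1500 cached positions.
mem = kv_cache_bytes(n_layers=32, seq_len=1500, n_heads=20,
                     head_dim=64, dtype_bytes=2)
print(f"{mem / 1e6:.1f} MB")  # grows linearly in seq_len
```

Doubling `seq_len` doubles the cache, which is why long-audio ASR runs into GPU memory pressure and why KV-cache compression is an active target.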
Jinhan Xu, Xing Tang, Houpeng Yang ... · arXiv
Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. Though recent diffusion-based models produce high quality generations, they tend to s...
Friday, February 27, 2026
Heinrich Dinkel, Xingwei Sun, Gang Li ... · arXiv
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this ...
Keita Goto, Takashi Maekaku, Jin Sakuma ... · ICASSP 2026
Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers, learnable tokens appended to eac...
Thursday, February 26, 2026
Zeyu Xie, Chenxing Li, Qiao Jin ... · arXiv
Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discrimi...
Trung Dang, Sharath Rao, Ananya Gupta ... · arXiv
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are sig...
Sanjid Hasan, Risalat Labib, A H M Fuad ... · arXiv
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, w...
Wednesday, February 25, 2026
Songjun Cao, Yuqi Li, Yunpeng Luo ... · arXiv
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains...
Yuzhu Wang, Archontis Politis, Konstantinos Drossos ... · IEEE Transactions on Audio, Speech and Language Processing
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both featu...
Yuxuan Chen, Peize He, Haoyuan Xu ... · arXiv
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task tr...
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen ... · LREC 2026
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien...
Tuesday, February 24, 2026
Townim Faisal Chowdhury, Ta Duc Huy, Siqi Pan ... · ICASSP
Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated conce...
Monday, February 23, 2026
Yisi Liu, Nicholas Lee, Gopala Anumanchipalli · arXiv
Voice style conversion aims to transform an input utterance to match a target speaker's timbre, accent, and emotion, with a central challenge being the disentanglement of linguistic content from style. While prior work has explored this problem, conversion quality remains limited...
Karan Thakkar, Mounya Elhilali · ICASSP 2026
Reconstructing the speech audio envelope from scalp neural recordings (EEG) is a central task for decoding a listener's attentional focus in applications like neuro-steered hearing aids. Current methods for this reconstruction, however, face challenges with fidelity and noise. Pr...
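The excerpt above describes envelope reconstruction from EEG. The standard baseline for this task (a backward model, not this paper's contribution) is ridge regression from time-lagged EEG channels to the audio envelope. A minimal sketch on synthetic data; channel counts, lag window, and regularization strength are illustrative assumptions:

```python
import numpy as np

def lagged_design(eeg: np.ndarray, n_lags: int) -> np.ndarray:
    """Stack time-lagged copies of the EEG: (T, C) -> (T, C * n_lags)."""
    T, C = eeg.shape
    X = np.zeros((T, C * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * C:(lag + 1) * C] = eeg[:T - lag]
    return X

def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float) -> np.ndarray:
    """Closed-form ridge solution: (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic demo: an "envelope" that really is a lagged linear mix of the EEG.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((2000, 8))          # 2000 samples, 8 channels (assumed)
X = lagged_design(eeg, n_lags=5)
w_true = rng.standard_normal(X.shape[1])
envelope = X @ w_true + 0.1 * rng.standard_normal(2000)

w = ridge_fit(X, envelope, lam=1e-2)
r = np.corrcoef(X @ w, envelope)[0, 1]        # reconstruction accuracy
```

Real EEG is far noisier than this toy setup; reported envelope-reconstruction correlations are typically modest, which is the fidelity gap the excerpt alludes to.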
Hanwen Liu, Saierdaer Yusuyin, Hao Huang ... · INTERSPEECH 2026
Large-language-model (LLM)-based text-to-speech (TTS) systems can generate natural speech, but most are not designed for low-latency dual-streaming synthesis. High-quality dual-streaming TTS depends on accurate text--speech alignment and well-designed training sequences that bala...
Sifei Li, Yang Li, Zizhou Wang ... · ICLR 2026
Cover songs constitute a vital aspect of musical culture, preserving the core melody of an original composition while reinterpreting it to infuse novel emotional depth and thematic emphasis. Although prior research has explored the reinterpretation of instrumental music through m...
Yue Pan, Xingyao Wang, Hanyue Zhang ... · arXiv
Remote monitoring of heart failure (HF) via speech signals provides a non-invasive and cost-effective solution for long-term patient management. However, substantial inter-individual heterogeneity in vocal characteristics often limits the accuracy of traditional cross-sectional c...
Yungang Yi · arXiv
Long-context modeling is essential for symbolic music generation, since motif repetition and developmental variation can span thousands of musical events. However, practical composition and performance workflows frequently rely on resource-limited devices (e.g., electronic instru...
Nghia Phan, Rong Jin, Gang Liu ... · arXiv
Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we prese...
Sunday, February 22, 2026
Qibing Bai, Shuhao Shi, Shuai Wang ... · ICASSP 2026
Accent normalization (AN) systems often struggle with unnatural outputs and undesired content distortion, stemming from both suboptimal training data and rigid duration modeling. In this paper, we propose a "source-synthesis" methodology for training data construction. By generat...