Audio ML Papers

Last 7 Days (February 07 - February 14, 2026)

Subcategories: All (24) | Speech Synthesis (8) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Assessment (0) | Enhancement (1) | ASR (0) | Other (11)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
#2 TOP PAPER (Score: 84)
Jingru Lin, Chen Zhang, Tianrui Wang ... · Audio-AAAI
Recent advances in Large Audio-Language Models (LALMs), which demonstrate remarkable performance across a range of sound-, speech-, and music-related tasks, have spurred growing interest in benchmarks to assess these models. Existing benchmarks generally focus onl...
#3 TOP PAPER (Score: 83)
Jiale Qian, Hao Meng, Tian Zheng ... · arXiv
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-S...
Thursday, February 12, 2026
Daiqing Wu, Xuan Zhang, Dongbao Yang ... · ICLR 2026
The maturation of Large Audio Language Models (LALMs) has raised growing expectations for them to comprehend complex audio much like humans. Current efforts primarily replicate text-based reasoning by contextualizing audio content through a one-time encoding, which introduces a c...
Xingyu Chen, Hanwen Bi, Fei Ma ... · arXiv
Accurate upsampling of Head-Related Transfer Functions (HRTFs) from sparse measurements is crucial for personalized spatial audio rendering. Traditional interpolation methods, such as kernel-based weighting or basis function expansions, rely on measurements from a single subject ...
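
For readers unfamiliar with the kernel-based weighting this abstract contrasts against, here is a minimal sketch (all data hypothetical; a Gaussian kernel over great-circle distance is one common choice, not necessarily this paper's):

    import numpy as np

    # Illustrative kernel-based HRTF interpolation; the directions and
    # responses below are random stand-ins, not real measurements.
    rng = np.random.default_rng(0)
    dirs = rng.normal(size=(64, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # sparse measurement directions
    hrtfs = rng.normal(size=(64, 128))                   # one response vector per direction

    def interpolate(query, dirs, hrtfs, beta=8.0):
        """Kernel-weighted average of measured responses around `query`."""
        query = query / np.linalg.norm(query)
        angle = np.arccos(np.clip(dirs @ query, -1.0, 1.0))  # great-circle distance
        w = np.exp(-beta * angle**2)                         # nearby directions dominate
        return (w[:, None] * hrtfs).sum(axis=0) / w.sum()

    est = interpolate(np.array([0.0, 1.0, 0.0]), dirs, hrtfs)  # response at a new direction

As the abstract notes, such schemes rely on measurements from a single subject, which is what motivates the learning-based upsampling studied here.
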
Yifan Liang, Andong Li, Kang Yang ... · arXiv
Although lip-to-speech synthesis (L2S) has achieved significant progress in recent years, current state-of-the-art methods typically rely on intermediate representations such as mel-spectrograms or discrete self-supervised learning (SSL) tokens. The potential of latent diffusion ...
Wednesday, February 11, 2026
Jingru Lin, Chen Zhang, Tianrui Wang ... · Audio-AAAI
Recent advances in Large Audio-Language Models (LALMs), which demonstrate remarkable performance across a range of sound-, speech-, and music-related tasks, have spurred growing interest in benchmarks to assess these models. Existing benchmarks generally focus onl...
Liyang Chen, Hongkai Chen, Yujun Cai ... · arXiv
Large Audio Language Models (LALMs) have demonstrated strong capabilities in audio understanding and reasoning. However, their performance on fine-grained auditory perception remains unreliable, and existing approaches largely rely on data-intensive training to internalize percep...
Yitian Gong, Kuangwei Chen, Zhaoye Fei ... · arXiv
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures....
Yuanxin Xia, Xinyan Li, Matteo Calafà ... · arXiv
Standardized laboratory characterizations for absorbing materials rely on idealized sound field assumptions, which deviate substantially from real-life conditions. Consequently, in-situ acoustic characterization has become essential for accurate diagnosis and virtual prototyping...
Tuesday, February 10, 2026
Heitor R. Guimarães, Abhishek Tiwari, Mahsa Abdollahi ... · arXiv
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio enco...
Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres ... · ICASSP 2026
Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential for greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, h...
Wenfu Wang, Chenxing Li, Liqiang Zhang ... · arXiv
In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-t...
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah ... · arXiv
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global emb...
Monday, February 09, 2026
Jackie Lin, Jiaqi Su, Nishit Anand ... · ICASSP 2026
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications ca...
Chengzhong Wang, Andong Li, Dingding Yao ... · arXiv
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of phase. To address this, we ...
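
The circular-topology point is easy to make concrete. In a flat feature space, two phases just either side of the ±π wrap look maximally distant, while a unit-circle embedding (one standard workaround, not necessarily the method proposed here) keeps them close:

    import numpy as np

    # Two nearly identical phase angles on opposite sides of the wrap point.
    p1, p2 = np.pi - 0.01, -np.pi + 0.01
    print(abs(p1 - p2))                      # ~6.26: huge gap as raw Euclidean features

    # Embedding phase on the unit circle restores the true closeness.
    e1 = np.array([np.cos(p1), np.sin(p1)])
    e2 = np.array([np.cos(p2), np.sin(p2)])
    print(np.linalg.norm(e1 - e2))           # ~0.02
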
Kohei Saijo, Yoshiaki Bando · IEEE Transactions on Audio, Speech, and Language Processing (TASLP)
Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as mu...
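
As a rough sketch of what a band-split module buys (the band boundaries below are invented, and real systems use many more bands): contiguous frequency bins are grouped into bands, and each band is projected to a fixed-size embedding, so the sequence the dual-path model processes scales with the number of bands rather than bins:

    import torch
    import torch.nn as nn

    # Hypothetical three-band split of a 513-bin spectrogram feature.
    bands = [(0, 64), (64, 192), (192, 513)]
    projs = nn.ModuleList(nn.Linear(hi - lo, 128) for lo, hi in bands)

    x = torch.randn(2, 513, 100)              # (batch, frequency bins, frames)
    out = torch.stack(
        [p(x[:, lo:hi, :].transpose(1, 2)) for (lo, hi), p in zip(bands, projs)],
        dim=1,
    )
    print(out.shape)                          # (2, 3, 100, 128): 3 bands instead of 513 bins
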
Yufan Wen, Zhaocheng Liu, YeGuo Hua ... · arXiv
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, above all, a pervasive semantic blindness to evolving narrative logic. To bridge these...
Yi Liu, Chuan-Che Jeff Huang, Xiao Quan · ICASSP 2026
Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored using audio-text joint embeddings, allowing users to enroll phrases with text, and proposed techniques to disambiguate similar utterance...
Luan Vinícius Fiorio, Ivana Nikoloska, Bruno Defraene ... · arXiv
Sound source tracking is commonly performed using classical array-processing algorithms, while machine-learning approaches typically rely on precise source position labels that are expensive or impractical to obtain. This paper introduces a physics-guided variational model capabl...
Haoshen Wang, Xueli Zhong, Bingbing Lin ... · arXiv
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speake...
Jiatao Chen, Xing Tang, Xiaoyue Duan ... · arXiv
While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framewor...
Sunday, February 08, 2026
Shaad Sufi · arXiv
Current audio formats present a fundamental trade-off between file size and functionality: lossless formats like FLAC preserve quality but lack adaptability, while lossy formats reduce size at the cost of fidelity and offer no stem-level access. We introduce the Stem-Native Codec ...
Jiale Qian, Hao Meng, Tian Zheng ... · arXiv
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-S...