Audio ML Papers

Last 7 Days (February 05 - February 12, 2026)

Subcategories: All (22) | Speech Synthesis (6) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (1) | ASR (1) | Other (10)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 84)
Georg Heigold, Ehsan Variani, Tom Bagby ... · arXiv
Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction...
#2 TOP PAPER (Score: 84)
Videet Mehta, Liming Wang, Hilde Kuehne ... · arXiv
Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classific...
#3 TOP PAPER (Score: 83)
Chunyat Wu, Jiajun Deng, Zhengxi Liu ... · ICASSP 2026
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of...
Tuesday, February 10, 2026
Heitor R. Guimarães, Abhishek Tiwari, Mahsa Abdollahi ... · arXiv
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio enco...
Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres ... · ICASSP 2026
Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, h...
Wenfu Wang, Chenxing Li, Liqiang Zhang ... · arXiv
In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-t...
Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah ... · arXiv
Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global emb...
Monday, February 09, 2026
Jackie Lin, Jiaqi Su, Nishit Anand ... · ICASSP 2026
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications ca...
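The payoff of RIR estimation, blind or otherwise, is that acoustic transfer reduces to a convolution: once an impulse response h is in hand, any dry signal can be rendered in that room. A minimal SciPy sketch of the transfer step, using a toy synthetic RIR rather than a blind estimate; the blind setting above is the much harder inverse problem of recovering h from reverberant audio alone:

    import numpy as np
    from scipy.signal import fftconvolve

    sr = 16000
    rng = np.random.default_rng(0)

    # Toy RIR: direct-path impulse plus an exponentially decaying noise tail.
    h = np.zeros(sr // 2)
    h[0] = 1.0
    h += 0.3 * rng.standard_normal(len(h)) * np.exp(-np.arange(len(h)) / (0.15 * sr))

    dry = rng.standard_normal(sr)     # stand-in for 1 s of dry speech
    wet = fftconvolve(dry, h)         # the same signal "played in the room"
    print(dry.shape, wet.shape)       # (16000,) (23999,)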
Chengzhong Wang, Andong Li, Dingding Yao ... · arXiv
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of phase. To address this, we ...
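The topology problem named here is easy to see in isolation: treated as a raw angle, phase has a spurious discontinuity at ±π, whereas the same values mapped onto the unit circle (cos/sin pairs) stay close. A small NumPy illustration of that general point, not of this paper's network:

    import numpy as np

    # Two nearly identical phases sitting on opposite sides of the +/-pi wrap.
    phi_a, phi_b = np.pi - 0.01, -np.pi + 0.01

    # Flat Euclidean view: the raw angles look maximally far apart ...
    print(abs(phi_a - phi_b))              # ~6.26, a spurious jump

    # ... but as points on the unit circle they are neighbors.
    za = np.array([np.cos(phi_a), np.sin(phi_a)])
    zb = np.array([np.cos(phi_b), np.sin(phi_b)])
    print(np.linalg.norm(za - zb))         # ~0.02, the true proximity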
Kohei Saijo, Yoshiaki Bando · IEEE Transactions on Audio, Speech, and Language Processing (TASLP)
Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as mu...
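Band-splitting itself is simple bookkeeping: rather than modeling every STFT bin, contiguous bins are grouped into bands and each band is projected to a common feature size, shrinking the frequency axis the dual-path model must traverse. A minimal PyTorch sketch of that grouping (band widths here are illustrative, not taken from the paper):

    import torch
    import torch.nn as nn

    class BandSplit(nn.Module):
        # Groups STFT bins into bands, projecting each band to dim features.
        def __init__(self, band_widths, dim=64):
            super().__init__()
            self.band_widths = band_widths
            # One linear layer per band: (2 * width) real/imag values -> dim.
            self.proj = nn.ModuleList(nn.Linear(2 * w, dim) for w in band_widths)

        def forward(self, spec):              # spec: (batch, frames, bins), complex
            feats, start = [], 0
            for w, proj in zip(self.band_widths, self.proj):
                band = spec[..., start:start + w]
                x = torch.cat([band.real, band.imag], dim=-1)
                feats.append(proj(x))         # (batch, frames, dim)
                start += w
            return torch.stack(feats, dim=2)  # (batch, frames, n_bands, dim)

    spec = torch.randn(1, 100, 1025, dtype=torch.complex64)  # e.g. 2048-pt FFT
    bs = BandSplit([64] * 8 + [128] * 2 + [257])              # widths sum to 1025
    print(bs(spec).shape)                                     # (1, 100, 11, 64)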
Yufan Wen, Zhaocheng Liu, YeGuo Hua ... · arXiv
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these...
Haoshen Wang, Xueli Zhong, Bingbing Lin ... · arXiv
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speake...
Jiatao Chen, Xing Tang, Xiaoyue Duan ... · arXiv
While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framewor...
Yi Liu, Chuan-Che Jeff Huang, Xiao Quan · ICASSP 2026
Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored using audio-text joint embeddings, allowing users to enroll phrases with text, and proposed techniques to disambiguate similar utterance...
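With a joint audio-text embedding space, open-vocabulary enrollment reduces to nearest-neighbor search: each typed phrase is embedded once, and incoming audio windows are matched by cosine similarity. A toy sketch of that matching loop, with random projections standing in for real pretrained encoders (all names here are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)

    # Stand-ins for real pretrained encoders (e.g., a CLAP-style model) that
    # map text and audio into one shared space; here, random vectors.
    def text_encoder(keyword):
        return rng.standard_normal(128)

    def audio_encoder(window):
        return rng.standard_normal(128)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Enrollment: embed each typed keyword once.
    enrolled = {kw: text_encoder(kw) for kw in ("turn on the lights", "stop music")}

    def spot(audio_window, threshold=0.7):
        # Match an incoming audio window against all enrolled keywords.
        e = audio_encoder(audio_window)
        kw, score = max(((k, cosine(e, v)) for k, v in enrolled.items()),
                        key=lambda kv: kv[1])
        return (kw, score) if score >= threshold else (None, score)

    print(spot(np.zeros(16000)))   # one second of silence at 16 kHz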
Sunday, February 08, 2026
Shaad Sufi · arXiv
Current audio formats present a fundamental trade-off between file size and functionality: lossless formats like FLAC preserve quality but lack adaptability, while lossy formats reduce size at the cost of fidelity and offer no stem-level access. We introduce the Stem-Native Codec ...
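The size side of that trade-off is easy to make concrete with per-minute figures for CD-quality stereo (typical numbers, not the paper's measurements), and note that none of these formats carries separable stems:

    # Per-minute sizes for 44.1 kHz / 16-bit stereo audio.
    pcm_bps  = 44_100 * 16 * 2          # raw PCM: 1,411,200 bps
    flac_bps = int(pcm_bps * 0.6)       # FLAC commonly reaches ~60% of PCM
    mp3_bps  = 320_000                  # high-bitrate lossy

    for name, bps in [("PCM", pcm_bps), ("FLAC", flac_bps), ("MP3", mp3_bps)]:
        print(f"{name}: {bps * 60 / 8 / 1e6:.1f} MB/min")
    # PCM: 10.6 MB/min, FLAC: 6.4 MB/min, MP3: 2.4 MB/min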
Jiale Qian, Hao Meng, Tian Zheng ... · arXiv
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-S...
Friday, February 06, 2026
Videet Mehta, Liming Wang, Hilde Kuehne ... · arXiv
Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classific...
Georg Heigold, Ehsan Variani, Tom Bagby ... · arXiv
Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction...
Yuancheng Wang, Zhenyu Tang, Yun Wang ... · arXiv
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose...
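The two rates named here multiply directly: bit rate = token rate × number of codebooks × log2(codebook size). Back-of-envelope figures under illustrative settings (not the paper's numbers):

    import math

    def bitrate_bps(token_rate_hz, n_codebooks, codebook_size):
        # bits/s = frames/s * codebooks * bits per code
        return token_rate_hz * n_codebooks * math.log2(codebook_size)

    # An acoustic-detail codec: 75 Hz frames, 8 codebooks of 1024 entries.
    print(bitrate_bps(75, 8, 1024))   # 6000.0 bps
    # A low-rate semantic tokenizer: 25 Hz, one codebook of 4096 entries.
    print(bitrate_bps(25, 1, 4096))   # 300.0 bps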
Ziyu Luo, Lin Chen, Qiang Qu ... · arXiv
Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the d...
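First-order ambisonics is a four-channel format (W, Y, Z, X in ACN channel order) whose channels encode direction, which is what ties it to 360-degree video. The forward direction-to-FOA mapping is a fixed trigonometric projection, sketched below assuming SN3D (AmbiX-style) normalization; the task above is the far harder inverse of inferring such audio from video:

    import numpy as np

    def encode_foa(mono, azimuth, elevation):
        # First-order ambisonic encoding of a mono source (ACN/SN3D).
        w = mono                                          # omnidirectional
        y = mono * np.sin(azimuth) * np.cos(elevation)    # left-right
        z = mono * np.sin(elevation)                      # up-down
        x = mono * np.cos(azimuth) * np.cos(elevation)    # front-back
        return np.stack([w, y, z, x])                     # ACN order: W, Y, Z, X

    s = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
    foa = encode_foa(s, azimuth=np.pi / 2, elevation=0.0)    # source hard left
    print(foa.shape)                                          # (4, 16000)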
Hugo Seuté, Pranai Vasudev, Etienne Richan ... · Pre-print (in review at a conference)
Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses o...
Thursday, February 05, 2026
Chunyat Wu, Jiajun Deng, Zhengxi Liu ... · ICASSP 2026
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of...
Kaiyuan Zhang, Mohan Shi, Eray Eren ... · arXiv
Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation...
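The workhorse quantizer inside most neural audio codecs is residual vector quantization (RVQ): each stage quantizes the error left by the previous stages, and hybrid designs typically attach semantic supervision to the earliest stage. A bare-bones RVQ encode step with random stand-in codebooks (learned codebooks are what make each stage actually shrink the error):

    import numpy as np

    rng = np.random.default_rng(0)

    # 4 stages, each a 256-entry codebook over 32-dim frames (toy sizes;
    # real codecs learn these jointly with the encoder/decoder).
    codebooks = [rng.standard_normal((256, 32)) for _ in range(4)]

    def rvq_encode(frame):
        # Each stage codes the residual the previous stages left behind.
        codes, residual = [], frame.copy()
        for cb in codebooks:
            idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
            codes.append(idx)
            residual = residual - cb[idx]
        return codes, residual            # token ids + remaining error

    codes, err = rvq_encode(rng.standard_normal(32))
    print(codes, round(float(np.linalg.norm(err)), 2))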
Qing Wen, Haohao Li, Zhongjie Ba ... · arXiv
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise rel...
Haoqin Sun, Chenyang Lyu, Shiwan Zhao ... · arXiv
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprints req...
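That bottleneck is easy to quantify: audio tokenized at typical codec rates makes hour-scale inputs very long, and a transformer's KV cache grows linearly with that length. Illustrative arithmetic under an assumed 7B-class geometry (none of these figures are from the paper):

    # KV-cache footprint for one hour of audio in a decoder-only model (fp16).
    token_rate = 50                          # audio tokens per second (assumed)
    seconds = 60 * 60                        # one hour
    layers, heads, head_dim = 32, 32, 128    # assumed model geometry
    bytes_per_value = 2                      # fp16

    tokens = token_rate * seconds                                    # 180,000
    kv_per_token = 2 * layers * heads * head_dim * bytes_per_value   # K and V
    print(tokens, round(tokens * kv_per_token / 2**30, 1), "GiB")    # ~87.9 GiB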