Audio ML Papers

Last 7 Days (April 02 - April 09, 2026)

Subcategories: All (23) | Speech Synthesis (4) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (1) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 85)
Xudong Lu, Yang Bo, Jinpeng Chen ... · arXiv
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs ...
#2 TOP PAPER (Score: 83)
Chengyou Wang, Hongfei Xue, Chunjiang He ... · arXiv
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches eith...
#3 TOP PAPER (Score: 83)
Xiaobin Rong, Yushi Wang, Zheng Wang ... · ICASSP 2026
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the wa...
Tuesday, April 07, 2026
Zhetao Hu, Yiquan Zhou, Wenyu Wang ... · arXiv
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leaka...
Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao ... · arXiv
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks hav...
Boyu Cao, Lekai Qian, Dehan Li ... · ACL 2026 Findings
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the inherent severe er...
Monday, April 06, 2026
Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang ... · arXiv
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Co...
Jia Li, Yinfeng Yu · International Joint Conference on Neural Networks (IJCNN 2026)
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training ...
Guan-Ting Lin, Chen Chen, Zhehuai Chen ... · arXiv
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with s...
Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen ... · arXiv
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-cond...
Sunday, April 05, 2026
Xudong Lu, Yang Bo, Jinpeng Chen ... · arXiv
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs ...
Tianhua Qi, Wenming Zheng, Björn W. Schuller ... · arXiv
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have s...
Donghuo Zeng, Hao Niu, Masato Taya · IEEE ICME 2026
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences arise. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoen...
Saturday, April 04, 2026
Lo-Ya Li, Tien-Hong Lo, Jeih-Weih Hung ... · Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026
User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems are still faced with arduous challenges, including constrained computatio...
Friday, April 03, 2026
Zhennan Lin, Shuai Wang, Zhaokai Sun ... · arXiv
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn...
Inbal Rimon, Oren Gal, Haim Permuter · arXiv
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer fram...
Ziyu Luo, Lin Chen, Qiang Qu ... · arXiv
Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an i...
FNU Sidharth, Meysam Asgari, Hao-Wen Dong ... · arXiv
Personalized or target speech extraction (TSE) typically needs a clean enrollment, which is hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the cont...
Xunyi Jiang, Mingyang Yao, Jingyue Huang ... · arXiv
Symbolic music generation has made significant progress, yet achieving fine-grained and flexible control over composer style remains challenging. Existing training-based methods for composer style conditioning depend on large labeled datasets. Besides, these methods typically sup...
Thursday, April 02, 2026
Chengyou Wang, Hongfei Xue, Chunjiang He ... · arXiv
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches eith...
Xiaobin Rong, Yushi Wang, Zheng Wang ... · ICASSP 2026
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the wa...
Yi Ma, Shuai Wang, Tianchi Liu ... · IEEE Transactions on Audio, Speech and Language Processing
Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonet...
Hongjun Liu, Rujun Han, Leyu Zhou ... · arXiv
Recent ECG-language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical...
Chihiro Arata, Kiyoshi Kurihara · arXiv
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We p...
Fuxiang Tao, Dongwei Li, Shuning Tang ... · arXiv
Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initiall...
Teng Liu, Yinfeng Yu · International Joint Conference on Neural Networks (IJCNN 2026)
Audio-Visual Navigation (AVN) requires an embodied agent to navigate toward a sound source by utilizing both vision and binaural audio. A core challenge arises in complex acoustic environments, where binaural cues become intermittently unreliable, particularly when generalizing t...