Audio ML Papers

Last 7 Days (April 03 - April 10, 2026)

Subcategories: All (19) | Speech Synthesis (4) | Music Synthesis (4) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (0) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (9)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 85)
Xudong Lu, Yang Bo, Jinpeng Chen ... · arXiv
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs ...
#2 TOP PAPER (Score: 83)
Zhennan Lin, Shuai Wang, Zhaokai Sun ... · arXiv
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn...
#3 TOP PAPER (Score: 83)
Inbal Rimon, Oren Gal, Haim Permuter · arXiv
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer fram...
Wednesday, April 08, 2026
Yuxuan Wang, Peize He, Xiyan Gui ... · arXiv
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in th...
Nursadul Mamun, John H. L. Hansen · 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recen...
Tuesday, April 07, 2026
Jia-Hong Huang, Seulgi Kim, Yi Chieh Liu ... · IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of syntheti...
Zhetao Hu, Yiquan Zhou, Wenyu Wang ... · arXiv
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025), a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leaka...
Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao ... · arXiv
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks hav...
Boyu Cao, Lekai Qian, Dehan Li ... · ACL 2026 Findings
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the inherent severe er...
Monday, April 06, 2026
Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang ... · arXiv
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Co...
Jia Li, Yinfeng Yu · International Joint Conference on Neural Networks (IJCNN 2026)
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training ...
Guan-Ting Lin, Chen Chen, Zhehuai Chen ... · arXiv
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with s...
Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen ... · arXiv
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-cond...
Sunday, April 05, 2026
Xudong Lu, Yang Bo, Jinpeng Chen ... · arXiv
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs ...
Tianhua Qi, Wenming Zheng, Björn W. Schuller ... · arXiv
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have s...
Donghuo Zeng, Hao Niu, Masato Taya · IEEE ICME 2026
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoen...
Saturday, April 04, 2026
Lo-Ya Li, Tien-Hong Lo, Jeih-Weih Hung ... · Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026
User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems are still faced with arduous challenges, including constrained computatio...
Friday, April 03, 2026
Zhennan Lin, Shuai Wang, Zhaokai Sun ... · arXiv
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn...
Inbal Rimon, Oren Gal, Haim Permuter · arXiv
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer fram...
Ziyu Luo, Lin Chen, Qiang Qu ... · arXiv
Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an i...
FNU Sidharth, Meysam Asgari, Hao-Wen Dong ... · arXiv
Personalized or target speech extraction (TSE) typically needs a clean enrollment -- hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the cont...
Xunyi Jiang, Mingyang Yao, Jingyue Huang ... · arXiv
Symbolic music generation has made significant progress, yet achieving fine-grained and flexible control over composer style remains challenging. Existing training-based methods for composer style conditioning depend on large labeled datasets. Besides, these methods typically sup...