Audio ML Papers

Last 7 Days (April 04 - April 11, 2026)

Subcategories: All (22) | Speech Synthesis (7) | Music Synthesis (4) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (1) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (7)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Yassine El Kheir, Arnab Das, Yixuan Xiao ... · arXiv
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, w...
#2 TOP PAPER (Score: 85)
Xudong Lu, Yang Bo, Jinpeng Chen ... · arXiv
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs ...
#3 TOP PAPER (Score: 83)
Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang ... · arXiv
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Co...
Thursday, April 09, 2026
Yassine El Kheir, Arnab Das, Yixuan Xiao ... · arXiv
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, w...
Yuankun Xie, Haonan Cheng, Jiayi Zhou ... · ACM Multimedia 2026 Grand Challenge
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content produ...
Xiaosu Su, Zihan Sun, Peilei Jia ... · arXiv
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance gene...
Gabriel Dubus, Théau d'Audiffret, Claire Auger ... · arXiv
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce Deep...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and over...
Jing Peng, Chenghao Wang, Yi Yang ... · arXiv
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcript...
Linge Wang, Yingying Chen, Bingke Zhu ... · arXiv
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches desi...
Wednesday, April 08, 2026
Yuxuan Wang, Peize He, Xiyan Gui ... · arXiv
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in th...
Tornike Karchkhadze, Shlomo Dubnov · arXiv
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end handling real-time audio input, buffering, and playb...
Nursadul Mamun, John H. L. Hansen · 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recen...
Tuesday, April 07, 2026
Jia-Hong Huang, Seulgi Kim, Yi Chieh Liu ... · IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of syntheti...
Zhetao Hu, Yiquan Zhou, Wenyu Wang ... · arXiv
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leaka...
Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao ... · arXiv
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks hav...
Boyu Cao, Lekai Qian, Dehan Li ... · ACL 2026 Findings
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the inherent severe er...
Monday, April 06, 2026
Xuanjun Chen, Chia-Yu Hu, Sung-Feng Huang ... · arXiv
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Co...
Jia Li, Yinfeng Yu · International Joint Conference on Neural Networks (IJCNN 2026)
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training ...
Guan-Ting Lin, Chen Chen, Zhehuai Chen ... · arXiv
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with s...
Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen ... · arXiv
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-cond...
Sunday, April 05, 2026
Xudong Lu, Yang Bo, Jinpeng Chen ... · arXiv
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs ...
Tianhua Qi, Wenming Zheng, Björn W. Schuller ... · arXiv
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have s...
Donghuo Zeng, Hao Niu, Masato Taya · IEEE ICME 2026
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoen...
Saturday, April 04, 2026
Lo-Ya Li, Tien-Hong Lo, Jeih-Weih Hung ... · Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026
User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems are still faced with arduous challenges, including constrained computatio...