Audio ML Papers

Last 7 Days (April 06 - April 13, 2026)

Subcategories: All (23) | Speech Synthesis (5) | Music Synthesis (4) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (8)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Yassine El Kheir, Arnab Das, Yixuan Xiao ... · arXiv
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, w...
#2 TOP PAPER (Score: 84)
Kuang Yuan, Freddy Yifei Liu, Tong Xiao ... · arXiv
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the ...
#3 TOP PAPER (Score: 83)
Chen Su, Yuanhe Tian, Yan Song
Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality weights. However, they usually compress diverse...
Thursday, April 09, 2026
Yuankun Xie, Haonan Cheng, Jiayi Zhou ... · ACM Multimedia 2026
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content produ...
Xiaosu Su, Zihan Sun, Peilei Jia ...
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance gene...
Gabriel Dubus, Théau d'Audiffret, Claire Auger ...
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce Deep...
Matthew Maciejewski, Samuele Cornell
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is...
Jing Peng, Chenghao Wang, Yi Yang ...
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcript...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and over...
Linge Wang, Yingying Chen, Bingke Zhu ... · arXiv
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches desi...
Wednesday, April 08, 2026
Yuxuan Wang, Peize He, Xiyan Gui ...
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in th...
Zikai Liu, Ziqian Wang, Xingchen Li ...
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extr...
Ya Zhao, Yinfeng Yu, Liejun Wang
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, hindering low-resource languages from reaching high-resource p...
Tornike Karchkhadze, Shlomo Dubnov
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a Max/MSP front end, handling real-time audio input, buffering, and playb...
Ameenudeen P E, Charumathi Narayanan, Sriram Ganapathy
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the o...
Nursadul Mamun, John H. L. Hansen · 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recen...
Tuesday, April 07, 2026
Jia-Hong Huang, Seulgi Kim, Yi Chieh Liu ...
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of syntheti...
Zhetao Hu, Yiquan Zhou, Wenyu Wang ...
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leaka...
Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao ...
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks hav...
Boyu Cao, Lekai Qian, Dehan Li ... · ACL 2026 Findings
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the inherent severe er...
Monday, April 06, 2026
Jia Li, Yinfeng Yu · International Joint Conference on Neural Networks (IJCNN 2026)
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training ...
Guan-Ting Lin, Chen Chen, Zhehuai Chen ... · arXiv
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with s...
Weiguo Pian, Saksham Singh Kushwaha, Zhimin Chen ... · arXiv
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-cond...