Audio ML Papers

Last 7 Days (April 01 - April 08, 2026)

Subcategories: All (15) | Speech Synthesis (1) | Music Synthesis (0) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (10)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Jeremy Zhengqi Huang, Emani Hicks, Sidharth ... · arXiv
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an int...
#2 TOP PAPER (Score: 83)
Chengyou Wang, Hongfei Xue, Chunjiang He ... · arXiv
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches eith...
#3 TOP PAPER (Score: 83)
Xiaobin Rong, Yushi Wang, Zheng Wang ... · ICASSP 2026
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the wa...
Friday, April 03, 2026
Zhennan Lin, Shuai Wang, Zhaokai Sun ... · arXiv
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn...
Inbal Rimon, Oren Gal, Haim Permuter · arXiv
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer fram...
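The split-and-conquer idea can be illustrated with a generic sketch (not the paper's implementation): the utterance is cut into short overlapping windows and each window is scored independently, so a brief manipulated region cannot be diluted by utterance-level averaging. The window sizes and the `toy_score` classifier below are hypothetical stand-ins.

```python
import numpy as np

def split_and_score(wave, sr, win_s=0.64, hop_s=0.32, score_segment=None):
    """Score short overlapping windows so a brief spliced region
    cannot be averaged away by the surrounding bona fide audio."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    scores = []
    for start in range(0, max(1, len(wave) - win + 1), hop):
        seg = wave[start:start + win]
        scores.append((start / sr, score_segment(seg)))
    return scores  # list of (time_offset_s, fake_score)

# Toy stand-in classifier: high-frequency energy ratio as a fake "score".
def toy_score(seg):
    spec = np.abs(np.fft.rfft(seg))
    return float(spec[len(spec) // 2:].sum() / (spec.sum() + 1e-9))
```

A downstream detector would then flag any window whose score exceeds a threshold, localizing the manipulated region in time.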
Ziyu Luo, Lin Chen, Qiang Qu ... · arXiv
Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an i...
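For context, first-order ambisonics itself is a fixed four-channel encoding: a mono source at a given azimuth and elevation maps onto the W/Y/Z/X channels by standard spherical-harmonic weights. The sketch below uses the ACN channel ordering with SN3D normalization; the paper's video-to-FOA model is a learned system, this only shows the target format.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics
    (ACN channel order, SN3D normalization), angles in radians."""
    w = mono                                            # ACN 0: omnidirectional
    y = mono * np.sin(azimuth) * np.cos(elevation)      # ACN 1: left-right dipole
    z = mono * np.sin(elevation)                        # ACN 2: up-down dipole
    x = mono * np.cos(azimuth) * np.cos(elevation)      # ACN 3: front-back dipole
    return np.stack([w, y, z, x])                       # shape (4, T)
```

A source straight ahead (azimuth 0, elevation 0) therefore excites only W and X, which is a quick sanity check on any generated FOA track.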
FNU Sidharth, Meysam Asgari, Hao-Wen Dong ... · arXiv
Personalized or target speech extraction (TSE) typically needs a clean enrollment, which is hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the cont...
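The conditioning step such an approach implies can be sketched generically: a predicted speaker embedding modulates the mixture features, e.g. via FiLM-style scale and shift. All shapes, the FiLM choice, and the random weights below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_condition(features, speaker_emb, W_gamma, W_beta):
    """Feature-wise linear modulation: the predicted speaker embedding
    produces per-channel scale (gamma) and shift (beta) applied to the
    mixture features, steering the separator toward that speaker."""
    gamma = speaker_emb @ W_gamma   # (C,)
    beta = speaker_emb @ W_beta     # (C,)
    return features * gamma[None, :] + beta[None, :]

# Toy shapes: T=100 frames, C=64 feature channels, embedding dim D=32.
T, C, D = 100, 64, 32
features = rng.standard_normal((T, C))  # stand-in for mixture features
emb = rng.standard_normal(D)            # stand-in for a predicted per-speaker embedding
W_gamma = rng.standard_normal((D, C)) * 0.1
W_beta = rng.standard_normal((D, C)) * 0.1
conditioned = film_condition(features, emb, W_gamma, W_beta)
```

Running the same separator once per predicted embedding would then yield one extracted stream per detected speaker, with no enrollment audio required.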
Thursday, April 02, 2026
Chengyou Wang, Hongfei Xue, Chunjiang He ... · arXiv
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches eith...
Xiaobin Rong, Yushi Wang, Zheng Wang ... · ICASSP 2026
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the wa...
Yi Ma, Shuai Wang, Tianchi Liu ... · IEEE Transactions on Audio, Speech and Language Processing
Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonet...
Hongjun Liu, Rujun Han, Leyu Zhou ... · arXiv
Recent ECG-language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical...
Chihiro Arata, Kiyoshi Kurihara · arXiv
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We p...
Fuxiang Tao, Dongwei Li, Shuning Tang ... · arXiv
Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initiall...
Teng Liu, Yinfeng Yu · International Joint Conference on Neural Networks (IJCNN 2026)
Audio-Visual Navigation (AVN) requires an embodied agent to navigate toward a sound source by utilizing both vision and binaural audio. A core challenge arises in complex acoustic environments, where binaural cues become intermittently unreliable, particularly when generalizing t...
Wednesday, April 01, 2026
Vojtěch Staněk, Martin Perešíni, Lukáš Sekanina ... · WCCI CEC 2026
While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary mu...
Jeremy Zhengqi Huang, Emani Hicks, Sidharth ... · arXiv
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an int...
Xiquan Li, Xuenan Xu, Ziyang Ma ... · arXiv
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with...
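The clip-level objective that CLAP-style models are pretrained with is the standard symmetric InfoNCE loss over paired audio and text embeddings; matching pairs sit on the diagonal of the similarity matrix and are pushed above all mismatched pairs. A minimal NumPy version of that batch-level loss (not the paper's frame-level extension):

```python
import numpy as np

def clip_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (audio, text) embedding pairs."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature          # (B, B) scaled cosine similarities
    labels = np.arange(len(a))

    def ce(lg):                               # row-wise softmax cross-entropy
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (ce(logits) + ce(logits.T))
```

When paired embeddings agree (diagonal dominates) the loss approaches zero; shuffling the pairing drives it up, which is the signal the pretraining exploits.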
Awais Khan, Muhammad Umar Farooq, Kutub Uddin ... · arXiv
Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and m...