Audio ML Papers

Last 7 Days (April 07 - April 14, 2026)

Subcategories: All (35) | Speech Synthesis (6) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (3) | ASR (6) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Yassine El Kheir, Arnab Das, Yixuan Xiao ... · arXiv
Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, w...
#2 TOP PAPER (Score: 84)
Kuang Yuan, Freddy Yifei Liu, Tong Xiao ... · arXiv
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the ...
#3 TOP PAPER (Score: 84)
Yingjie Yu, Mingyuan Wu, Ahmadreza Eslaminia ... · arXiv
QoS-QoE translation is a fundamental problem in multimedia systems because it characterizes how measurable system and network conditions affect user-perceived experience. Although many prior studies have examined this relationship, their findings are often developed for specific ...
Friday, April 10, 2026
Chunhao Bi, Houqiang Zhong, Zhixin Xu ... · arXiv
Spatial audio is fundamental to immersive virtual experiences, yet synthesizing high-fidelity binaural audio from sparse observations remains a significant challenge. Existing methods typically rely on implicit neural representations conditioned on visual priors, which often stru...
Mintong Kang, Chen Fang, Bo Li · arXiv
Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just "unsafe text spoken aloud": real-world risks can hinge on audio-native harmful sound events, speaker attr...
Wataru Nakata, Yuki Saito, Kazuki Yamauchi ... · arXiv
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuita...
Pengbo Lyu, Xiangyu Zhao, Chengwei Liu ... · arXiv
We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Con...
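The reformulation described above is easy to picture in code. Below is a minimal sketch, assuming a neural-codec tokenizer and a decoder that generates one stem's codec tokens conditioned on the mixture's tokens; all module names and dimensions are illustrative, not the paper's implementation.

```python
# Illustrative sketch: music source separation as conditional discrete
# token generation. A neural codec (not shown) tokenizes audio; a
# transformer decoder autoregressively predicts the target stem's tokens
# while attending to the mixture's tokens as conditioning memory.
import torch
import torch.nn as nn

class StemTokenGenerator(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, mixture_tokens, stem_tokens):
        # mixture_tokens: (B, T_mix) codec ids of the mixture (conditioning)
        # stem_tokens:    (B, T_stem) codec ids of the stem (teacher forcing)
        memory = self.embed(mixture_tokens)
        tgt = self.embed(stem_tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(h)  # (B, T_stem, vocab) logits over codec tokens

model = StemTokenGenerator()
mix = torch.randint(0, 1024, (2, 200))
stem = torch.randint(0, 1024, (2, 200))
logits = model(mix, stem)  # train with cross-entropy against stem tokens
```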
Yunqiang Wang, Hengyuan Na, Di Wu ... · arXiv
Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcripti...
Fei Liu, Yang Ai, Hui-Peng Du ... · arXiv
Audio super-resolution aims to recover missing high-frequency details from bandwidth-limited low-resolution audio, thereby improving the naturalness and perceptual quality of the reconstructed signal. However, most existing methods directly operate in the waveform or time-frequen...
Qixuan Huang, Khalid Zaman, Masashi Unoki · arXiv
Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classificati...
Ziwei Li, Lukuang Dong, Saierdaer Yusuyin ... · arXiv
Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned projector that maps encoder features into the LLM embedding space, whereas an a...
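As a concrete picture of the projector interface mentioned here, a minimal sketch follows: adjacent encoder frames are stacked to shorten the sequence, then mapped into the LLM embedding width with a small MLP. The stacking factor and dimensions are assumptions for illustration, not the paper's settings.

```python
# Minimal speech-to-LLM projector sketch: downsample encoder frames by
# stacking, then project into the LLM embedding space. The resulting
# "pseudo-token" embeddings would be prepended to the text embeddings.
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats):
        # feats: (B, T, enc_dim) outputs of a pretrained speech encoder
        B, T, D = feats.shape
        T = T - T % self.stack                        # drop ragged tail frames
        x = feats[:, :T].reshape(B, T // self.stack, D * self.stack)
        return self.proj(x)                           # (B, T // stack, llm_dim)

proj = SpeechProjector()
pseudo_tokens = proj(torch.randn(2, 98, 1024))        # shape (2, 24, 4096)
```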
Jian Zhu, Jianwei Cui, Shihao Chen ... · arXiv
We present AccompGen, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, AccompGen produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innova...
Aditya Narayan Sankaran, Reza Farahbakhsh, Noel Crespi · arXiv
Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classific...
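The cascade this abstract contrasts against (ASR, then text classification) can be sketched in a few lines with Hugging Face pipelines; the classifier checkpoint below is a placeholder, not a model used in the paper.

```python
# Two-stage baseline: transcribe speech, then classify the transcript.
# "openai/whisper-small" is a real ASR checkpoint; the classifier name
# is a hypothetical placeholder.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
clf = pipeline("text-classification", model="some-org/abusive-speech-classifier")

def classify_clip(path: str):
    text = asr(path)["text"]      # step 1: speech -> text
    return text, clf(text)[0]     # step 2: text -> abuse label + score

print(classify_clip("clip.wav"))
```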
Zihe Wei, Yuezun Li · arXiv
Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. W...
Thursday, April 09, 2026
Yuankun Xie, Haonan Cheng, Jiayi Zhou ... · ACM Multimedia 2026
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content produ...
Xiaosu Su, Zihan Sun, Peilei Jia ...
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance gene...
Chanhyuk Choi, Taesoo Kim, Donggyu Lee ... · arXiv
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive fl...
Gabriel Dubus, Théau d'Audiffret, Claire Auger ...
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce Deep...
Zhicheng Ouyang, Seong-Gyun Leem, Bach Viet Do ... · arXiv
Conversational AI has made significant progress, yet generating expressive and controllable text-to-speech (TTS) remains challenging. Specifically, controlling fine-grained voice styles and emotions is notoriously difficult and typically requires massive amounts of heavily annota...
Matthew Maciejewski, Samuele Cornell
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is...
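For context, the "fully-synthetic mixtures" in question are typically built by summing clean utterances and noise at random SNRs; a toy generator (the scaling convention and SNR ranges are assumptions) is shown below.

```python
# Toy synthetic-mixture generator: overlap two utterances, then add noise,
# each at a randomly drawn SNR.
import numpy as np

def mix_at_snr(signal, interference, snr_db):
    # scale interference so the signal-to-interference power ratio is snr_db
    p_sig = np.mean(signal ** 2)
    p_int = np.mean(interference ** 2) + 1e-12
    gain = np.sqrt(p_sig / (p_int * 10 ** (snr_db / 10)))
    return signal + gain * interference

def make_mixture(spk1, spk2, noise, rng):
    mix = mix_at_snr(spk1, spk2, rng.uniform(-5, 5))   # speaker overlap
    return mix_at_snr(mix, noise, rng.uniform(0, 15))  # additive noise

rng = np.random.default_rng(0)
spk1, spk2, noise = (rng.standard_normal(16000) for _ in range(3))
mixture = make_mixture(spk1, spk2, noise, rng)
```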
Jing Peng, Chenghao Wang, Yi Yang ...
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcript...
Yuan Xie, Jiaqi Song, Guang Qiu ... · arXiv
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and over...
Linge Wang, Yingying Chen, Bingke Zhu ... · arXiv
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches desi...
Hanif Rahman · arXiv
Word error rate (WER) is the dominant metric for automatic speech recognition, yet it cannot detect a systematic failure mode: models that produce fluent output in the wrong writing system. We define Script Fidelity Rate (SFR), the fraction of hypothesis characters in the target ...
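The metric as defined in the abstract is simple to compute. A toy implementation follows, using Unicode character names as a crude stand-in for a proper script-property lookup; treat it as a sketch of the definition, not the paper's reference code.

```python
# Script Fidelity Rate: fraction of alphabetic hypothesis characters that
# belong to the target writing system.
import unicodedata

def script_fidelity_rate(hypothesis: str, target_script: str) -> float:
    letters = [c for c in hypothesis if c.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(
        target_script.upper() in unicodedata.name(c, "") for c in letters
    )
    return in_script / len(letters)

print(script_fidelity_rate("Привет world", "cyrillic"))  # ~0.55 (6 of 11)
```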
Wednesday, April 08, 2026
Yuxuan Wang, Peize He, Xiyan Gui ...
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in th...
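As background on what KV cache compression involves, here is a generic score-based eviction sketch (an illustration of the problem setting, not this paper's method): keep only the cached positions that received the most attention mass.

```python
# Generic KV cache eviction: rank cached positions by accumulated
# attention weight and keep the top k, preserving temporal order.
import torch

def evict_kv(keys, values, attn_weights, k):
    # keys/values: (heads, T, d); attn_weights: (heads, Q, T) recent attention
    scores = attn_weights.sum(dim=(0, 1))                 # mass per position
    keep = scores.topk(min(k, scores.numel())).indices.sort().values
    return keys[:, keep], values[:, keep]

K, V = torch.randn(8, 1000, 64), torch.randn(8, 1000, 64)
A = torch.rand(8, 16, 1000)
K_small, V_small = evict_kv(K, V, A, k=256)               # (8, 256, 64) each
```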
Zikai Liu, Ziqian Wang, Xingchen Li ...
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extr...
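To make the enrollment-guided setup concrete, a minimal sketch follows: the enrollment utterance is encoded into a speaker embedding that conditions a mask estimator on the mixture via feature-wise scaling (one common conditioning choice; the architecture is illustrative, not the paper's).

```python
# Tiny target-speaker-extraction sketch: a GRU speaker encoder produces an
# embedding from the enrollment; FiLM-style scale/shift conditions a GRU
# mask estimator running on the mixture spectrogram.
import torch
import torch.nn as nn

class TinyTSE(nn.Module):
    def __init__(self, n_bins=257, spk_dim=128, hidden=256):
        super().__init__()
        self.spk_enc = nn.GRU(n_bins, spk_dim, batch_first=True)
        self.mix_enc = nn.GRU(n_bins, hidden, batch_first=True)
        self.film = nn.Linear(spk_dim, 2 * hidden)    # scale and shift
        self.mask = nn.Linear(hidden, n_bins)

    def forward(self, mix_spec, enroll_spec):
        # mix_spec, enroll_spec: (B, T, n_bins) magnitude spectrograms
        _, spk = self.spk_enc(enroll_spec)            # final state (1, B, spk_dim)
        h, _ = self.mix_enc(mix_spec)                 # (B, T, hidden)
        scale, shift = self.film(spk[-1]).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)
        return torch.sigmoid(self.mask(h))            # target-speaker mask

model = TinyTSE()
mask = model(torch.randn(2, 100, 257), torch.randn(2, 50, 257))  # (2, 100, 257)
```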
Ya Zhao, Yinfeng Yu, Liejun Wang
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, hindering low-resource languages from reaching high-resource p...
Tornike Karchkhadze, Shlomo Dubnov
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end, which handles real-time audio input, buffering, and playb...
Ameenudeen P E, Charumathi Narayanan, Sriram Ganapathy
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the o...
Nursadul Mamun, John H. L. Hansen · 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recen...
Tuesday, April 07, 2026
Jia-Hong Huang, Seulgi Kim, Yi Chieh Liu ...
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of syntheti...
Zhetao Hu, Yiquan Zhou, Wenyu Wang ...
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leaka...
Chen Su, Yuanhe Tian, Yan Song
Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality weights. However, they usually compress diverse...
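The modality-weighting idea this abstract builds on has a common minimal form: a learned gate that softmax-weights per-modality features before fusing them. The sketch below illustrates that baseline idea only; it is not the paper's model.

```python
# Gated multimodal fusion: learn per-sample weights over text, audio, and
# visual features, then take the weighted sum.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim=256, n_mod=3):
        super().__init__()
        self.gate = nn.Linear(dim * n_mod, n_mod)

    def forward(self, text, audio, visual):
        # each input: (B, dim) pooled features for one modality
        mods = torch.stack([text, audio, visual], dim=1)       # (B, 3, dim)
        w = torch.softmax(self.gate(mods.flatten(1)), dim=-1)  # (B, 3)
        return (w.unsqueeze(-1) * mods).sum(dim=1)             # (B, dim)

fuse = GatedFusion()
z = fuse(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 256))
```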
Aravinda Reddy PN, Raghavendra Ramachandra, K. Sreenivasa Rao ...
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks hav...
Boyu Cao, Lekai Qian, Dehan Li ... · ACL 2026 Findings
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the inherent severe er...