Audio ML Papers

Week of December 28, 2025 - January 04, 2026

Subcategories: All (9) | Speech Synthesis (0) | Music Synthesis (2) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (0) | ASR (3) | Other (4)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 83)
Yuan-Kuei Wu, Yang Liu, Yiteng Huang ... · arXiv
Spoken Language Models (SLMs) are increasingly central to modern speech-driven applications, but performance degrades under acoustic shift - real-world noise, reverberation, and microphone variation. Prior solutions rely on offline domain adaptation, which is post-hoc, data-inten...
#2 TOP PAPER (Score: 83)
Yanxi Chen, Wenhui Zhu, Xiwen Chen ... · arXiv
Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False ...
#3 TOP PAPER (Score: 83)
Tianxin Xie, Wentao Lei, Guanjie Huang ... · arXiv
Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, prim...
Saturday, January 03, 2026
Jiajie Zhu, Xia Du, Xiaoyuan Liu ... · arXiv
The rapid advancements in artificial intelligence have significantly accelerated the adoption of speech recognition technology, leading to its widespread integration across various applications. However, this surge in usage also highlights a critical issue: audio data is highly v...
Thursday, January 01, 2026
Zhuoran Zhuang, Ye Chen, Chao Luo ... · arXiv
End-to-end automatic speech recognition has become the dominant paradigm in both academia and industry. To enhance recognition performance, the Weighted Finite-State Transducer (WFST) is widely adopted to integrate acoustic and language models through static graph composition, pr...
Wednesday, December 31, 2025
Yuan-Kuei Wu, Yang Liu, Yiteng Huang ... · arXiv
Spoken Language Models (SLMs) are increasingly central to modern speech-driven applications, but performance degrades under acoustic shift - real-world noise, reverberation, and microphone variation. Prior solutions rely on offline domain adaptation, which is post-hoc, data-inten...
Tuesday, December 30, 2025
Yanxi Chen, Wenhui Zhu, Xiwen Chen ... · arXiv
Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g. generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False ...
Tianxin Xie, Wentao Lei, Guanjie Huang ... · arXiv
Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, prim...
Monday, December 29, 2025
Saifelden M. Ismail · 5 pages, 2 tables, 1 figure. Submitted to IEEE conference
Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled ...
Roee Ziv, Raz Lapid, Moshe Sipper · arXiv
Audio-language models combine audio encoders with large language models to enable multimodal reasoning, but they also introduce new security vulnerabilities. We propose a universal targeted latent space attack, an encoder-level adversarial attack that manipulates audio latent rep...
Zengwei Yao, Wei Kang, Han Zhu ... · arXiv
Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence and potential mode collapse during training, while diffusion methods require multi-step inference that i...