Audio ML Papers

Week of November 09 - November 16, 2025

Subcategories: All (33) | Speech Synthesis (10) | Music Synthesis (4) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (1) | ASR (5) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Umberto Cappellazzo, Xubo Liu, Pingchuan Ma ... · arXiv
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-ba...
#2 TOP PAPER (Score: 83)
Zhisheng Zhang, Derui Wang, Yifan Mi ... · NeurIPS 2025
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense technique...
#3 TOP PAPER (Score: 83)
Andong Li, Tong Lei, Rilin Chen ... · arXiv
This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, by rank analysis, we compare the rank characteristics of the Mel-spectrum with other common acoustic degradation factors, and cast t...
Saturday, November 15, 2025
Zhisheng Zheng, Puyuan Peng, Anuj Diwan ... · arXiv
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polis...
Xinyue Yu, Youqing Fang, Pingyu Wu ... · arXiv · AAAI 2026
Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mec...
Friday, November 14, 2025
Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao ... · arXiv
Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-speci...
HongYu Liu, Junxin Li, Changxi Guo ... · Proceedings of the 28th European Conference on Artificial Intelligence (ECAI 2025)
Recognizing speaker intent in long multi-speaker audio dialogues has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogG...
Guangke Chen, Yuhui Wang, Shouling Ji ... · arXiv
Modern text-to-speech (TTS) systems, particularly those built on Large Audio-Language Models (LALMs), generate high-fidelity speech that faithfully reproduces input text and mimics specified speaker identities. While prior misuse studies have focused on speaker impersonation, thi...
Hualei Wang, Yiming Li, Shuo Ma ... · The Fortieth AAAI Conference on Artificial Intelligence (AAAI 2026)
Recent Large Audio-Language Models (LALMs) exhibit impressive capabilities in understanding audio content for conversational QA tasks. However, these models struggle to accurately understand timestamps for temporal localization (e.g., Temporal Audio Grounding) and are restricted ...
Yifan Zhuang, Calvin Huang, Zepeng Yu ... · arXiv
Brain-computer interface (BCI) speech decoding has emerged as a promising tool for assisting individuals with speech impairments. In this context, the integration of electroencephalography (EEG) and electromyography (EMG) signals offers strong potential for enhancing decoding per...
Thursday, November 13, 2025
Sreyan Ghosh, Arushi Goel, Lasha Koroshinadze ... · arXiv
We introduce Music Flamingo, a novel large audio-language model designed to advance music (including song) understanding in foundational audio models. While audio-language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dens...
Yudong Yang, Xuezhen Zhang, Zhifeng Han ... · arXiv
Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposes new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition...
Farhan Sheth, Girish, Mohd Mujtaba Akhtar ... · IJCNLP-AACL 2025
In this work, we address the challenge of generalizable audio deepfake detection (ADD) across diverse speech synthesis paradigms, including conventional text-to-speech (TTS) systems and modern diffusion- or flow-matching (FM) based generators. Prior work has mostly targeted individ...
Wednesday, November 12, 2025
Tianzi Wang, Xurong Xie, Zengrui Jin ... · IEEE Transactions on Audio, Speech and Language Processing
Automatic speech recognition (ASR) systems often rely on autoregressive (AR) Transformer decoder architectures, which limit efficient inference parallelization due to their sequential nature. To this end, non-autoregressive (NAR) approaches aim primarily to achieve significant de...
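The contrast the abstract draws is easy to see in code. Below is a minimal sketch of one common NAR decoding scheme, CTC-style greedy decoding, where every frame is decoded from a single forward pass instead of token by token; it is a generic illustration, not necessarily this paper's method.

```python
# Minimal sketch (not the paper's method): why non-autoregressive (NAR)
# decoding parallelizes where autoregressive (AR) decoding cannot.
# A CTC-style head emits all frame posteriors in one forward pass;
# decoding is then a per-frame argmax plus blank/repeat collapse.
import numpy as np

BLANK = 0  # conventional CTC blank index

def ctc_greedy_decode(log_probs: np.ndarray) -> list:
    """log_probs: (T, V) per-frame log-posteriors from a single forward pass."""
    best = log_probs.argmax(axis=-1)          # all T frames decoded at once
    tokens, prev = [], BLANK
    for t in best:                            # collapse repeats, drop blanks
        if t != BLANK and t != prev:
            tokens.append(int(t))
        prev = t
    return tokens

# Toy example: 6 frames, 4-symbol vocabulary (0 = blank).
rng = np.random.default_rng(0)
log_probs = np.log(rng.dirichlet(np.ones(4), size=6))
print(ctc_greedy_decode(log_probs))
```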
Xinyi Tong, Yiran Zhu, Jishang Chen ... · arXiv
Video-to-Music generation seeks to generate musically appropriate background music that enhances audiovisual immersion for videos. However, current approaches suffer from two critical limitations: 1) incomplete representation of video details, leading to weak alignment, and 2) in...
Shulei Ji, Zihao Wang, Jiaxing Yu ... · arXiv
Video-to-music (V2M) generation aims to create music that aligns with visual content. However, two main challenges persist in existing methods: (1) the lack of explicit rhythm modeling hinders audiovisual temporal alignments; (2) effectively integrating various visual features to...
Jiliang Hu, Zuchao Li, Baoyuan Qi ... · AAAI 2026
Significant progress has been made in spoken question answering (SQA) in recent years. However, many existing methods, including large audio language models, struggle with processing long audio. Following the success of retrieval-augmented generation, a speech-related retriever show...
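For readers unfamiliar with the retrieval step such pipelines rely on, here is a minimal sketch: score each segment of a long recording against the question and keep the top-k for the answerer. The random embeddings and the `top_k_segments` helper are stand-ins; a real system would use a speech or text encoder.

```python
# Minimal sketch of the retrieval step in a retrieval-augmented SQA
# pipeline: cosine-similarity ranking over pre-computed segment
# embeddings. Embeddings here are random placeholders.
import numpy as np

def top_k_segments(question_emb, segment_embs, k=3):
    """Return indices and scores of the k segments closest to the question."""
    q = question_emb / np.linalg.norm(question_emb)
    s = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    scores = s @ q
    order = np.argsort(scores)[::-1][:k]
    return order, scores[order]

rng = np.random.default_rng(1)
segments = rng.normal(size=(100, 256))   # 100 segments of a long recording
question = rng.normal(size=256)
idx, scores = top_k_segments(question, segments)
print(idx, scores.round(3))
```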
Hongyi Li, Chengxuan Zhou, Chu Wang ... · AAAI 2026
Large Audio-language Models (LAMs) have recently enabled powerful speech-based interactions by coupling audio encoders with Large Language Models (LLMs). However, the security of LAMs under adversarial attacks remains underexplored, especially through audio jailbreaks that craft ...
Tuesday, November 11, 2025
Eloi Moliner, Marco A. Martínez-Ramírez, Junghyun Koo ... · arXiv
Music mixing involves combining individual tracks into a cohesive mixture, a task characterized by subjectivity where multiple valid solutions exist for the same input. Existing automatic mixing systems treat this task as a deterministic regression problem, thus ignoring this mul...
Bingsong Bai, Yizhong Geng, Fengping Wang ... · AAAI 2026 main technical track
Zero-shot singing voice conversion (SVC) transforms a source singer's timbre to an unseen target speaker's voice while preserving melodic content without fine-tuning. Existing methods model speaker timbre and vocal content separately, losing essential acoustic information that de...
Julian Irigoyen, Arthur Söhler, Andreas Søeborg Kirkedal · arXiv
We challenge the conventional view of neural network pruning as solely a compression technique, demonstrating that one-shot magnitude pruning serves as a powerful implicit regularizer for ASR. Using Whisper-small, we combine gradient- and Fisher-based sensitivity diagnostics with...
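One-shot magnitude pruning itself is simple to reproduce. The sketch below uses PyTorch's built-in pruning utilities; a toy two-layer model stands in for Whisper-small, the 30% ratio is illustrative, and the paper's sensitivity diagnostics are not reproduced.

```python
# Minimal sketch of one-shot global magnitude pruning with PyTorch.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 32))

# Collect every weight tensor, then zero the 30% smallest by |magnitude|
# across the whole model in a single shot (global, unstructured pruning).
params = [(m, "weight") for m in model if isinstance(m, nn.Linear)]
prune.global_unstructured(params, pruning_method=prune.L1Unstructured, amount=0.3)

# Make the pruning permanent (folds the mask into the weights).
for module, name in params:
    prune.remove(module, name)

total = sum(p.numel() for p in model.parameters() if p.dim() > 1)
zeros = sum((p == 0).sum().item() for p in model.parameters() if p.dim() > 1)
print(f"global sparsity: {zeros / total:.1%}")
```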
Xueyao Zhang, Chaoren Wang, Huan Liao ... · arXiv
Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address...
Jiaqi Wang, Liutao Yu, Xiongri Shen ... · The Fortieth AAAI Conference on Artificial Intelligence (AAAI 2026)
Spiking neural networks (SNNs) offer a promising path toward energy-efficient speech command recognition (SCR) by leveraging their event-driven processing paradigm. However, existing SNN-based SCR methods often struggle to capture rich temporal dependencies and contextual informa...
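The event-driven processing the abstract refers to comes from spiking neuron dynamics. Below is a sketch of a standard leaky integrate-and-fire (LIF) neuron, the usual building block of such SCR models; the constants are illustrative and this is not the paper's specific architecture.

```python
# Minimal sketch of leaky integrate-and-fire (LIF) dynamics: the membrane
# potential leaks each step, integrates input current, and emits a binary
# spike (then resets) whenever it crosses threshold.
import numpy as np

def lif(inputs, beta=0.9, threshold=1.0):
    """inputs: (T,) input current; returns (T,) binary spike train."""
    v, spikes = 0.0, []
    for x in inputs:
        v = beta * v + x              # leak + integrate
        s = float(v >= threshold)     # fire when threshold is crossed
        v = v * (1.0 - s)             # hard reset after a spike
        spikes.append(s)
    return np.array(spikes)

rng = np.random.default_rng(2)
current = rng.uniform(0, 0.4, size=50)   # e.g., one band of a filterbank feature
print(lif(current).astype(int))
```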
Lu Gan, Xi Li · arXiv
The development of high-performance, on-device keyword spotting (KWS) systems for ultra-low-power hardware is critically constrained by the scarcity of specialized, multi-command training datasets. Traditional data collection through human recording is costly, slow, and lacks sca...
Shu-wen Yang, Ming Tu, Andy T. Liu ... · arXiv
Speech-to-Speech (S2S) models have shown promising dialogue capabilities, but their ability to handle paralinguistic cues, such as emotion, tone, and speaker attributes, and to respond appropriately in both content and style remains underexplored. Progress is further hindered by ...
Monday, November 10, 2025
Andong Li, Tong Lei, Rilin Chen ... · arXiv
This paper revisits the neural vocoder task through the lens of audio restoration and proposes a novel diffusion vocoder called BridgeVoC. Specifically, by rank analysis, we compare the rank characteristics of the Mel-spectrum with other common acoustic degradation factors, and cast t...
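The kind of rank analysis the abstract alludes to can be prototyped in a few lines: compute a mel-spectrogram and inspect its singular-value decay as a proxy for rank. The sketch below assumes librosa; the synthetic chirp and the 1% threshold are illustrative, not the paper's actual analysis.

```python
# Minimal sketch: singular-value decay of a mel-spectrogram as a rank proxy.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = np.sin(2 * np.pi * (200 + 300 * t) * t)   # simple chirp as a stand-in signal

S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
sv = np.linalg.svd(np.log1p(S), compute_uv=False)

# Effective rank: number of singular values above 1% of the largest.
eff_rank = int((sv > 0.01 * sv[0]).sum())
print(f"mel shape: {S.shape}, effective rank: {eff_rank}")
```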
Zhisheng Zhang, Derui Wang, Yifan Mi ... · NeurIPS 2025
Recent advancements in speech synthesis technology have enriched our daily lives, with high-quality and human-like audio widely adopted across real-world applications. However, malicious exploitation like voice-cloning fraud poses severe security risks. Existing defense technique...
Umberto Cappellazzo, Xubo Liu, Pingchuan Ma ... · arXiv
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities, including Auditory Speech Recognition (ASR), Visual Speech Recognition (VSR), and Audio-Visual Speech Recognition (AVSR). Despite this progress, current LLM-ba...
Feyisayo Olalere, Kiki van der Heijden, H. Christiaan Stronks ... · arXiv
Classroom environments are particularly challenging for children with hearing impairments, where background noise, multiple talkers, and reverberation degrade speech perception. These difficulties are greater for children than adults, yet most deep learning speech separation mode...
S Sakshi, Vaibhavi Lokegaonkar, Neil Zhang ... · arXiv
Spatial perception is central to auditory intelligence, enabling accurate understanding of real-world acoustic scenes and advancing human-level perception of the world around us. While recent large audio-language models (LALMs) show strong reasoning over complex audios, most oper...
Weilong Huang, Srikanth Raj Chetupalli, Mhd Modar Halimeh ... · arXiv
Beamforming with desired directivity patterns using compact microphone arrays is essential in many audio applications. Directivity patterns achievable using traditional beamformers depend on the number of microphones and the array aperture. Generally, their effectiveness degrades...
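The dependence on microphone count and aperture mentioned above is textbook beamforming, and easy to visualize. Below is a sketch of the delay-and-sum beam pattern of a uniform linear array steered to broadside; all constants are illustrative and this is not the paper's proposed method.

```python
# Minimal sketch: delay-and-sum directivity pattern of a uniform linear
# array (ULA) with M microphones at spacing d, steered to angle `steer`.
import numpy as np

def ds_beam_pattern(theta, M=4, d=0.02, f=2000.0, c=343.0, steer=np.pi / 2):
    """|array response| over look angles theta (radians)."""
    m = np.arange(M)[:, None]
    phase = 2j * np.pi * f * d * m * (np.cos(theta) - np.cos(steer)) / c
    return np.abs(np.exp(phase).mean(axis=0))

theta = np.linspace(0, np.pi, 181)
pattern = ds_beam_pattern(theta)
# Narrower main lobes require more microphones or a larger aperture (M * d),
# which is exactly the limitation compact arrays run into.
print(pattern.max(), pattern.min())
```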
Euihyeok Lee, Seonghyeon Kim, SangHun Im ... · arXiv
Self-talk, an internal dialogue that can occur silently or be spoken aloud, plays a crucial role in emotional regulation, cognitive processing, and motivation, yet has remained largely invisible and unmeasurable in everyday life. In this paper, we present MutterMeter, a mobile syst...
Sunday, November 09, 2025
Dachao Han, Teng Huang, Han Ding ... · arXiv
With the rise of voice-enabled technologies, loudspeaker playback has become widespread, posing increasing risks to speech privacy. Traditional eavesdropping methods often require invasive access or line-of-sight, limiting their practicality. In this paper, we present mmSpeech, a...