Audio ML Papers

Week of January 11 - January 18, 2026

Subcategories: All (21) | Speech Synthesis (8) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (1) | ASR (4) | Other (4)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 91)
Chengyou Wang, Mingchen Shao, Jingbin Hu ... · arXiv
Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, the Wu dialect of Chinese has long been hindered by the lack of large-scale speech...
#2 TOP PAPER (Score: 83)
Dongchao Yang, Yuxin Xie, Yuguo Yin ... · arXiv
We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor,...
#3 TOP PAPER (Score: 83)
Jingbin Hu, Huakang Chen, Linhan Ma ... · arXiv
Despite rapid progress in text-to-speech (TTS), open-source systems still lack truly instruction-following, fine-grained control over core speech attributes (e.g., pitch, speaking rate, age, emotion, and style). We present VoiceSculptor, an open-source unified system that bridges...
Saturday, January 17, 2026
Ziang Guo, Feng Yang, Xuefeng Zhang ... · IV
Vision Language Action (VLA) models promise an open-vocabulary interface that can translate perceptual ambiguity into semantically grounded driving decisions, yet they still treat language as a static prior fixed at inference time. As a result, the model must infer continuously s...
Friday, January 16, 2026
Chengyou Wang, Mingchen Shao, Jingbin Hu ... · arXiv
Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, the Wu dialect of Chinese has long been hindered by the lack of large-scale speech...
Zhuoyue Gao, Xiaohui Wang, Xiaocui Yang ... · arXiv
Empathetic speech dialogue requires not only understanding linguistic content but also perceiving rich paralinguistic information such as prosody, tone, and emotional intensity for affective understanding. Existing speech-to-speech large language models either rely on ASR transc...
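For readers unfamiliar with the paralinguistic cues this abstract refers to, here is a deliberately crude sketch of the two simplest proxies: per-frame energy for intensity and autocorrelation-based pitch for prosody. This is a generic illustration, not this paper's pipeline; all names and parameter choices are invented.

```python
import numpy as np

def frame_prosody(x: np.ndarray, sr: int = 16000, frame: int = 512):
    """Crude per-frame paralinguistic features: RMS energy (intensity proxy)
    and an autocorrelation-based F0 estimate (prosody proxy).
    A toy sketch only; real systems use far more robust pitch trackers."""
    feats = []
    fmin, fmax = 60, 400  # plausible speech F0 search range in Hz
    for i in range(0, len(x) - frame, frame):
        seg = x[i:i + frame]
        rms = float(np.sqrt(np.mean(seg ** 2)))
        # Autocorrelation from zero lag onward; pick the strongest peak
        # inside the plausible pitch-lag window.
        ac = np.correlate(seg, seg, mode="full")[frame - 1:]
        lo, hi = sr // fmax, sr // fmin
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0 = sr / lag if ac[lag] > 0 else 0.0
        feats.append((rms, f0))
    return feats

# Toy usage: 1 s of a 120 Hz tone should yield F0 estimates near 120.
x = np.sin(2 * np.pi * 120 * np.arange(16000) / 16000)
print(frame_prosody(x)[:3])
```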
Tanyu Chen, Tairan Chen, Kai Shen ... · arXiv
Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction....
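As background on the discrete-representation interface this abstract mentions, the sketch below shows the basic operation shared by most speech tokenizers and neural codecs: snapping frame embeddings to the nearest entry of a learned codebook to produce token ids an LLM can consume. It is a generic illustration with hypothetical names, not this paper's tokenizer.

```python
import numpy as np

def quantize_frames(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest codebook vector.

    frames:   (T, D) array of per-frame speech features (hypothetical).
    codebook: (K, D) array of learned code vectors.
    Returns a length-T array of discrete token ids."""
    # Squared Euclidean distance between every frame and every code vector.
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy usage: 100 frames of 64-dim features, 256-entry codebook.
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 64))
codebook = rng.normal(size=(256, 64))
tokens = quantize_frames(frames, codebook)
print(tokens[:10])  # discrete speech tokens a decoder-only LLM could ingest
```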
Yirong Sun, Yanjun Chen, Xin Qiu ... · arXiv
Large Audio Language Models (LALMs) excel at semantic and paralinguistic tasks, yet their ability to perceive the fundamental physical attributes of audio such as pitch, loudness, and spatial location remains under-explored. To bridge this gap, we introduce SonicBench, a psychoph...
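The abstract is cut off, but a psychophysics-style probe of a physical attribute such as pitch typically presents paired stimuli differing by a controlled amount. Below is a hypothetical example of such a stimulus generator; it is not SonicBench's actual protocol, and the parameters are invented.

```python
import numpy as np

def pitch_pair(f0: float, cents: float, sr: int = 16000, dur: float = 0.5):
    """Generate a two-tone pitch-discrimination trial.

    f0:    base frequency in Hz.
    cents: offset of the second tone in cents (100 cents = 1 semitone).
    Returns (tone_a, tone_b); the model must judge which tone is higher."""
    t = np.arange(int(sr * dur)) / sr
    f1 = f0 * 2 ** (cents / 1200.0)  # cents-to-frequency-ratio conversion
    tone_a = 0.1 * np.sin(2 * np.pi * f0 * t)
    tone_b = 0.1 * np.sin(2 * np.pi * f1 * t)
    return tone_a, tone_b

a, b = pitch_pair(440.0, cents=25.0)  # a quarter-semitone difference
```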
Thursday, January 15, 2026
Dongchao Yang, Yuxin Xie, Yuguo Yin ... · arXiv
We present a family of open-source Music Foundation Models designed to advance large-scale music understanding and generation across diverse tasks and modalities. Our framework consists of four major components: (1) HeartCLAP, an audio-text alignment model; (2) HeartTranscriptor,...
Yibo Zhang, Liang Lin, Kaiwen Luo ... · arXiv
While Audio Large Models (ALMs) have achieved remarkable proficiency, their robustness remains brittle in real-world deployment. Existing evaluations largely rely on synthetic Gaussian noise or simplistic single-source interference, failing to capture the intricate, multi-layered...
Jingbin Hu, Huakang Chen, Linhan Ma ... · arXiv
Despite rapid progress in text-to-speech (TTS), open-source systems still lack truly instruction-following, fine-grained control over core speech attributes (e.g., pitch, speaking rate, age, emotion, and style). We present VoiceSculptor, an open-source unified system that bridges...
Runyuan Cai, Yu Lin, Yiming Wang ... · arXiv
Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper,...
Wednesday, January 14, 2026
Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel ... · arXiv
This paper addresses unsupervised diffusion-based single-channel speech enhancement (SE). Prior work in this direction combines a score-based diffusion model trained on clean speech with a Gaussian noise model whose covariance is structured by non-negative matrix factorization (N...
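For context on the NMF ingredient named in this abstract: a noise power spectrogram V can be approximated as a low-rank product W H, which then supplies a structured, time-varying noise variance for the Gaussian noise model. The sketch below uses the classic multiplicative-update rules for the Euclidean NMF objective; it is a generic illustration, not the authors' exact model.

```python
import numpy as np

def nmf_noise_psd(V: np.ndarray, rank: int = 8, n_iter: int = 100,
                  eps: float = 1e-10):
    """Factor a nonnegative noise power spectrogram V (F x T) as V ~ W @ H.

    Columns of W are spectral templates; rows of H are their activations.
    W @ H serves as a diagonal, time-varying noise covariance estimate."""
    F, T = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((F, rank)) + eps
    H = rng.random((rank, T)) + eps
    for _ in range(n_iter):
        # Multiplicative updates; keep everything nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy usage: V could be |STFT|^2 of an estimated noise signal.
V = np.abs(np.random.default_rng(1).normal(size=(257, 100))) ** 2
W, H = nmf_noise_psd(V)
noise_psd = W @ H  # per-bin noise variance for the Gaussian noise model
```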
Hanlin Zhang, Daxin Tan, Dehua Tao ... · arXiv
Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve b...
Pierfrancesco Melucci, Paolo Merialdo, Taketo Akama · arXiv
Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as th...
Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian ... · arXiv
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both spee...
Tuesday, January 13, 2026
Rahul Bapusaheb Kodag, Vipul Arora · arXiv
Tabla Stroke Transcription (TST) is central to the analysis of rhythmic structure in Hindustani classical music, yet remains challenging due to complex rhythmic organization and the scarcity of strongly annotated data. Existing approaches largely rely on fully supervised learning...
Monday, January 12, 2026
Surya Subramani, Hashim Ali, Hafiz Malik · arXiv
Speaker-specific anti-spoofing and synthesis-source tracing are central challenges in audio anti-spoofing. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we intr...
Guobin Ma, Yuxuan Xia, Jixun Yao ... · arXiv
This paper summarizes the ICASSP 2026 Automatic Song Aesthetics Evaluation (ASAE) Challenge, which focuses on predicting the subjective aesthetic scores of AI-generated songs. The challenge consists of two tracks: Track 1 targets the prediction of the overall musicality score, wh...
Tiantian Feng, Anfeng Xu, Jinkook Lee ... · arXiv
In this work, we present a novel perspective on cognitive impairment classification from speech by integrating speech foundation models that explicitly recognize speech dialects. Our motivation is based on the observation that individuals with Alzheimer's Disease (AD) or mild cog...
Yuanhe Zhang, Jiayu Tian, Yibo Zhang ... · arXiv
Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing ...
Sunday, January 11, 2026
Mingyue Huo, Yiwen Shao, Yuheng Zhang · arXiv
We present TagSpeech, a unified LLM-based framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn ...
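For context on the SOT recipe this abstract cites: Serialized Output Training is a known multi-talker ASR technique in which overlapping speakers' transcripts are flattened into a single target sequence, ordered by start time, with a change token at speaker switches. The sketch below is illustrative; the token name is hypothetical and not necessarily TagSpeech's.

```python
def serialize_sot(utterances, change_token="<sc>"):
    """Build a Serialized Output Training target: concatenate utterances
    in first-come-first-served order, inserting a speaker-change token
    whenever the active speaker switches.

    utterances: list of (start_time, speaker_id, text) tuples."""
    target, prev_spk = [], None
    for start, spk, text in sorted(utterances, key=lambda u: u[0]):
        if prev_spk is not None and spk != prev_spk:
            target.append(change_token)
        target.append(text)
        prev_spk = spk
    return " ".join(target)

mix = [(0.0, "A", "hello there"),
       (1.2, "B", "hi how are you"),
       (2.5, "A", "good thanks")]
print(serialize_sot(mix))
# hello there <sc> hi how are you <sc> good thanks
```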
Mohd Mujtaba Akhtar, Girish, Muskaan Singh · EACL 2026
In this study, we present a multimodal framework for predicting neuro-facial disorders by capturing both vocal and facial cues. We hypothesize that explicitly disentangling shared and modality-specific representations within multimodal foundation model embeddings can enhance clin...
Mohd Mujtaba Akhtar, Girish, Farhan Sheth ... · EACL 2026
We propose a unified framework not only for attributing synthetic speech to its source but also for detecting speech generated by synthesizers that were not encountered during training. This requires methods that move beyond simple detection to support both detailed forensic anal...