Audio ML Papers

Last 7 Days (March 21 - March 28, 2026)

Subcategories: All (24) | Speech Synthesis (4) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (12)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Kangxiang Xia, Bingshen Mu, Xian Shi ... · ICME 2026
Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end...
#2 TOP PAPER (Score: 84)
Xin Guo, Chunrui Zhao, Hong Jia ... · arXiv
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fi...
#3 TOP PAPER (Score: 84)
Tianyu Cao, Helin Wang, Ari Frummer ... · arXiv
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To ...
Thursday, March 26, 2026
Shangkun Huang, Huan Shen, Wei Zou ... · arXiv
Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurate hotword localization in full-utterance speech under weak supervision. We prop...
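To make the retrieval-augmented biasing idea in the abstract above concrete, here is a minimal sketch of shallow-fusion hotword boosting during beam-search rescoring. The hotword list, boost weight, and scoring interface are illustrative assumptions, not this paper's method:

```python
# Minimal sketch of shallow-fusion hotword biasing (illustrative only):
# hypotheses that end with a prefix of a hotword receive a score bonus,
# so rare named entities survive pruning during beam search.

HOTWORDS = ["kubernetes", "anaphylaxis"]  # hypothetical biasing list
BOOST = 2.0                               # per-character bonus (assumed)

def biased_score(base_score: float, hyp_text: str) -> float:
    """Add a bonus proportional to the longest hotword prefix hyp ends with."""
    bonus = 0.0
    for word in HOTWORDS:
        for k in range(len(word), 0, -1):
            if hyp_text.endswith(word[:k]):
                bonus = max(bonus, BOOST * k)
                break
    return base_score + bonus

# e.g. when ranking beam-search candidates:
print(biased_score(-12.3, "deploy it on kuber"))   # prefix match -> boosted
print(biased_score(-12.3, "deploy it on cooper"))  # no match -> unchanged
```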
Huan Shen, Yingao Wang, Shangkun Huang ... · arXiv
Turn-taking modeling is fundamental to spoken dialogue systems, yet its evaluation remains fragmented and often limited to binary boundary detection under narrow interaction settings. Such protocols hinder systematic comparison and obscure model weaknesses across conversational c...
Wednesday, March 25, 2026
Kangxiang Xia, Bingshen Mu, Xian Shi ... · ICME 2026
Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end...
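The "trigger-happy" failure mode named in this abstract is easy to reproduce. Below is a toy energy-based barge-in detector, with frame size and thresholds as assumed values; it shows why a short backchannel is indistinguishable from a genuine interruption under naive VAD:

```python
import numpy as np

# Toy energy-based barge-in detector (illustrates the failure mode, not the
# paper's system): any user audio above an energy threshold for a few frames
# cuts the agent off -- including backchannels like "uh-huh".

FRAME = 160          # 10 ms at 16 kHz (assumed)
THRESH = 0.02        # RMS energy threshold (assumed)
MIN_FRAMES = 5       # ~50 ms of voiced audio triggers an interrupt

def naive_barge_in(audio: np.ndarray) -> bool:
    frames = audio[: len(audio) // FRAME * FRAME].reshape(-1, FRAME)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    voiced = rms > THRESH
    # find the longest voiced run; a short backchannel already exceeds it
    run, best = 0, 0
    for v in voiced:
        run = run + 1 if v else 0
        best = max(best, run)
    return best >= MIN_FRAMES

backchannel = 0.1 * np.random.randn(16000 // 4)  # ~250 ms of "uh-huh"
print(naive_barge_in(backchannel))  # True: falsely treated as an interruption
```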
Yadong Niu, Tianzi Wang, Heinrich Dinkel ... · ICASSP 2026
General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity requir...
Shengfan Shen, Di Wu, Xingchen Song ... · arXiv
Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an ...
Zhongweiyang Xu, Ashutosh Pandey, Juan Azcarreta ... · arXiv
We propose Uni-ArrayDPS, a novel diffusion-based refinement framework for unified multi-channel speech enhancement and separation. Existing methods for multi-channel speech enhancement/separation are mostly discriminative and are highly effective at producing high-SNR outputs. Ho...
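For readers unfamiliar with diffusion posterior sampling, the following is a generic single guidance step in DDIM notation: a sketch of the idea behind diffusion-based refinement, not the actual Uni-ArrayDPS update. Here `score_model`, the array forward operator `A`, and the schedule `alpha_bar` are all assumed placeholders:

```python
import torch

# One generic diffusion-posterior-sampling (DPS) guidance step (a sketch
# under standard DDIM notation; NOT the Uni-ArrayDPS algorithm itself).

def dps_step(x_t, t, y, score_model, A, alpha_bar, zeta=0.5):
    x_t = x_t.detach().requires_grad_(True)
    eps = score_model(x_t, t)                             # predicted noise
    a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # Tweedie estimate
    # measurement consistency: how far A(x0_hat) is from the mic signals y
    loss = (A(x0_hat) - y).pow(2).sum()
    grad = torch.autograd.grad(loss, x_t)[0]
    # deterministic DDIM step toward t-1, nudged by the guidance gradient
    x_prev = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
    return (x_prev - zeta * grad).detach()
```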
Massa Baali, Sarthak Bisht, Rita Singh ... · arXiv
Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum ...
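The abstract truncates before Curry's details, but the core contrast with fixed-margin losses can be sketched: an additive angular margin (AAM-softmax) whose margin is scaled by a per-sample quality score, so degraded or mislabeled samples exert a weaker pull on the speaker manifold. The quality estimate and schedule below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

# Quality-aware AAM-softmax sketch (illustrative; the quality score and its
# schedule are assumed, since the abstract cuts off before the method).

def quality_aam_loss(emb, weight, labels, quality, s=30.0, m_max=0.3):
    emb = F.normalize(emb, dim=1)            # (B, D) speaker embeddings
    weight = F.normalize(weight, dim=1)      # (C, D) class centers
    cos = emb @ weight.t()                   # cosine logits
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    m = m_max * quality                      # per-sample margin in [0, m_max]
    target = torch.cos(theta.gather(1, labels[:, None]) + m[:, None])
    logits = cos.scatter(1, labels[:, None], target)
    return F.cross_entropy(s * logits, labels)
```

With `quality = 1` for all samples this reduces to standard AAM-softmax; a fixed-margin loss is the special case the abstract argues against.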
Chunbo Hao, Junjie Zheng, Guobin Ma ... · arXiv
Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-cont...
Zhongweiyang Xu, Ashutosh Pandey, Juan Azcarreta ... · ICASSP 2026
Multi-channel speech enhancement aims to recover clean speech from noisy multi-channel recordings. Most deep learning methods employ discriminative training, which can lead to non-linear distortions from regression-based objectives, especially under challenging environmental nois...
Tuesday, March 24, 2026
Lucas H. Ueda, João G. T. Lima, Paula D. P. Costa · arXiv
Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to severe class imbalance and the prevalence of spontaneous, natural speech. While recent approaches leverage self-supervised learning (SSL) representations and multimodal fusion of speech and text, ...
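As context for the class-imbalance problem this abstract describes, one standard baseline mitigation is inverse-frequency class weighting in the cross-entropy loss. This is a generic illustration, not the paper's proposed approach, and the class counts are assumed:

```python
import torch
import torch.nn.functional as F

# Inverse-frequency class weighting: rare emotion classes get larger weights
# so the loss does not collapse onto the majority (often "neutral") class.

counts = torch.tensor([5000., 1200., 300., 80.])   # assumed per-class counts
weights = counts.sum() / (len(counts) * counts)    # rare classes weigh more
weights = weights / weights.sum() * len(counts)    # normalize around 1.0

logits = torch.randn(8, 4)                 # batch of emotion logits
labels = torch.randint(0, 4, (8,))
loss = F.cross_entropy(logits, labels, weight=weights)
print(weights)                             # minority classes dominate the loss
```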
Octavian Pascu, Dan Oneata, Horia Cucu ... · arXiv
We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes cont...
Zikang Huang, Meng Ge, Tianrui Wang ... · arXiv
Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSRHuBERT, a multi-sampling-ra...
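The temporal resolution mismatch motivating MSRHuBERT fits in one calculation: HuBERT-style encoders use a fixed total CNN stride of 320 samples, so the feature frame rate depends on the input sampling rate. (MSRHuBERT's actual mechanism is not shown in the truncated abstract.)

```python
# Fixed CNN stride => sampling-rate-dependent frame rates.
STRIDE = 320  # total downsampling of the standard wav2vec2/HuBERT frontend

for sr in (8_000, 16_000, 24_000):
    print(f"{sr} Hz -> {sr / STRIDE:.0f} feature frames per second")
# 8 kHz -> 25 fps, 16 kHz -> 50 fps, 24 kHz -> 75 fps: the same utterance
# yields incompatible sequence lengths across sampling rates.
```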
Heinrich Dinkel, Jiahao Zhou, Guanbo Wang ... · arXiv
This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable und...
Saurabh Kataria, Xiao Hu · arXiv
Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a po...
Monday, March 23, 2026
Xin Guo, Chunrui Zhao, Hong Jia ... · arXiv
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fi...
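The straggler effect this abstract mentions is visible in a minimal FedAvg round where each client can only afford a capacity-dependent number of local steps. Everything below (client interface, loss factory, hyperparameters) is an assumed illustration, not the paper's method:

```python
import copy
import torch

# Minimal FedAvg round with heterogeneous client capacity: slow clients
# complete fewer local steps, yet a synchronous round still waits for all.

def fedavg_round(global_model, clients, make_loss):
    states, sizes = [], []
    for data, capacity in clients:               # capacity = affordable steps
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=0.01)
        for _ in range(capacity):                # stragglers: small capacity
            opt.zero_grad()
            make_loss(local, data).backward()
            opt.step()
        states.append(local.state_dict())
        sizes.append(len(data))
    total = sum(sizes)
    avg = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```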
Tianyu Cao, Helin Wang, Ari Frummer ... · arXiv
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To ...
Risa Shinoda, Kaede Shiohara, Nakamasa Inoue ... · arXiv
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocaliza...
Chengzhi Li, Heyan Huang, Ping Jian ... · ICASSP 2026
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of so...
Lucas H. Ueda, João G. T. Lima, Pedro R. Corrêa ... · arXiv
This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglem...
Sunday, March 22, 2026
Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara ... · arXiv
Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning Engl...
Jianyi Chen, Rongxiu Zhong, Shilei Zhang ... · arXiv
Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful tri...
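The "prohibitive memory requirements" this abstract cites can be made concrete with a back-of-the-envelope estimate of vanilla attention, whose score matrix grows quadratically with sequence length. The token rate and precision below are illustrative assumptions:

```python
# Attention-matrix memory vs. audio length (illustrative estimate).
TOKENS_PER_SEC = 50          # e.g. a 50 Hz audio tokenizer (assumed)
BYTES = 2                    # fp16 per attention score

for minutes in (1, 5, 15):
    L = minutes * 60 * TOKENS_PER_SEC
    gib = L * L * BYTES / 2**30          # one head, one layer, batch size 1
    print(f"{minutes:>2} min -> L={L:,} tokens, attn matrix ~{gib:.2f} GiB")
# 15 minutes at 50 Hz is 45,000 tokens: ~3.8 GiB per head per layer.
```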
Saturday, March 21, 2026
Liyun Zhang, Xuanmeng Sha, Shuqiong Wu ... · arXiv
Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation ...
Kyudan Jung, Jihwan Kim, Minwoo Lee ... · arXiv
Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models strug...
Jingbin Hu, Haoyu Zhang, Dake Guo ... · arXiv
Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including sp...