Audio ML Papers

Last 7 Days (March 20 - March 27, 2026)

Subcategories: All (29) | Speech Synthesis (8) | Music Synthesis (4) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (13)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 91)
Kangxiang Xia, Bingshen Mu, Xian Shi ... · ICME 2026
Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end...
#2 TOP PAPER (Score: 86)
Xingchen Song, Di Wu, Dinghao Zhou ... · arXiv
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it...
#3 TOP PAPER (Score: 84)
Xin Guo, Chunrui Zhao, Hong Jia ... · arXiv
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fi...
Wednesday, March 25, 2026
Kangxiang Xia, Bingshen Mu, Xian Shi ... · ICME 2026
Achieving natural full-duplex interaction in spoken dialogue systems (SDS) remains a challenge due to the difficulty of accurately detecting user interruptions. Current solutions are polarized between "trigger-happy" VAD-based methods that misinterpret backchannels and robust end...
Yadong Niu, Tianzi Wang, Heinrich Dinkel ... · ICASSP 2026
General audio understanding is a fundamental goal for large audio-language models, with audio captioning serving as a cornerstone task for their development. However, progress in this domain is hindered by existing datasets, which lack the scale and descriptive granularity requir...
Shengfan Shen, Di Wu, Xingchen Song ... · arXiv
Reliable evaluation of modern zero-shot text-to-speech (TTS) models remains challenging. Subjective tests are costly and hard to reproduce, while objective metrics often saturate, failing to distinguish SOTA systems. To address this, we propose Iterate to Differentiate (I2D), an ...
Massa Baali, Sarthak Bisht, Rita Singh ... · arXiv
Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum ...
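For context on the fixed-margin losses the abstract argues against, below is a minimal PyTorch sketch of an AAM-softmax (ArcFace-style) head, where one fixed margin is applied to every sample's target angle regardless of sample quality. This is standard background, not Curry's method; the quality-aware variant mentioned in the final comment is purely hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """ArcFace-style additive angular margin loss for speaker classification.

    The single fixed `margin` is added to every sample's target angle,
    i.e. all samples are treated equally regardless of label or audio quality.
    """

    def __init__(self, embed_dim: int, num_speakers: int,
                 scale: float = 30.0, margin: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.scale = scale
        self.margin = margin

    def forward(self, embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized embeddings and speaker prototypes.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the fixed margin only to each sample's target-speaker angle.
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        # A curriculum/quality-aware variant would replace `self.margin` with a
        # per-sample function of estimated sample quality (hypothetical here).
        return F.cross_entropy(self.scale * logits, labels)
```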
Chunbo Hao, Junjie Zheng, Guobin Ma ... · arXiv
Regenerating singing voices with altered lyrics while preserving melody consistency remains challenging, as existing methods either offer limited controllability or require laborious manual alignment. We propose YingMusic-Singer, a fully diffusion-based model enabling melody-cont...
Zhongweiyang Xu, Ashutosh Pandey, Juan Azcarreta ... · ICASSP 2026
Multi-channel speech enhancement aims to recover clean speech from noisy multi-channel recordings. Most deep learning methods employ discriminative training, which can lead to non-linear distortions from regression-based objectives, especially under challenging environmental nois...
Tuesday, March 24, 2026
Lucas H. Ueda, João G. T. Lima, Paula D. P. Costa · arXiv
Speech Emotion Recognition (SER) in real-world scenarios remains challenging due to severe class imbalance and the prevalence of spontaneous, natural speech. While recent approaches leverage self-supervised learning (SSL) representations and multimodal fusion of speech and text, ...
Octavian Pascu, Dan Oneata, Horia Cucu ... · arXiv
We introduce Echoes, a new dataset for music deepfake detection designed for training and benchmarking detectors under realistic and provider-diverse conditions. Echoes comprises 3,577 tracks (110 hours of audio) spanning multiple genres (pop, rock, electronic), and includes cont...
Zikang Huang, Meng Ge, Tianrui Wang ... · arXiv
Self-supervised learning (SSL) has advanced speech processing. However, existing speech SSL methods typically assume a single sampling rate and struggle with mixed-rate data due to temporal resolution mismatch. To address this limitation, we propose MSRHuBERT, a multi-sampling-ra...
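To make the temporal-resolution mismatch concrete: HuBERT-style CNN front-ends downsample waveforms by a fixed hop (320 samples for 16 kHz models, i.e. 50 frames per second), so audio at any other rate silently yields a different frame rate. A toy calculation, assuming the standard 320-sample hop:

```python
# Fixed-stride SSL front-ends produce different frame rates at different
# sampling rates -- the resolution mismatch the abstract refers to.
HOP_SAMPLES = 320  # standard HuBERT/wav2vec 2.0 downsampling factor at 16 kHz

def frames_per_second(sample_rate: int, hop_samples: int = HOP_SAMPLES) -> float:
    return sample_rate / hop_samples

print(frames_per_second(16_000))  # 50.0 frames/s (what the model was trained on)
print(frames_per_second(8_000))   # 25.0 frames/s: half the temporal resolution
```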
Heinrich Dinkel, Jiahao Zhou, Guanbo Wang ... · arXiv
This paper presents the Interspeech 2026 Audio Encoder Capability Challenge, a benchmark specifically designed to evaluate and advance the performance of pre-trained audio encoders as front-end modules for Large Audio Language Models (LALMs). While LALMs have shown remarkable und...
Saurabh Kataria, Xiao Hu · arXiv
Audio-Language Models (ALMs) are making strides in understanding speech and non-speech audio. However, domain-specialist Foundation Models (FMs) remain the best for closed-ended speech processing tasks such as Speech Emotion Recognition (SER). Using ALMs for zero-shot SER is a po...
Monday, March 23, 2026
Xin Guo, Chunrui Zhao, Hong Jia ... · arXiv
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fi...
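For readers unfamiliar with the baseline being extended: standard synchronous FedAvg aggregates client updates weighted by local data size, and a round completes only once every client reports back, which is where heterogeneous compute produces straggler effects. A minimal sketch of plain FedAvg (background only, not the paper's method):

```python
import copy
import torch

def fedavg(client_states: list[dict], client_sizes: list[int]) -> dict:
    """Plain synchronous FedAvg: average client weights, weighted by data size.

    Because aggregation waits for every client, a single slow (low-compute)
    client delays the whole round -- the straggler effect that arises under
    a unified fine-tuning configuration.
    """
    total = float(sum(client_sizes))
    averaged = copy.deepcopy(client_states[0])
    for key in averaged:
        averaged[key] = sum(
            state[key] * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return averaged

# Usage with two toy "clients" holding the same one-parameter model:
c1 = {"w": torch.tensor([1.0])}
c2 = {"w": torch.tensor([3.0])}
print(fedavg([c1, c2], [100, 300]))  # {'w': tensor([2.5000])}
```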
Tianyu Cao, Helin Wang, Ari Frummer ... · arXiv
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To ...
Risa Shinoda, Kaede Shiohara, Nakamasa Inoue ... · arXiv
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocaliza...
Chengzhi Li, Heyan Huang, Ping Jian ... · ICASSP 2026
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of so...
Lucas H. Ueda, João G. T. Lima, Pedro R. Corrêa ... · arXiv
This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglem...
Sunday, March 22, 2026
Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara ... · arXiv
Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning Engl...
Jianyi Chen, Rongxiu Zhong, Shilei Zhang ... · arXiv
Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful tri...
Saturday, March 21, 2026
Liyun Zhang, Xuanmeng Sha, Shuqiong Wu ... · arXiv
Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation ...
Kyudan Jung, Jihwan Kim, Minwoo Lee ... · arXiv
Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models strug...
Jingbin Hu, Haoyu Zhang, Dake Guo ... · arXiv
Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including sp...
Friday, March 20, 2026
Xingchen Song, Di Wu, Dinghao Zhou ... · arXiv
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it...
Insung Lee, Taeyoung Jeong, Haejun Yoo ... · arXiv
While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overloo...
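As background on the CLAP-based family of metrics the abstract critiques: such scores are reference-free, rating a candidate caption by the cosine similarity between the clip's audio embedding and the caption's text embedding under a contrastively trained audio-text model. A minimal sketch, assuming precomputed embeddings from some CLAP-style encoder (the encoders themselves are placeholders outside this snippet):

```python
import numpy as np

def clap_style_score(audio_emb: np.ndarray, caption_emb: np.ndarray) -> float:
    """Reference-free caption score: cosine similarity between the clip's
    audio embedding and the candidate caption's text embedding in the joint
    space of a contrastively trained audio-text model. Higher is better,
    and no human-written reference caption is needed."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = caption_emb / np.linalg.norm(caption_emb)
    return float(a @ t)
```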
Philippe Gonzalez, Vera Margrethe Frederiksen, Torsten Dau ... · IEEE Transactions on Audio, Speech, and Language Processing
A multi-task learning framework is proposed for optimizing a single deep neural network (DNN) for joint noise reduction (NR) and hearing loss compensation (HLC). A distinct training objective is defined for each task, and the DNN predicts two time-frequency masks. During inferenc...
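A minimal sketch of the general pattern this abstract describes: one shared DNN with two output heads, each predicting a time-frequency mask, trained with one objective per task. The layer choices, mask targets, and loss weighting below are illustrative placeholders, not the paper's actual design:

```python
import torch
import torch.nn as nn

class TwoMaskMTL(nn.Module):
    """Shared encoder with two mask heads: one for noise reduction (NR),
    one for hearing-loss compensation (HLC). Sizes are placeholders."""

    def __init__(self, freq_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(freq_bins, hidden, num_layers=2, batch_first=True)
        self.nr_head = nn.Sequential(nn.Linear(hidden, freq_bins), nn.Sigmoid())
        self.hlc_head = nn.Sequential(nn.Linear(hidden, freq_bins), nn.Sigmoid())

    def forward(self, noisy_mag: torch.Tensor):
        # noisy_mag: (batch, frames, freq_bins) magnitude spectrogram.
        h, _ = self.encoder(noisy_mag)
        return self.nr_head(h), self.hlc_head(h)

def joint_loss(nr_mask, hlc_mask, noisy_mag, clean_mag, compensated_mag, alpha=0.5):
    # One training objective per task, combined via a weighting hyperparameter.
    nr_loss = torch.mean((nr_mask * noisy_mag - clean_mag) ** 2)
    hlc_loss = torch.mean((hlc_mask * noisy_mag - compensated_mag) ** 2)
    return alpha * nr_loss + (1 - alpha) * hlc_loss
```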
You Li, Dewei Zhou, Fan Ma ... · IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-scre...
Yuqian Zhang, Donghua Yu, Zhengyuan Lin ... · arXiv
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency,...
Lokesh Kumar, Nirmesh Shah, Ashishkumar P. Gudmalwar ... · The 2nd International Workshop on Bodily Expressed Emotion Understanding (BEEU) at AAAI 2026
Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or li...
Yen-Ting Piao, Jay Chiehen Liao, Wei-Tang Chien ... · arXiv
While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framewo...
Candice R. Gerstner · arXiv
With the advancements in AI speech synthesis, it is easier than ever before to generate realistic audio in a target voice. One only needs a few seconds of reference audio from the target, quite literally putting words in the target person's mouth. This imposes a new set of forens...