Audio ML Papers

Last 7 Days (March 18 - March 25, 2026)

Subcategories: All (27) | Speech Synthesis (7) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 86)
Xingchen Song, Di Wu, Dinghao Zhou ... · arXiv
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it...
#2 TOP PAPER (Score: 84)
Matthew Wiesner, Samuele Cornell, Alexander Polok ... · arXiv
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We t...
#3 TOP PAPER (Score: 84)
Xin Guo, Chunrui Zhao, Hong Jia ... · arXiv
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fi...
Monday, March 23, 2026
Xin Guo, Chunrui Zhao, Hong Jia ... · arXiv
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fi...
Tianyu Cao, Helin Wang, Ari Frummer ... · arXiv
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To ...
Risa Shinoda, Kaede Shiohara, Nakamasa Inoue ... · arXiv
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocaliza...
Chengzhi Li, Heyan Huang, Ping Jian ... · ICASSP 2026
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of so...
Lucas H. Ueda, João G. T. Lima, Pedro R. Corrêa ... · arXiv
This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglem...
Sunday, March 22, 2026
Zi Haur Pang, Xiaoxue Gao, Tatsuya Kawahara ... · arXiv
Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning Engl...
Jianyi Chen, Rongxiu Zhong, Shilei Zhang ... · arXiv
Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful tri...
Saturday, March 21, 2026
Liyun Zhang, Xuanmeng Sha, Shuqiong Wu ... · arXiv
Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation ...
Kyudan Jung, Jihwan Kim, Minwoo Lee ... · arXiv
Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models strug...
Jingbin Hu, Haoyu Zhang, Dake Guo ... · arXiv
Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including sp...
Friday, March 20, 2026
Xingchen Song, Di Wu, Dinghao Zhou ... · arXiv
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it...
Insung Lee, Taeyoung Jeong, Haejun Yoo ... · arXiv
While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overloo...
Philippe Gonzalez, Vera Margrethe Frederiksen, Torsten Dau ... · IEEE Transactions on Audio, Speech, and Language Processing
A multi-task learning framework is proposed for optimizing a single deep neural network (DNN) for joint noise reduction (NR) and hearing loss compensation (HLC). A distinct training objective is defined for each task, and the DNN predicts two time-frequency masks. During inferenc...
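As a rough illustration of the two-mask idea described above (a minimal sketch only; the encoder, mask ranges, and combination rule here are assumptions, not the authors' architecture):

```python
import torch
import torch.nn as nn

class TwoMaskNet(nn.Module):
    """Toy multi-task network: a shared encoder with two mask heads,
    one for noise reduction (NR) and one for hearing loss compensation (HLC)."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_freq, hidden, batch_first=True)
        self.nr_head = nn.Linear(hidden, n_freq)
        self.hlc_head = nn.Linear(hidden, n_freq)

    def forward(self, mag):  # mag: (batch, frames, n_freq) magnitude spectrogram
        h, _ = self.encoder(mag)
        nr_mask = torch.sigmoid(self.nr_head(h))  # suppressive mask in [0, 1]
        hlc_gain = torch.relu(self.hlc_head(h))   # non-negative compensation gain
        return nr_mask, hlc_gain

net = TwoMaskNet()
noisy_mag = torch.rand(1, 100, 257)          # |STFT| of a noisy utterance
nr_mask, hlc_gain = net(noisy_mag)
enhanced = noisy_mag * nr_mask * hlc_gain    # one possible way to combine the two masks
```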
You Li, Dewei Zhou, Fan Ma ... · IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-scre...
Yuqian Zhang, Donghua Yu, Zhengyuan Lin ... · arXiv
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency,...
Lokesh Kumar, Nirmesh Shah, Ashishkumar P. Gudmalwar ... · The 2nd International Workshop on Bodily Expressed Emotion Understanding (BEEU) at AAAI 2026
Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or li...
Yen-Ting Piao, Jay Chiehen Liao, Wei-Tang Chien ... · arXiv
While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framewo...
Candice R. Gerstner · arXiv
With the advancements in AI speech synthesis, it is easier than ever before to generate realistic audio in a target voice. One only needs a few seconds of reference audio from the target, quite literally putting words in the target person's mouth. This imposes a new set of forens...
Thursday, March 19, 2026
Yuchen Su, Shaoxin Zhong, Yonghua Zhu ... · arXiv
Puns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Within pun research, audio plays a central role in human communication alongside text and images, while datas...
Amandine Brunetto · arXiv
Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for ea...
Jihoon Jeong, Pooneh Mousavi, Mirco Ravanelli ... · arXiv
Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with expli...
Wednesday, March 18, 2026
Matthew Wiesner, Samuele Cornell, Alexander Polok ... · arXiv
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We t...
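For readers unfamiliar with the shuffle product: it is the set of all interleavings of two sequences that preserve each sequence's internal order, which is what makes it a natural model for overlapped speech from two speakers. A minimal sketch in plain Python (an illustration of the operation itself, not the paper's FSA-based implementation):

```python
from itertools import combinations

def shuffle_product(a, b):
    """All interleavings of sequences a and b that preserve
    the internal order of each sequence (the shuffle product)."""
    n, m = len(a), len(b)
    results = []
    # Choose which of the n + m output positions take elements of a;
    # the remaining positions take elements of b, both kept in order.
    for slots in combinations(range(n + m), n):
        slot_set = set(slots)
        out, ai, bi = [], iter(a), iter(b)
        for i in range(n + m):
            out.append(next(ai) if i in slot_set else next(bi))
        results.append(out)
    return results

# Two overlapping single-speaker word streams yield every ordering
# that keeps each speaker's own words in sequence:
print(shuffle_product(["hi"], ["ok", "bye"]))
# [['hi', 'ok', 'bye'], ['ok', 'hi', 'bye'], ['ok', 'bye', 'hi']]
```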
Zechang Xiong, Da Li, Kexin Tang ... · ICME 2026
Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or l...
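Gradient reweighting of the kind mentioned in this abstract typically damps the dominant, faster-converging modality so the weaker one can catch up. A schematic sketch (the specific scaling rule below is illustrative, not this paper's method):

```python
import torch

def modality_grad_scales(loss_audio, loss_video, alpha=1.0, floor=0.1):
    """Illustrative per-modality gradient scales: the modality with the
    lower loss (the dominant one) gets its gradients damped toward `floor`."""
    ratio = loss_audio.detach() / (loss_video.detach() + 1e-8)
    scale_audio = torch.clamp(ratio ** alpha, floor, 1.0)          # small if audio dominates
    scale_video = torch.clamp((1.0 / ratio) ** alpha, floor, 1.0)  # small if video dominates
    return scale_audio, scale_video

# Usage after loss.backward(), before optimizer.step():
# for p in audio_encoder.parameters():
#     if p.grad is not None:
#         p.grad.mul_(scale_audio)
```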
Yitian Gong, Botian Jiang, Yiwei Zhao ... · arXiv
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12....
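The "discrete tokens + autoregressive modeling" recipe reduces speech generation to next-token prediction over codec tokens. A generic sketch of the decoding loop (the model interface here is an assumption; the abstract does not specify MOSS-TTS's API):

```python
import torch

@torch.no_grad()
def ar_generate(model, prompt_tokens, n_new, temperature=1.0):
    """Sample `n_new` discrete audio tokens autoregressively.
    `model` is assumed to map a (1, seq_len) token tensor to
    (1, seq_len, vocab_size) next-token logits."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = model(torch.tensor([tokens]))[0, -1]
        probs = torch.softmax(logits / temperature, dim=-1)
        tokens.append(torch.multinomial(probs, 1).item())
    return tokens  # decode back to a waveform with the codec's decoder
```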
Aivo Olev, Tanel Alumäe · arXiv
Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, i...
Donghang Wu, Tianyu Zhang, Yuxin Li ... · arXiv
During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. I...
Vadim Rozenfeld, Bracha Laufer Goldshtein · arXiv
Reliable Sound Source Localization (SSL) plays an essential role in many downstream tasks, where informed decision making depends not only on accurate localization but also on the confidence in each estimate. This need for reliability becomes even more pronounced in challenging c...