Audio ML Papers

Last 7 Days (March 13 - March 20, 2026)

Subcategories: All (31) | Speech Synthesis (6) | Music Synthesis (4) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (1) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (15)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Tianyi Tan, Jiaxin Ye, Yuanming Zhang ... · arXiv
Whispered speech generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity gen...
#2 TOP PAPER (Score: 88)
Jingyu Lu, Yuhan Wang, Fan Zhuo ... · arXiv
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving pros...
#3 TOP PAPER (Score: 84)
Chuyang Chen, Bea Steers, Brian McFee ... · ICASSP 2026
We propose a benchmark for evaluating compositionality in audio representations. Audio compositionality refers to representing sound scenes in terms of constituent sources and attributes, and combining them systematically. While central to auditory perception, this property is la...
Wednesday, March 18, 2026
Zechang Xiong, Da Li, Kexin Tang ... · ICME 2026
Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or l...
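As an illustration of the gradient-reweighting family of fixes this abstract refers to, here is a minimal PyTorch sketch; the encoder modules and balance weights (`w_audio`, `w_visual`) are hypothetical placeholders, not the paper's method.

```python
import torch.nn as nn

def reweight_modality_grads(audio_encoder: nn.Module,
                            visual_encoder: nn.Module,
                            w_audio: float, w_visual: float) -> None:
    # Call after loss.backward() and before optimizer.step():
    # scale each modality branch's gradients so a faster-converging
    # branch does not dominate the shared update.
    for p in audio_encoder.parameters():
        if p.grad is not None:
            p.grad.mul_(w_audio)
    for p in visual_encoder.parameters():
        if p.grad is not None:
            p.grad.mul_(w_visual)
```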
Aivo Olev, Tanel Alumäe · arXiv
Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, i...
Tuesday, March 17, 2026
Houmin Sun, Zi Hu, Linxi Li ... · arXiv
Modern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework, embedding distinct information into stems u...
Kuan-Tang Huang, Chien-Chun Wang, Cheng-Yeh Yang ... · IEEE ICME 2026
The rapid proliferation of AI-Generated Content (AIGC) has necessitated robust metrics for perceptual quality assessment. However, automatic Mean Opinion Score (MOS) prediction models are often compromised by data scarcity, predisposing them to learn spurious correlations, such ...
Shubham Gupta, Adarsh Arigala, B. R. Dilleswari ... · IEEE ICASSP 2026
Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a ref...
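A toy sketch of the reference-conditioned detection setup TSD describes, assuming per-frame embeddings are already available (the embedding model, shapes, and threshold are illustrative, not the paper's):

```python
import numpy as np

def detect_target_frames(mixture_emb: np.ndarray,   # (T, D) per-frame embeddings
                         ref_emb: np.ndarray,       # (D,) reference clip embedding
                         threshold: float = 0.6) -> np.ndarray:
    # Score each frame of the mixture against the reference by cosine
    # similarity; frames above the threshold are flagged as containing
    # the target sound, giving a coarse detection/localization mask.
    ref = ref_emb / (np.linalg.norm(ref_emb) + 1e-8)
    frames = mixture_emb / (np.linalg.norm(mixture_emb, axis=1,
                                           keepdims=True) + 1e-8)
    return (frames @ ref) > threshold  # boolean mask over frames
```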
Alejandro Paredes La Torre · arXiv
Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I present four key contributions to advance this field. First, I release two high-quality, pa...
Monday, March 16, 2026
Tianyi Tan, Jiaxin Ye, Yuanming Zhang ... · arXiv
Whispered speech generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity gen...
Jingyu Lu, Yuhan Wang, Fan Zhuo ... · arXiv
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving pros...
Qinke Ni, Huan Liao, Dekun Chen ... · arXiv
While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that tr...
Jaesung Bae, Xiuwen Zheng, Minje Kim ... · arXiv
Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a th...
Ro-hoon Oh, Jihwan Seol, Bugeun Kim · arXiv
Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX ...
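The varying-overlap evaluation condition is easy to picture with a small NumPy sketch for constructing mixtures at a chosen overlap ratio (layout and names are ours, not the VorTEX protocol):

```python
import numpy as np

def mix_with_overlap(target: np.ndarray, interferer: np.ndarray,
                     overlap_ratio: float) -> np.ndarray:
    # Overlap the interferer with only the last `overlap_ratio`
    # fraction of the target, so overlap_ratio=1.0 reproduces the
    # fully-overlapped setting most prior work assumes.
    n_overlap = int(len(target) * overlap_ratio)
    start = len(target) - n_overlap
    mix = target.astype(float).copy()
    seg = interferer[:n_overlap]
    mix[start:start + len(seg)] += seg
    return mix
```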
Pengjun Fang, Yingqing He, Yazhou Xing ... · ICLR 2026
Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coars...
Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang ... · arXiv
Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially whe...
Changda Chen, Yichen Yang, Wei Liu ... · ICASSP 2026
Extracting a target source from underdetermined mixtures is challenging for beamforming approaches. Recently proposed time-frequency-bin-wise switching (TFS) and linear combination (TFLC) strategies mitigate this by combining multiple beamformers in each time-frequency (TF) bin a...
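A minimal NumPy sketch of the bin-wise switching idea, using minimum output magnitude as a stand-in selection criterion (the actual TFS/TFLC criteria are more principled than this):

```python
import numpy as np

def tf_bin_switch(beamformer_outputs: np.ndarray) -> np.ndarray:
    # Given K beamformer outputs as complex STFTs of shape (K, F, T),
    # pick, in each time-frequency bin, the output with the smallest
    # magnitude, a crude proxy for best interference suppression.
    mags = np.abs(beamformer_outputs)        # (K, F, T)
    best = np.argmin(mags, axis=0)           # (F, T) index of chosen beamformer
    f_idx, t_idx = np.indices(best.shape)
    return beamformer_outputs[best, f_idx, t_idx]  # (F, T) switched output
```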
Sunday, March 15, 2026
Wen-Chin Huang, Nicholas Sanders, Erica Cooper · arXiv
We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, especially across non-standard speech like accented speech. Th...
Qibing Bai, Yuhan Du, Tom Ko ... · arXiv
Existing accent normalization methods do not typically offer control over accent strength, yet many applications, such as language learning and dubbing, require tunable accent retention. We propose DLM-AN, a controllable accent normalization system built on masked discrete diffusio...
Bingzhou Li, Tao Huang · arXiv
Omnimodal large language models (OmniLLMs) jointly process audio and visual streams, but the resulting long multimodal token sequences make inference prohibitively expensive. Existing compression methods typically rely on fixed window partitioning and attention-based pruning, whi...
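For context, attention-based token pruning of the kind this abstract critiques can be sketched in a few lines of PyTorch (token shapes and the score definition are illustrative assumptions):

```python
import torch

def prune_tokens(tokens: torch.Tensor,         # (T, D) multimodal token sequence
                 attn_received: torch.Tensor,  # (T,) attention mass each token receives
                 keep_ratio: float = 0.5) -> torch.Tensor:
    # Keep the top-k tokens by received attention, preserving their
    # original temporal order so downstream positions stay coherent.
    k = max(1, int(tokens.shape[0] * keep_ratio))
    keep = torch.topk(attn_received, k).indices.sort().values
    return tokens[keep]  # (k, D) compressed sequence
```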
Izzet Turkalp Akbasli, Oguzhan Serin · arXiv
Background: Respiratory diseases are a leading cause of childhood morbidity and mortality, yet lung auscultation remains subjective and limited by inter-listener variability, particularly in pediatric populations. Existing AI approaches are further constrained by small datasets a...
Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki ... · arXiv
Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersection...
Lok-Lam Ieong, Chia-Chien Chen, Chih-Kai Yang ... · arXiv
Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We in...
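Inference-time steering is typically implemented by shifting hidden activations along a fixed direction; a generic PyTorch sketch of that mechanism (not necessarily the interventions studied in the paper):

```python
import torch

def add_steering_hook(layer: torch.nn.Module,
                      steering_vec: torch.Tensor,  # (hidden_dim,)
                      alpha: float = 1.0):
    # Register a forward hook that adds a fixed direction to the
    # layer's hidden states at inference time; returning a value from
    # a forward hook replaces the layer's output.
    def hook(module, inputs, output):
        if isinstance(output, tuple):
            return (output[0] + alpha * steering_vec.to(output[0].dtype),) + output[1:]
        return output + alpha * steering_vec.to(output.dtype)
    return layer.register_forward_hook(hook)  # call .remove() to undo
```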
Saturday, March 14, 2026
Chuyang Chen, Bea Steers, Brian McFee ... · ICASSP 2026
We propose a benchmark for evaluating compositionality in audio representations. Audio compositionality refers to representing sound scenes in terms of constituent sources and attributes, and combining them systematically. While central to auditory perception, this property is la...
Wei-Chih Chen, Chien-yu Huang, Hung-yi Lee · arXiv
Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehe...
Jiahui Wu · arXiv
Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead...
Chih-Ning Chen, Jen-Cheng Hou, Hsin-Min Wang ... · arXiv
In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimi...
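For reference, the SI-SNR objective named above has a standard closed form; a minimal NumPy version (variable names are ours):

```python
import numpy as np

def si_snr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    # Scale-Invariant Signal-to-Noise Ratio in dB. Both signals are
    # zero-meaned, and the reference is rescaled by the projection of
    # the estimate onto it, so the metric ignores gain differences.
    est = est - est.mean()
    ref = ref - ref.mean()
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps) /
                         (np.dot(e_noise, e_noise) + eps))
```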
Soham Ray, Keshav Dhandhania, Victor Barres ... · arXiv
Full-duplex voice agents (systems that listen and speak simultaneously) are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce τ-voice, a benchmark for evaluating voice agents ...
Jiabao Ai, Minghui Zhao, Anton Ragni · arXiv
Diffusion and flow-matching TTS face a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer from alignment instability. We ...
Friday, March 13, 2026
Jaden Pieper, Stephen D. Voran · arXiv
Objective estimators of multimedia quality are often judged by comparing estimates with subjective "truth data," most often via Pearson correlation coefficient (PCC) or mean-squared error (MSE). But subjective test results contain noise, so striving for a PCC of 1.0 or an MSE of ...
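The two agreement measures in question, computed against subjective MOS labels, are one-liners; a minimal NumPy sketch (names are ours):

```python
import numpy as np

def pcc_and_mse(estimates, mos):
    # Pearson correlation and mean-squared error between objective
    # quality estimates and subjective MOS labels. Listener noise in
    # the labels means even a perfect estimator cannot reach PCC = 1.0
    # or MSE = 0.0, which is the abstract's point.
    est = np.asarray(estimates, dtype=float)
    ref = np.asarray(mos, dtype=float)
    pcc = np.corrcoef(est, ref)[0, 1]
    mse = np.mean((est - ref) ** 2)
    return pcc, mse
```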
Ridwan Arefeen, Xiaoxiao Miao, Rong Tong ... · arXiv
Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encode...
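A schematic of the parallel-encoder fusion described above, with concatenation as the fusion operator; dimensions, pooling, and layer choices are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DualStreamAttacker(nn.Module):
    # Parallel encoders over spectral and self-supervised (SSL)
    # features, fused by concatenation before a speaker-embedding head.
    def __init__(self, spec_dim: int, ssl_dim: int,
                 hidden: int = 256, emb: int = 192):
        super().__init__()
        self.spec_enc = nn.Sequential(nn.Linear(spec_dim, hidden), nn.ReLU())
        self.ssl_enc = nn.Sequential(nn.Linear(ssl_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, emb)

    def forward(self, spec_feats, ssl_feats):  # both (B, T, dim)
        h = torch.cat([self.spec_enc(spec_feats).mean(1),
                       self.ssl_enc(ssl_feats).mean(1)], dim=-1)
        return self.head(h)  # (B, emb) speaker embedding for re-identification
```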
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze · INTERSPEECH'26
Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance...
Gabriel Pîrlogeanu, Adriana Stan, Horia Cucu · ICASSP 2026
Audio deepfake model attribution aims to mitigate the misuse of synthetic speech by identifying the source model responsible for generating a given audio sample, enabling accountability and informing vendors. The task is challenging, but self-supervised learning (SSL)-derived aco...
Mengjie Zhao, Lianbo Liu, Yusuke Fujita ... · arXiv
SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substant...