Audio ML Papers

Last 7 Days (March 11 - March 18, 2026)

Subcategories: All (42) | Speech Synthesis (9) | Music Synthesis (5) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (1) | ASR (7) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (17)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 92)
Tianyi Tan, Jiaxin Ye, Yuanming Zhang ... · arXiv
Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity gen...
#2 TOP PAPER (Score: 88)
Yaofeng Su, Yuming Li, Zeyue Xue ... · arXiv
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidire...
#3 TOP PAPER (Score: 88)
Jingyu Lu, Yuhan Wang, Fan Zhuo ... · arXiv
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving pros...
Monday, March 16, 2026
Tianyi Tan, Jiaxin Ye, Yuanming Zhang ... · arXiv
Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity gen...
Jingyu Lu, Yuhan Wang, Fan Zhuo ... · arXiv
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving pros...
Qinke Ni, Huan Liao, Dekun Chen ... · arXiv
While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluations lack standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that tr...
Ro-hoon Oh, Jihwan Seol, Bugeun Kim · arXiv
Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX ...
Pengjun Fang, Yingqing He, Yazhou Xing ... · ICLR 2026
Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coars...
Lingsi Zhu, Yuefeng Zou, Yunxiang Zhang ... · arXiv
Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially whe...
Changda Chen, Yichen Yang, Wei Liu ... · ICASSP 2026
Extracting a target source from underdetermined mixtures is challenging for beamforming approaches. Recently proposed time-frequency-bin-wise switching (TFS) and linear combination (TFLC) strategies mitigate this by combining multiple beamformers in each time-frequency (TF) bin a...
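For readers new to the combination strategies named here, below is a minimal numpy sketch of per-TF-bin switching and linear combination of beamformer outputs. The minimum-magnitude selection rule and the externally supplied weights are illustrative assumptions, not the criteria proposed in the paper.

```python
import numpy as np

def tf_switching(bf_outputs: np.ndarray) -> np.ndarray:
    """Time-frequency-bin-wise switching (TFS): at each TF bin, keep the
    output of exactly one beamformer. Selection here uses minimum output
    magnitude, a common minimum-power heuristic (the paper's criterion
    may be learned or oracle-based).

    bf_outputs: complex STFTs from K beamformers, shape (K, F, T).
    Returns one (F, T) complex STFT.
    """
    idx = np.argmin(np.abs(bf_outputs), axis=0)            # (F, T) selector
    return np.take_along_axis(bf_outputs, idx[None], axis=0)[0]

def tf_linear_combination(bf_outputs: np.ndarray,
                          weights: np.ndarray) -> np.ndarray:
    """TF-bin-wise linear combination (TFLC): per-bin weighted sum of the
    K beamformer outputs; weights has shape (K, F, T)."""
    return np.sum(weights * bf_outputs, axis=0)
```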
Sunday, March 15, 2026
Wen-Chin Huang, Nicholas Sanders, Erica Cooper · arXiv
We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, especially across non-standard speech like accented speech. Th...
Qibing Bai, Yuhan Du, Tom Ko ... · arXiv
Existing accent normalization methods do not typically offer control over accent strength, yet many applications, such as language learning and dubbing, require tunable accent retention. We propose DLM-AN, a controllable accent normalization system built on masked discrete diffusio...
Lok-Lam Ieong, Chia-Chien Chen, Chih-Kai Yang ... · arXiv
Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We in...
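The snippet cuts off before the method details, but "inference-time model steering" commonly means adding a direction vector to intermediate activations during decoding. A generic, hedged PyTorch sketch of that idea follows; the layer choice, direction vector, and scale are assumptions, not the paper's recipe.

```python
import torch

def add_steering_hook(layer: torch.nn.Module,
                      direction: torch.Tensor,
                      alpha: float = 4.0):
    """Register a forward hook that shifts the layer's hidden states by a
    fixed, normalized direction at inference time (training-free)."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return tuples; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    return layer.register_forward_hook(hook)  # call .remove() to undo
```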
Saturday, March 14, 2026
Wei-Chih Chen, Chien-yu Huang, Hung-yi Lee · arXiv
Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehe...
Jiahui Wu · arXiv
Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead...
Chih-Ning Chen, Jen-Cheng Hou, Hsin-Min Wang ... · arXiv
In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimi...
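For reference, SI-SNR, one of the objectives under discussion, has a standard closed form: project the zero-mean estimate onto the reference and compare the energy of the projection to that of the residual. A minimal PyTorch version (function name and shapes are ours):

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor,
           eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB between estimated and reference
    waveforms of shape (..., samples)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference (scale-invariant target).
    proj = (est * ref).sum(-1, keepdim=True) * ref \
        / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps))
```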
Soham Ray, Keshav Dhandhania, Victor Barres ... · arXiv
Full-duplex voice agents (systems that listen and speak simultaneously) are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $\tau$-voice, a benchmark for evaluating voice agents ...
Jiabao Ai, Minghui Zhao, Anton Ragni · arXiv
Diffusion and flow matching TTS faces a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer alignment instability. We ...
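As background for the tension described here, the textbook conditional flow-matching objective regresses a constant velocity along a straight noise-to-data path; single- and two-stage TTS variants differ mainly in the conditioning and alignment fed to the network. A minimal sketch, in which the model signature is an assumption:

```python
import torch
import torch.nn.functional as F

def conditional_flow_matching_loss(model, x1: torch.Tensor,
                                   cond: torch.Tensor) -> torch.Tensor:
    """Standard linear-path conditional flow matching: interpolate noise
    x0 toward data x1 and regress the constant velocity x1 - x0."""
    x0 = torch.randn_like(x1)                                  # noise sample
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1),
                   device=x1.device)                           # per-example time
    xt = (1 - t) * x0 + t * x1                                 # point on path
    v_target = x1 - x0                                         # path velocity
    v_pred = model(xt, t.flatten(), cond)                      # assumed signature
    return F.mse_loss(v_pred, v_target)
```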
Friday, March 13, 2026
Jaden Pieper, Stephen D. Voran · arXiv
Objective estimators of multimedia quality are often judged by comparing estimates with subjective "truth data," most often via Pearson correlation coefficient (PCC) or mean-squared error (MSE). But subjective test results contain noise, so striving for a PCC of 1.0 or an MSE of ...
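For concreteness, the two agreement measures in question are computed as below; the abstract's point is that because the subjective scores are themselves noisy, driving PCC to 1.0 or MSE to 0 against them is not a meaningful target. A toy numpy sketch:

```python
import numpy as np

def pcc_and_mse(estimates: np.ndarray, subjective: np.ndarray):
    """Pearson correlation and mean-squared error between objective
    quality estimates and per-condition subjective means (illustrative)."""
    pcc = np.corrcoef(estimates, subjective)[0, 1]
    mse = np.mean((estimates - subjective) ** 2)
    return pcc, mse
```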
Ridwan Arefeen, Xiaoxiao Miao, Rong Tong ... · arXiv
Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encode...
Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze · INTERSPEECH'26
Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance...
Gabriel Pîrlogeanu, Adriana Stan, Horia Cucu · ICASSP 2026
Audio deepfake model attribution aims to mitigate the misuse of synthetic speech by identifying the source model responsible for generating a given audio sample, enabling accountability and informing vendors. The task is challenging, but self-supervised learning (SSL)-derived aco...
Mengjie Zhao, Lianbo Liu, Yusuke Fujita ... · arXiv
SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substant...
Thursday, March 12, 2026
Yaofeng Su, Yuming Li, Zeyue Xue ... · arXiv
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidire...
Kele Xu, Yifan Wang, Ming Feng ... · arXiv
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradig...
Umberto Cappellazzo, Stavros Petridis, Maja Pantic · arXiv
Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AV...
Yongjoon Lee, Jung-Woo Choi · arXiv
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote gen...
Xiquan Li, Junxi Liu, Wenxi Chen ... · arXiv
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direc...
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim ... · ICLR 2026
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have recently gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding l...
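The final stage such vocoders share is fixed and cheap: the network predicts a complex spectrogram and a single inverse STFT recovers the waveform, avoiding learned upsampling layers. A minimal PyTorch sketch of that head, with illustrative shapes and FFT settings:

```python
import torch

def istft_head(mag: torch.Tensor, phase: torch.Tensor,
               n_fft: int = 1024, hop: int = 256) -> torch.Tensor:
    """Turn predicted magnitude and phase, each of shape
    (batch, n_fft // 2 + 1, frames), into a waveform via inverse STFT."""
    spec = torch.polar(mag, phase)                     # complex spectrogram
    window = torch.hann_window(n_fft, device=mag.device)
    return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)
```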
Joonyong Park, Jerry Li · arXiv
Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliabl...
Xiangyuan Xue, Jiajun Lu, Yan Gao ... · arXiv
Speech Emotion Captioning (SEC) leverages large audio-language models to generate rich, context-aware affective descriptions from speech. However, real-world deployment remains challenging due to the substantial computational demands on resource-constrained edge devices and the p...
Wednesday, March 11, 2026
Hillary Mutisya, John Mugane · arXiv
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised finetuning. With 20,000 labeled samples, we a...
Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez ... · arXiv
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-a...
Duojia Li, Shuhan Zhang, Zihan Qian ... · arXiv
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases l...
Kaituo Xu, Yan Jia, Kai Huang ... · arXiv
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All m...
Jing Peng, Ziyi Chen, Haoyu Li ... · arXiv
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritiz...
Yuanbo Hou, Yanru Wu, Qiaoqiao Ren ... · arXiv
Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to se...
Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang ... · arXiv
The Mean Opinion Score (MOS) serves as the standard metric for speech quality assessment, yet biases in human annotations remain underexplored. We conduct the first systematic analysis of gender bias in MOS, revealing that male listeners consistently assign higher scores than fem...
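An illustrative version of the kind of comparison such an analysis rests on: split MOS ratings by listener gender and test the mean difference. This is a hedged sketch, not the paper's protocol, which presumably also controls for systems, items, and raters.

```python
import numpy as np
from scipy import stats

def listener_gender_gap(scores_male: np.ndarray, scores_female: np.ndarray):
    """Mean MOS gap between listener groups plus a Welch t-test on the
    difference (toy analysis; per-rater aggregation is omitted)."""
    gap = scores_male.mean() - scores_female.mean()
    t, p = stats.ttest_ind(scores_male, scores_female, equal_var=False)
    return gap, t, p
```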
Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh ... · arXiv
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing counterm...
Yinfeng Xia, Jian Tang, Junfeng Hou ... · arXiv
Although the deep integration of the Automatic Speech Recognition (ASR) system with Large Language Models (LLMs) has significantly improved accuracy, the deployment of such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unifie...
Nolan Chan, Timmy Gang, Yongqian Wang ... · ICASSP 2026
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio generation (V2A) models, incorporating key adaptations to effectively align generated audio with human preferences. Our approach incorporates three core in...
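As background, the standard DPO objective that such adaptations start from scores a (preferred, rejected) pair of generations against a frozen reference policy; how per-sample log-probabilities are defined for a flow-based generator is exactly the nontrivial adaptation and is not shown here. A minimal PyTorch sketch of the vanilla loss:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Vanilla DPO: maximize the policy's log-probability margin on the
    preferred (w) over the rejected (l) sample, relative to a frozen
    reference model; beta controls the implicit KL penalty."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()
```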
Yujie Liao, Xuelong Geng, Hongfei Xue ... · arXiv
Recent advancements in Speech Large Language Models have significantly enhanced multi-dimensional speech understanding. However, the majority of high-performance frameworks are predominantly optimized for GPU-centric ecosystems and proprietary backbones, creating a significant ga...
Artem Dvirniak, Evgeny Kushnir, Dmitrii Tarasov ... · arXiv
The modern generative audio models can be used by an adversary in an unlawful manner, specifically, to impersonate other people to gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods started to evolve. Unfortunately, current SDD met...
Hao Shi, Yusuke Fujita, Roman Koshkin ... · arXiv
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encode...
Tianyu Xu, Sieun Kim, Qianhui Zheng ... · arXiv
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-...