Audio ML Papers

Last 7 Days (March 10 - March 17, 2026)

Subcategories: All (35) | Speech Synthesis (6) | Music Synthesis (2) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (3) | ASR (6) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (13)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 88)
Yaofeng Su, Yuming Li, Zeyue Xue ... · arXiv
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidire...
#2 TOP PAPER (Score: 86)
Aviad Dahan, Moran Yanuka, Noa Kraicer ... · arXiv
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference record...
#3 TOP PAPER (Score: 86)
Kele Xu, Yifan Wang, Ming Feng ... · arXiv
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradig...
Friday, March 13, 2026
Jaden Pieper, Stephen D. Voran · arXiv
Objective estimators of multimedia quality are often judged by comparing estimates with subjective "truth data," most often via Pearson correlation coefficient (PCC) or mean-squared error (MSE). But subjective test results contain noise, so striving for a PCC of 1.0 or an MSE of ...
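The two agreement metrics named in this abstract are easy to make concrete. A minimal sketch, assuming NumPy and using made-up illustrative numbers (not the paper's data):

```python
import numpy as np

# Hypothetical example: compare objective quality estimates against
# (noisy) subjective MOS "truth data" via the two agreement metrics
# named in the abstract. All numbers below are invented for illustration.
subjective = np.array([3.1, 4.2, 2.5, 3.8, 4.6])   # listener MOS ratings
estimates  = np.array([3.0, 4.0, 2.7, 3.9, 4.4])   # estimator outputs

pcc = np.corrcoef(subjective, estimates)[0, 1]      # Pearson correlation
mse = np.mean((subjective - estimates) ** 2)        # mean-squared error
print(f"PCC={pcc:.3f}  MSE={mse:.3f}")
```

The paper's point is that because the subjective ratings themselves are noisy, a PCC of exactly 1.0 (or an MSE of 0) against them is not the right target.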
Ridwan Arefeen, Xiaoxiao Miao, Rong Tong ... · arXiv
Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encode...
Mengjie Zhao, Lianbo Liu, Yusuke Fujita ... · arXiv
SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substant...
Thursday, March 12, 2026
Yaofeng Su, Yuming Li, Zeyue Xue ... · arXiv
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidire...
Kele Xu, Yifan Wang, Ming Feng ... · arXiv
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradig...
Yongjoon Lee, Jung-Woo Choi · arXiv
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote gen...
Xiquan Li, Junxi Liu, Wenxi Chen ... · arXiv
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direc...
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim ... · ICLR 2026
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have recently gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding l...
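The iSTFT synthesis step described here can be sketched in a few lines. This is my own illustration with SciPy, not the paper's vocoder: a real iSTFT-based model would predict the complex spectrogram `Z` with a neural network, whereas here it is simply the STFT of a test tone, so the round trip should reconstruct the input almost exactly.

```python
import numpy as np
from scipy.signal import stft, istft

# Stand-in for a model's "predicted" complex spectrogram: the STFT of
# a 1-second 440 Hz tone. An iSTFT vocoder replaces learned upsampling
# layers with this fixed inverse transform at the output.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)

_, _, Z = stft(x, fs=fs, nperseg=512)      # complex-valued spectrogram
_, x_rec = istft(Z, fs=fs, nperseg=512)    # waveform synthesized via iSTFT

err = np.max(np.abs(x - x_rec[:len(x)]))
print(f"max reconstruction error: {err:.2e}")
```

With a Hann window and 50% overlap the STFT/iSTFT pair satisfies the COLA condition, so reconstruction is exact up to floating-point error.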
Joonyong Park, Jerry Li · arXiv
Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliabl...
Xiangyuan Xue, Jiajun Lu, Yan Gao ... · arXiv
Speech Emotion Captioning (SEC) leverages large audio-language models to generate rich, context-aware affective descriptions from speech. However, real-world deployment remains challenging due to the substantial computational demands on resource-constrained edge devices and the p...
Wednesday, March 11, 2026
Hillary Mutisya, John Mugane · arXiv
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised finetuning. With 20,000 labeled samples, we a...
Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez ... · arXiv
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-a...
Duojia Li, Shuhan Zhang, Zihan Qian ... · arXiv
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases l...
Kaituo Xu, Yan Jia, Kai Huang ... · arXiv
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All m...
Jing Peng, Ziyi Chen, Haoyu Li ... · arXiv
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritiz...
Yuanbo Hou, Yanru Wu, Qiaoqiao Ren ... · arXiv
Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to se...
Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang ... · arXiv
The Mean Opinion Score (MOS) serves as the standard metric for speech quality assessment, yet biases in human annotations remain underexplored. We conduct the first systematic analysis of gender bias in MOS, revealing that male listeners consistently assign higher scores than fem...
Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh ... · arXiv
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing counterm...
Yinfeng Xia, Jian Tang, Junfeng Hou ... · arXiv
Although the deep integration of the Automatic Speech Recognition (ASR) system with Large Language Models (LLMs) has significantly improved accuracy, the deployment of such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unifie...
Nolan Chan, Timmy Gang, Yongqian Wang ... · ICASSP 2026
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio generation (V2A) models, incorporating key adaptations to effectively align generated audio with human preferences. Our approach incorporates three core in...
Yujie Liao, Xuelong Geng, Hongfei Xue ... · arXiv
Recent advancements in Speech Large Language Models have significantly enhanced multi-dimensional speech understanding. However, the majority of high-performance frameworks are predominantly optimized for GPU centric ecosystems and proprietary backbones, creating a significant ga...
Artem Dvirniak, Evgeny Kushnir, Dmitrii Tarasov ... · arXiv
The modern generative audio models can be used by an adversary in an unlawful manner, specifically, to impersonate other people to gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods started to evolve. Unfortunately, current SDD met...
Hao Shi, Yusuke Fujita, Roman Koshkin ... · arXiv
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encode...
Tianyu Xu, Sieun Kim, Qianhui Zheng ... · arXiv
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-...
Tuesday, March 10, 2026
Aviad Dahan, Moran Yanuka, Noa Kraicer ... · arXiv
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference record...
Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang ... · arXiv
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges o...
Haoyuan Yang, Mu Yang, Jiamin Xie ... · arXiv
Recent advances in zero-shot voice conversion have exhibited potential in emotion control, yet the performance is suboptimal or inconsistent due to their limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion back...
Chih-Kai Yang, Yun-Shao Tsai, Yu-Kai Guo ... · arXiv
While multi-audio understanding is critical for large audio-language models (LALMs), it remains underexplored. We introduce MUGEN, a comprehensive benchmark evaluating this capability across speech, general audio, and music. Our experiments reveal consistent weaknesses in multi-a...
Dehua Tao, Xuan Luo, Daxin Tan ... · arXiv
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for ...
Xiaobin Rong, Jun Gao, Zheng Wang ... · arXiv
Achieving high perceptual quality without hallucination remains a challenge in generative speech enhancement (SE). A representative approach, PASE, is robust to hallucination but has limited perceptual quality under adverse conditions. We propose StuPASE, built upon PASE to achie...
Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak · arXiv
Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors the foundational Audio Large Language Model for multidimensional reasoning, ...
Soumya Dutta · arXiv
Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human e...
Laya Iyer, Angelina Wang, Sanmi Koyejo · EACL 2026
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech re...
Rui Wang, Zhifei Zhang, Yu Gao ... · arXiv
Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimi...