Audio ML Papers

Last 7 Days (March 07 - March 14, 2026)

Subcategories: All (41) | Speech Synthesis (8) | Music Synthesis (2) | Ambient Synthesis (2) | Quality Assessment (2) | Enhancement (3) | ASR (11) | Other (13)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 90)
Kele Xu, Yifan Wang, Ming Feng ... · arXiv
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradig...
#2 TOP PAPER (Score: 84)
Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang ... · arXiv
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges o...
#3 TOP PAPER (Score: 84)
Yongjoon Lee, Jung-Woo Choi · arXiv
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote gen...
Thursday, March 12, 2026
Kele Xu, Yifan Wang, Ming Feng ... · arXiv
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradig...
Yongjoon Lee, Jung-Woo Choi · arXiv
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote gen...
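As a rough illustration of the objective family RAF belongs to, the sketch below shows the standard relativistic GAN losses (Jolicoeur-Martineau, 2018), in which the critic scores real audio relative to generated audio rather than in isolation. The paper's exact formulation is not given in the snippet above, so treat this as an assumed baseline, not the proposed method.

    # Minimal sketch of a standard relativistic discriminator/generator loss
    # pair; RAF's exact objective is not shown in the snippet, so this is an
    # assumed baseline formulation, not the paper's method.
    import torch
    import torch.nn.functional as F

    def relativistic_d_loss(real_logits, fake_logits):
        # The critic should judge real clips as "more realistic than" fakes.
        return F.binary_cross_entropy_with_logits(
            real_logits - fake_logits, torch.ones_like(real_logits))

    def relativistic_g_loss(real_logits, fake_logits):
        # The generator tries to flip that comparison.
        return F.binary_cross_entropy_with_logits(
            fake_logits - real_logits, torch.ones_like(fake_logits))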
Xiquan Li, Junxi Liu, Wenxi Chen ... · arXiv
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direc...
Hyung-Seok Oh, Deok-Hyeon Cho, Seung-Bin Kim ... · ICLR 2026
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding l...
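For readers unfamiliar with the decoding step described here, a minimal sketch follows: a network predicts a complex spectrogram (here split into magnitude and phase), and the waveform is recovered with a single inverse STFT instead of learned upsampling. The FFT and hop sizes below are illustrative assumptions, not the paper's settings.

    import torch

    def istft_head(mag, phase, n_fft=1024, hop=256):
        # mag, phase: (batch, n_fft // 2 + 1, frames) predicted by the model.
        spec = torch.polar(mag, phase)  # complex spectrogram: mag * exp(i*phase)
        window = torch.hann_window(n_fft, device=mag.device)
        return torch.istft(spec, n_fft=n_fft, hop_length=hop, window=window)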
Wednesday, March 11, 2026
Hillary Mutisya, John Mugane · arXiv
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised finetuning. With 20,000 labeled samples, we a...
Duojia Li, Shuhan Zhang, Zihan Qian ... · arXiv
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases l...
George Saon, Samuel Thomas, Takashi Fukuda ... · arXiv
We propose self-speculative decoding for speech-aware LLMs by using the CTC encoder as a draft model to accelerate auto-regressive (AR) inference and improve ASR accuracy. Our three-step procedure works as follows: (1) if the frame entropies of the CTC output distributions are be...
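The snippet cuts off mid-condition, but step (1) suggests an entropy test on the CTC posteriors before a greedy draft is accepted. The sketch below is a hedged guess at that drafting logic, with the threshold and fallback behavior assumed rather than taken from the paper.

    import torch

    def ctc_draft(log_probs, blank=0, max_entropy=1.0):
        # log_probs: (frames, vocab) log-softmax output of the CTC encoder.
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-frame entropy (nats)
        if entropy.mean() > max_entropy:
            return None  # too uncertain: fall back to ordinary AR decoding
        tokens = log_probs.argmax(dim=-1)          # greedy per-frame labels
        tokens = torch.unique_consecutive(tokens)  # collapse repeats
        return tokens[tokens != blank]             # blank-free draft for AR verification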
Nolan Chan, Timmy Gang, Yongqian Wang ... · ICASSP 2026
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio generation (V2A) models, incorporating key adaptations to effectively align generated audio with human preferences. Our approach incorporates three core in...
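For reference, the vanilla DPO objective that V2A-DPO presumably adapts is written out below; how the paper extends it to flow-based V2A models is not in the snippet, so the sketch covers only the generic preference loss.

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        # logp_*: policy log-likelihoods of the preferred (w) and rejected (l)
        # samples; ref_logp_*: the same under a frozen reference model.
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -F.logsigmoid(margin).mean()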
Anupam Purwar, Aditya Choudhary · arXiv
Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics. Our experiments involving fine-tuning of the Languag...
Kaituo Xu, Yan Jia, Kai Huang ... · arXiv
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All m...
Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh ... · arXiv
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing counterm...
Yinfeng Xia, Jian Tang, Junfeng Hou ... · arXiv
Although the deep integration of Automatic Speech Recognition (ASR) systems with Large Language Models (LLMs) has significantly improved accuracy, the deployment of such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unifie...
Tianyu Xu, Sieun Kim, Qianhui Zheng ... · arXiv
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-...
Jing Peng, Ziyi Chen, Haoyu Li ... · arXiv
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritiz...
Hao Shi, Yusuke Fujita, Roman Koshkin ... · arXiv
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encode...
Tuesday, March 10, 2026
Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang ... · arXiv
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges o...
Soumya Dutta · arXiv
Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human e...
Haoyuan Yang, Mu Yang, Jiamin Xie ... · arXiv
Recent advances in zero-shot voice conversion have exhibited potential in emotion control, yet the performance is suboptimal or inconsistent due to their limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion back...
Aviad Dahan, Moran Yanuka, Noa Kraicer ... · arXiv
Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference record...
Robin Doerfler, Lonce Wyse · arXiv
Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal s...
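A toy version of the pulse-based view makes the contrast with harmonic synthesis concrete: below, an engine cycle is rendered as a train of decaying exhaust pulses at the firing rate. The pulse shape and rates are made-up values for illustration, not the paper's model.

    import numpy as np

    def pulse_train(rpm=3000.0, cylinders=4, sr=16000, seconds=1.0):
        firing_rate = rpm / 60.0 * cylinders / 2.0  # 4-stroke: each cylinder fires every 2 revs
        out = np.zeros(int(sr * seconds))
        pulse = np.exp(-np.linspace(0.0, 8.0, int(0.01 * sr)))  # 10 ms decaying pulse
        for onset in np.arange(0, len(out) - len(pulse), sr / firing_rate).astype(int):
            out[onset:onset + len(pulse)] += pulse
        return out / np.max(np.abs(out))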
Dehua Tao, Xuan Luo, Daxin Tan ... · arXiv
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for ...
Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak · arXiv
Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors the foundational Audio Large Language Model for multidimensional reasoning, ...
Laya Iyer, Angelina Wang, Sanmi Koyejo · EACL 2026
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech re...
Rui Wang, Zhifei Zhang, Yu Gao ... · arXiv
Keyword spotting (KWS) is crucial for many speech-driven applications, but robust KWS in noisy environments remains challenging. Conventional systems often rely on single-channel inputs and a cascaded pipeline separating front-end enhancement from KWS. This precludes joint optimi...
Monday, March 09, 2026
Pol Buitrago, Pol Gàlvez, Oriol Pareras ... · arXiv
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resourc...
Avihu Dekel, Samuel Thomas, Takashi Fukada ... · arXiv
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully ...
Bence Mark Halpern, Thomas Tienkamp, Defne Abur ... · arXiv
Automatic speech intelligibility assessment is crucial for monitoring speech disorders and therapy efficacy. However, existing methods are difficult to compare: research is fragmented across private datasets with inconsistent protocols. We introduce PathBench, a unified benchmark...
Nikita Kuzmin, Tao Zhong, Jiajun Deng ... · arXiv
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states o...
Andong Li, Tong Lei, Zhihang Sun ... · arXiv
Although deep neural networks have facilitated significant progress of neural vocoders in recent years, they usually suffer from intrinsic challenges like opaque modeling, inflexible retraining under different input configurations, and parameter-performance trade-off. These inher...
Ayush Barik, Sofia Stoica, Nikhil Sarda ... · arXiv
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusi...
Henry Li Xinyuan, Zexin Cai, Lin Zhang ... · arXiv
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice...
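The snippet does not give USCF's actual construction, but the general "project out a speaker subspace" idea behind such linear factorizations can be sketched as below, with the speaker directions estimated (purely as an assumption) from the top principal components of per-speaker mean features.

    import numpy as np

    def speaker_suppressing_projection(feats, speaker_ids, rank=8):
        # feats: (frames, dim) features; speaker_ids: (frames,) integer labels.
        means = np.stack([feats[speaker_ids == s].mean(axis=0)
                          for s in np.unique(speaker_ids)])
        means -= means.mean(axis=0)  # center the per-speaker means
        _, _, vt = np.linalg.svd(means, full_matrices=False)
        basis = vt[:rank]            # top speaker-variability directions
        # Projection onto the orthogonal complement of the speaker subspace;
        # apply as feats @ P.T to suppress timbre while keeping content.
        return np.eye(feats.shape[1]) - basis.T @ basis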
Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan ... · arXiv
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Addit...
Zihao Fang, Yingda Shen, Zifan Guan ... · arXiv
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic re...
Shangeth Rajaa · arXiv
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, whic...
Lucas Rakotoarivony · arXiv
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In p...
Phillip Long, Zachary Novack, Chris Donahue · arXiv
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchm...
Sunday, March 08, 2026
Longbiao Cheng, Shih-Chii Liu · ICASSP 2026
Recent studies have shown that post-deployment adaptation can improve the robustness of speech enhancement models in unseen noise conditions. However, existing methods often incur prohibitive computational and memory costs, limiting their suitability for on-device deployment. In ...
Saturday, March 07, 2026
Zahra Mansour, Verena Uslar, Dirk Weyhe ... · arXiv
Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation. This leads to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS and enable aut...
Wenjie Tian, Mingchen Shao, Bingshen Mu ... · arXiv
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle such CAVSR (AVSR inclu...