Audio ML Papers

Last 7 Days (March 05 - March 12, 2026)

Subcategories: All (40) | Speech Synthesis (7) | Music Synthesis (2) | Ambient Synthesis (1) | Quality Assessment (1) | Enhancement (2) | ASR (8) | Other (19)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 84)
Gaia A. Bertolino, Yuwei Zhang, Tong Xia ... · arXiv
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings...
#2 TOP PAPER (Score: 84)
Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel ... · arXiv
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specia...
#3 TOP PAPER (Score: 84)
Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang ... · arXiv
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges o...
Tuesday, March 10, 2026
Xin Jing, Andreas Triantafyllopoulos, Jiadong Wang ... · arXiv
Recent advancements in speech captioning models have enabled the generation of rich, fine-grained captions for emotional speech. However, the evaluation of such captions remains a critical bottleneck: traditional N-gram metrics fail to capture semantic nuances, while LLM judges o...
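The N-gram bottleneck this abstract points to is easy to demonstrate: two captions conveying the same emotion can share zero surface tokens, so any overlap-based metric scores a faithful paraphrase as a total miss. A minimal illustration — the function below is a bare unigram precision, a stand-in for overlap metrics generally, not any specific benchmark's scorer:

```python
def unigram_precision(hypothesis: str, reference: str) -> float:
    """Fraction of hypothesis tokens that also appear in the reference
    (a bare-bones stand-in for N-gram overlap metrics such as BLEU-1)."""
    hyp = hypothesis.lower().split()
    ref = set(reference.lower().split())
    return sum(1 for tok in hyp if tok in ref) / len(hyp)

reference = "the speaker sounds joyful and excited"
paraphrase = "an elated enthusiastic voice"   # same meaning, different words

print(unigram_precision(paraphrase, reference))  # 0.0 despite matching semantics
print(unigram_precision(reference, reference))   # 1.0 for verbatim copy
```

A paraphrase with correct semantics scores 0.0 while an exact copy scores 1.0 — which is why caption evaluation falls back on embedding similarity or LLM judges, with the prompt-sensitivity issues the abstract goes on to discuss.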
Soumya Dutta · arXiv
Emotions play a central role in human communication, shaping trust, engagement, and social interaction. As artificial intelligence systems powered by large language models become increasingly integrated into everyday life, enabling them to reliably understand and generate human e...
Haoyuan Yang, Mu Yang, Jiamin Xie ... · arXiv
Recent advances in zero-shot voice conversion have exhibited potential in emotion control, yet the performance is suboptimal or inconsistent due to their limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion back...
Robin Doerfler, Lonce Wyse · arXiv
Engine sounds originate from sequential exhaust pressure pulses rather than sustained harmonic oscillations. While neural synthesis methods typically aim to approximate the resulting spectral characteristics, we propose directly modeling the underlying pulse shapes and temporal s...
Dehua Tao, Xuan Luo, Daxin Tan ... · arXiv
While large-scale omni-models have demonstrated impressive capabilities across various modalities, their strong performance heavily relies on massive multimodal data and incurs substantial computational costs. This work introduces Speech-Omni-Lite, a cost-efficient framework for ...
Laya Iyer, Angelina Wang, Sanmi Koyejo · EACL 2026
Advances in large language models (LLMs) have enabled significant capabilities in audio processing, resulting in state-of-the-art models now known as Large Audio Language Models (LALMs). However, minimal work has been done to measure audio understanding beyond automatic speech re...
Monday, March 09, 2026
Pol Buitrago, Pol Gàlvez, Oriol Pareras ... · arXiv
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resourc...
Avihu Dekel, Samuel Thomas, Takashi Fukada ... · arXiv
While autoregressive (AR) LLM-based ASR systems achieve strong accuracy, their sequential decoding limits parallelism and incurs high latency. We propose NLE, a non-autoregressive (NAR) approach that formulates speech recognition as conditional transcript editing, enabling fully ...
Nikita Kuzmin, Tao Zhong, Jiajun Deng ... · arXiv
End-to-end full-duplex speech models feed user audio through an always-on LLM backbone, yet the speaker privacy implications of their hidden representations remain unexamined. Following the VoicePrivacy 2024 protocol with a lazy-informed attacker, we show that the hidden states o...
Andong Li, Tong Lei, Zhihang Sun ... · arXiv
Although deep neural networks have facilitated significant progress of neural vocoders in recent years, they usually suffer from intrinsic challenges like opaque modeling, inflexible retraining under different input configurations, and parameter-performance trade-off. These inher...
Ayush Barik, Sofia Stoica, Nikhil Sarda ... · arXiv
Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusi...
Henry Li Xinyuan, Zexin Cai, Lin Zhang ... · arXiv
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice...
Hezhao Zhang, Huang-Cheng Chou, Shrikanth Narayanan ... · arXiv
Speech Large Language Models (LLMs) show great promise for speech emotion recognition (SER) via generative interfaces. However, shifting from closed-set classification to open text generation introduces zero-shot stochasticity, making evaluation highly sensitive to prompts. Addit...
Zihao Fang, Yingda Shen, Zifan Guan ... · arXiv
Whispered speech lacks vocal fold vibration and fundamental frequency, resulting in degraded acoustic cues and making whisper-to-normal (W2N) conversion challenging, especially with limited parallel data. We propose WhispEar, a bidirectional framework based on unified semantic re...
Shangeth Rajaa · arXiv
Speech-to-speech models handle turn-taking naturally but offer limited support for tool-calling or complex reasoning, while production ASR-LLM-TTS voice pipelines offer these capabilities but rely on silence timeouts, which lead to unnatural turn-taking. We present DualTurn, whic...
Lucas Rakotoarivony · arXiv
Quantization has become essential for the efficient deployment of speech processing systems. Although widely studied, most existing quantization methods were developed for vision and NLP architectures, while the specific challenges of audio signals remain largely overlooked. In p...
Phillip Long, Zachary Novack, Chris Donahue · arXiv
Autoregressive "language" models (LMs) trained on raw waveforms can be repurposed for lossless audio compression, but prior work is limited to 8-bit audio, leaving open whether such approaches work for practical settings (16/24-bit) and can compete with existing codecs. We benchm...
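The compression framing in this abstract rests on a standard identity: arithmetic coding under a predictive model spends about -log2 p(x_t | x_<t) bits per sample, so a model that predicts waveform samples well compresses them well. A minimal sketch of that bit accounting, with a Laplace-smoothed order-0 adaptive model standing in for the LM (illustrative only, not the paper's method):

```python
import math
from collections import Counter

def bits_uniform(samples, alphabet=256):
    # Baseline: a fixed uniform distribution over 8-bit sample values
    # costs exactly log2(256) = 8 bits per sample.
    return len(samples) * math.log2(alphabet)

def bits_adaptive(samples, alphabet=256):
    # Ideal arithmetic-coded length under an adaptive order-0 model
    # with Laplace smoothing: -sum_t log2 p(x_t | x_<t).
    counts = Counter()
    total_bits = 0.0
    for t, x in enumerate(samples):
        p = (counts[x] + 1) / (t + alphabet)
        total_bits += -math.log2(p)
        counts[x] += 1
    return total_bits

# Skewed "waveform": a quiet signal concentrated near zero amplitude.
data = [0, 1, 0, 2, 1, 0, 0, 1] * 200
print(bits_adaptive(data), "<", bits_uniform(data))
```

Even this crude model beats the 8-bit baseline on skewed data; the abstract's question is whether a waveform LM's much sharper predictions make this competitive with purpose-built codecs at 16/24-bit depth.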
Sunday, March 08, 2026
Longbiao Cheng, Shih-Chii Liu · ICASSP 2026
Recent studies have shown that post-deployment adaptation can improve the robustness of speech enhancement models in unseen noise conditions. However, existing methods often incur prohibitive computational and memory costs, limiting their suitability for on-device deployment. In ...
Saturday, March 07, 2026
Zahra Mansour, Verena Uslar, Dirk Weyhe ... · arXiv
Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation. This leads to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS and enable aut...
Wenjie Tian, Mingchen Shao, Bingshen Mu ... · arXiv
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle such CAVSR (AVSR inclu...
Friday, March 06, 2026
Gaia A. Bertolino, Yuwei Zhang, Tong Xia ... · arXiv
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings...
Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel ... · arXiv
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specia...
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni ... · arXiv
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact models understudied. We present RAPTOR, a Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition, a contro...
Daixian Li, Jun Xue, Yanzhen Ren ... · arXiv
Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors c...
Junhyeok Lee, Xiluo He, Jihwan Lee ... · arXiv
Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that sel...
Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng · arXiv
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant aco...
Changsong Liu, Tianrui Wang, Ye Ni ... · arXiv
Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, ad...
Hoseong Ahn, Jeongyun Chae, Yoonji Park ... · arXiv
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We pro...
Jinuo Sun, Yang Xiao, Sung Kyun Chung ... · arXiv
Accent variability remains a major source of errors in automatic speech recognition, yet most adaptation methods rely on parameter fine-tuning without understanding where accent information is encoded. We treat accent variation as an interpretable subspace in hidden representations and inv...
Thursday, March 05, 2026
Jihwan Lee, Parsa Razmara, Kevin Huang ... · arXiv
Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological ...
Marvin Lavechin, Elika Bergelson, Roger Levy · arXiv
Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribe...
Jielin Qiu, Zixiang Chen, Liangwei Yang ... · arXiv
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to ...
Yen-Shan Chen, Shih-Yu Lai, Ying-Jung Tsou ... · arXiv
While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the impercept...
Han Yin, Yang Xiao, Rohan Kumar Das ... · arXiv
Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake...
Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin ... · arXiv
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline lever...
Linghan Fang, Tianxin Xie, Li Liu · arXiv
Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue...
Akif Islam, Raufun Nahar, Md. Ekramul Hamid · IEEE Conference Paper
Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero...
Aemon Yat Fei Chiu, Yujia Xiao, Qiuqiang Kong ... · arXiv
Voice timbre attribute detection (vTAD) is the task of determining the relative intensity of timbre attributes between speech utterances. Voice timbre is a crucial yet inherently complex component of speech perception. While deep neural network (DNN) embeddings perform well in sp...