Audio ML Papers

Week of March 01 - March 08, 2026

Subcategories: All (44) | Speech Synthesis (7) | Music Synthesis (3) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (2) | ASR (12) | Other (20)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 84)
Gaia A. Bertolino, Yuwei Zhang, Tong Xia ... · arXiv
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings...
#2 TOP PAPER (Score: 84)
Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel ... · arXiv
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specia...
#3 TOP PAPER (Score: 83)
Siminfar Samakoush Galougah, Pranav Pulijala, Ramani Duraiswami · arXiv
A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level a...
Saturday, March 07, 2026
Zahra Mansour, Verena Uslar, Dirk Weyhe ... · arXiv
Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation. This leads to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS and enable aut...
Wenjie Tian, Mingchen Shao, Bingshen Mu ... · arXiv
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle such CAVSR (AVSR inclu...
Friday, March 06, 2026
Gaia A. Bertolino, Yuwei Zhang, Tong Xia ... · arXiv
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings...
Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel ... · arXiv
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specia...
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni ... · arXiv
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact models understudied. We present RAPTOR, Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition, a contro...
Daixian Li, Jun Xue, Yanzhen Ren ... · arXiv
Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors c...
Junhyeok Lee, Xiluo He, Jihwan Lee ... · arXiv
Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that sel...
Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng · arXiv
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant aco...
Changsong Liu, Tianrui Wang, Ye Ni ... · arXiv
Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, ad...
Hoseong Ahn, Jeongyun Chae, Yoonji Park ... · arXiv
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We pro...
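One of the failure modes named above, repetition loops, can be screened for with a simple word n-gram counter. This is an illustrative sketch, not the paper's method; the function name and thresholds are hypothetical:

```python
def has_repetition_loop(text: str, n: int = 2, max_repeats: int = 2) -> bool:
    """Flag a transcript if any word n-gram occurs more than max_repeats times."""
    words = text.split()
    counts: dict[tuple, int] = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return any(c > max_repeats for c in counts.values())
```

A looping decode like "thank you thank you thank you thank you" trips the check, while ordinary sentences pass.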
Jinuo Sun, Yang Xiao, Sung Kyun Chung ... · arXiv
Accent variability remains a major source of errors in automatic speech recognition, yet most adaptation methods rely on parameter fine-tuning without understanding where accent information is encoded. We treat accent variation as an interpretable subspace in hidden representations and inv...
Thursday, March 05, 2026
Jihwan Lee, Parsa Razmara, Kevin Huang ... · arXiv
Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological ...
Marvin Lavechin, Elika Bergelson, Roger Levy · arXiv
Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribe...
Jielin Qiu, Zixiang Chen, Liangwei Yang ... · arXiv
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to ...
Yen-Shan Chen, Shih-Yu Lai, Ying-Jung Tsou ... · arXiv
While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the impercept...
Han Yin, Yang Xiao, Rohan Kumar Das ... · arXiv
Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake...
Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin ... · arXiv
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline lever...
Linghan Fang, Tianxin Xie, Li Liu · arXiv
Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue...
Akif Islam, Raufun Nahar, Md. Ekramul Hamid · IEEE Conference Paper
Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero...
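The "recognition accuracy" at stake in the entry above is conventionally measured as word error rate (WER), the word-level edit distance normalized by reference length. A minimal self-contained sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words, divided by reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # DP row for the empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution (0 if match)
        prev = cur
    return prev[-1] / len(r)
```

For example, one substitution against a three-word reference yields a WER of 1/3.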
Aemon Yat Fei Chiu, Yujia Xiao, Qiuqiang Kong ... · arXiv
Voice timbre attribute detection (vTAD) is the task of determining the relative intensity of timbre attributes between speech utterances. Voice timbre is a crucial yet inherently complex component of speech perception. While deep neural network (DNN) embeddings perform well in sp...
Wednesday, March 04, 2026
Cemal Hanilçi, Md Sahidullah, Tomi Kinnunen · arXiv
Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven...
Zachary Novack, Zack Zukowski, CJ Carr ... · ICASSP 2026
Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance-...
Fei Su, Cancan Li, Juan Liu ... · arXiv
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AV...
Kevin Wilkinghoff, Sarthak Yadav, Zheng-Hua Tan · arXiv
Training-free anomalous sound detection (ASD) based on pre-trained audio embedding models has recently garnered significant attention, as it enables the detection of anomalous sounds using only normal reference data while offering improved robustness under domain shifts. However,...
Fabian Ritter-Gutierrez, Md Asif Jalal, Pablo Peso Parada ... · arXiv
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whisper and voiced recordings and lack of paired data. We propose Fl...
Tuesday, March 03, 2026
Szu-Wei Fu, Rong Chao, Xuesong Yang ... · arXiv
Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion--perception tradeoff, and data curation remain unresolved....
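The signal-fidelity side of the distortion-perception tradeoff mentioned above is often quantified with scale-invariant SDR (SI-SDR). A toy implementation on plain Python lists (SI-SDR is a standard metric, not necessarily the paper's choice):

```python
import math

def si_sdr(ref: list[float], est: list[float]) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / ref_energy                      # project estimate onto reference
    target = [alpha * r for r in ref]             # scaled target component
    noise = [e - t for e, t in zip(est, target)]  # residual distortion
    return 10 * math.log10(sum(t * t for t in target) / sum(n * n for n in noise))
```

Because the target is rescaled by the projection, multiplying the estimate by any constant gain leaves the score unchanged, which is the point of the "scale-invariant" part.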
Mathuranathan Mayuravaani, W. Bastiaan Kleijn, Andrew Lensen ... · arXiv
This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increa...
Franziska Braun, Christopher Witzl, Florian Hönig ... · LREC 2026
Early and accessible detection of Alzheimer's disease (AD) remains a major challenge, as current diagnostic methods often rely on costly and invasive biomarkers. Speech and language analysis has emerged as a promising non-invasive and scalable approach to detecting cognitive impa...
Kashaf Gulzar, Korbinian Riedhammer, Elmar Nöth ... · arXiv
Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents...
Xin Wang, Ge Wanying, Junichi Yamagishi · arXiv
Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (S...
Monday, March 02, 2026
Hashim Ali, Nithin Sai Adupa, Surya Subramani ... · ICASSP
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we...
Siminfar Samakoush Galougah, Pranav Pulijala, Ramani Duraiswami · arXiv
A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level a...
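The interaural level differences (ILDs) the abstract refers to can be illustrated with a toy computation over two ear channels (function name and framing are hypothetical, not the paper's code):

```python
import math

def ild_db(left: list[float], right: list[float]) -> float:
    """Interaural level difference: ratio of left/right ear energies, in dB."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    return 10 * math.log10(e_left / e_right)
```

A source closer to the left ear yields higher left-channel energy and hence a positive ILD; doubling the left amplitude relative to the right gives about +6 dB.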
Minghui Wu, Xueling Liu, Jiahuan Fan ... · 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Singapore, 2025, pp. 1104-1109
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pa...
Minghui Wu, Haitao Tang, Jiahuan Fan ... · 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Singapore, 2025, pp. 1092-1097
Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly...
Ya Jiang, Ruoyu Wang, Jingxuan Zhang ... · arXiv
This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialo...
Loan Do, Thanh Ngoc Nguyen, Thanh Pham ... · arXiv
We introduce VietSuperSpeech, a large-scale Vietnamese automatic speech recognition (ASR) dataset of 52,023 audio-text pairs totaling 267.39 hours, with a distinctive focus on casual conversational speech. Unlike existing Vietnamese ASR corpora that predominantly feature read spe...
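As a quick sanity check on the corpus statistics quoted above, the average clip length follows directly from the stated totals:

```python
# VietSuperSpeech totals as stated in the abstract
hours = 267.39
pairs = 52023

# average utterance duration in seconds
avg_seconds = hours * 3600 / pairs
```

This works out to roughly 18.5 seconds per audio-text pair, consistent with long-form conversational clips rather than short read sentences.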
Kirill Borodin, Vasiliy Kudryavtsev, Maxim Maslov ... · arXiv
We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To eva...
Lixing He, Zhouxuan Chen, Mingshuai Liu ... · arXiv
We propose TQCodec, a neural audio codec designed for high-bitrate, high-fidelity music streaming. Unlike existing neural codecs that primarily target ultra-low bitrates (<= 16 kbps), TQCodec operates at 44.1 kHz and supports bitrates from 32 kbps to 128 kbps, aligning with the st...
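For context on how codec bitrates like those above arise: in a typical residual-vector-quantized neural codec, the bitrate is frames per second times codebooks per frame times bits per codebook index. The numbers below are illustrative, not TQCodec's actual configuration:

```python
import math

def codec_bitrate_kbps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bitrate = frames/s * codebooks/frame * bits per codebook index, in kbps."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * n_codebooks * bits_per_index / 1000
```

For example, 75 frames/s with 8 codebooks of 1024 entries (10 bits each) gives 6 kbps; reaching the 32-128 kbps range quoted above requires proportionally more frames or codebooks.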
Sunday, March 01, 2026
Pengfei Zhang, Tianxin Xie, Minghao Yang ... · arXiv
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is ...
Hongrui Wang, Fan Zhang, Zhiyuan Yu ... · ICLR 2026
Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between trac...
Yanir Marmor, Arad Zulti, David Krongauz ... · arXiv
Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 s...