Audio ML Papers

Week of December 21 - December 28, 2025

Subcategories: All (14) | Speech Synthesis (1) | Music Synthesis (2) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (0) | ASR (1) | Other (10)

๐Ÿ† Top Papers This Week

#1 TOP PAPER (Score: 85)
Zhongren Dong, Bin Wang, Jing Han ... · arXiv
Neural Speech Codecs face a fundamental trade-off at low bitrates: preserving acoustic fidelity often compromises semantic richness. To address this, we introduce SACodec, a novel codec built upon an asymmetric dual-quantizer that employs our proposed Semantic Anchoring mechanism...
#2 TOP PAPER (Score: 83)
Lisan Al Amin, Vandana P. Janeja · ICDM 2025-MLC workshop
Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance heavily depends on the chosen kernel. Here...
#3 TOP PAPER (Score: 83)
Ye Tao, Xuenan Xu, Wen Wu ... · arXiv
Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, whi...
Friday, December 26, 2025
Ruihao Jing, Cheng Gong, Yu Jiang ... · arXiv
Rare words remain a critical bottleneck for speech-to-text systems. While direct fine-tuning improves recognition of target words, it often incurs high cost, catastrophic forgetting, and limited scalability. To address these challenges, we propose a training-free paradigm based o...
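The excerpt cuts off before the paradigm itself is described; as a point of reference only, below is a minimal sketch of one common training-free recipe for rare words, n-best rescoring against a user-supplied biasing list. This is illustrative, not necessarily this paper's approach; the function and bonus weight are hypothetical.

```python
# Generic training-free fix for rare words: re-rank the frozen recognizer's
# n-best hypotheses with a bonus for words from a user-supplied biasing list.
# (A common recipe, shown for context; not this paper's method.)

def rescore_with_bias(nbest, bias_words, bonus=2.0):
    """nbest: list of (text, log_prob) pairs from the ASR decoder."""
    bias = {w.lower() for w in bias_words}
    rescored = []
    for text, logp in nbest:
        hits = sum(1 for tok in text.lower().split() if tok in bias)
        rescored.append((text, logp + bonus * hits))
    return sorted(rescored, key=lambda x: x[1], reverse=True)

nbest = [("the patient saw doctor smith", -5.9),
         ("the patient saw doctor smythe", -6.3)]
print(rescore_with_bias(nbest, bias_words=["Smythe"])[0][0])
```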
Thursday, December 25, 2025
Liuyang Bai, Weiyi Lu, Li Guo · arXiv
Speech codecs are traditionally optimized for waveform fidelity, allocating bits to preserve acoustic detail even when much of it can be inferred from linguistic structure. This leads to inefficient compression and suboptimal performance on downstream recognition tasks. We propos...
Wednesday, December 24, 2025
Zhongren Dong, Bin Wang, Jing Han ... · arXiv
Neural Speech Codecs face a fundamental trade-off at low bitrates: preserving acoustic fidelity often compromises semantic richness. To address this, we introduce SACodec, a novel codec built upon an asymmetric dual-quantizer that employs our proposed Semantic Anchoring mechanism...
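The excerpt names the components but not their wiring; the sketch below shows only the generic pattern an asymmetric dual-quantizer suggests: a small "semantic" codebook plus a second codebook that quantizes the residual left by the first stage. Codebook sizes, dimensions, and class names are assumptions for illustration, not SACodec's actual design.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour vector quantizer with a straight-through estimator."""
    def __init__(self, codebook_size, dim):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                                   # z: (B, T, D)
        w = self.codebook.weight                            # (K, D)
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ w.t()
                + w.pow(2).sum(-1))                         # squared distances (B, T, K)
        idx = dist.argmin(dim=-1)
        q = self.codebook(idx)
        return z + (q - z).detach(), idx                    # straight-through gradient

class DualQuantizerCodec(nn.Module):
    """Hypothetical asymmetric dual-quantizer: a small semantic codebook plus an
    acoustic codebook that quantizes the residual of the first stage."""
    def __init__(self, dim=256, semantic_codes=512, acoustic_codes=1024):
        super().__init__()
        self.semantic_vq = VectorQuantizer(semantic_codes, dim)
        self.acoustic_vq = VectorQuantizer(acoustic_codes, dim)

    def forward(self, latents):                             # encoder output (B, T, D)
        sem_q, sem_idx = self.semantic_vq(latents)
        ac_q, ac_idx = self.acoustic_vq(latents - sem_q)    # residual (acoustic) path
        return sem_q + ac_q, (sem_idx, ac_idx)
```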
Haoyang Li, Xuyi Zhuang, Azmat Adnan ... · arXiv
Language Model (LM)-based generative modeling has emerged as a promising direction for TSE, offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens...
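As a rough illustration of the two-stage decoder-only setup described, the sketch below shows the kind of causal token LM each stage could be built from. How mixture and enrollment tokens are interleaved as conditioning is not specified in the excerpt, so that interface is left out and everything here is an assumption rather than GenTSE's architecture.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Minimal decoder-only transformer over a discrete token vocabulary."""
    def __init__(self, vocab_size, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                               # (B, T) int64
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(tokens), mask=causal)   # causal self-attention
        return self.head(h)                                  # next-token logits

# Stage 1 would autoregress coarse semantic tokens of the target speaker;
# Stage 2 would turn those semantic tokens into fine acoustic codec tokens.
stage1 = TinyCausalLM(vocab_size=1024)
stage2 = TinyCausalLM(vocab_size=2048)
logits = stage1(torch.randint(0, 1024, (2, 100)))            # (2, 100, 1024)
```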
Hongyu Wang, Chenda Li, Xin Zhou ... · AAAI 2026
Sound separation (SS) and target sound extraction (TSE) are fundamental techniques for addressing complex acoustic scenarios. While existing SS methods struggle with determining the unknown number of sound sources, TSE approaches require precisely specified clues to achieve optim...
Tuesday, December 23, 2025
Yicheng Gu, Junan Zhang, Chaoren Wang ... · arXiv
Neural vocoders and codecs reconstruct waveforms from acoustic representations, which directly impact the audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance....
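For readers unfamiliar with the upsampling-based time-domain family the abstract refers to, here is a minimal generator of that style: stacked transposed convolutions that turn frame-rate acoustic features into waveform samples. The upsampling rates and channel widths are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class UpsamplingVocoder(nn.Module):
    """Generic upsampling time-domain generator (illustrative sizes only)."""
    def __init__(self, n_mels=80, channels=256, rates=(8, 8, 4)):   # 8*8*4 = 256x hop
        super().__init__()
        self.pre = nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)
        ups, ch = [], channels
        for r in rates:
            # kernel = 2r with padding = r/2 gives an exact r-times upsampling
            ups += [nn.LeakyReLU(0.1),
                    nn.ConvTranspose1d(ch, ch // 2, kernel_size=2 * r,
                                       stride=r, padding=r // 2)]
            ch //= 2
        self.ups = nn.Sequential(*ups)
        self.post = nn.Conv1d(ch, 1, kernel_size=7, padding=3)

    def forward(self, mel):                                  # (B, n_mels, frames)
        return torch.tanh(self.post(self.ups(self.pre(mel))))

wave = UpsamplingVocoder()(torch.randn(1, 80, 50))           # (1, 1, 12800) samples
```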
Ye Tao, Xuenan Xu, Wen Wu ... · arXiv
Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, whi...
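The "diffusion inversion" that training-free editors rely on (and that the abstract blames for signal degradation) is sketched below in its standard DDIM form: run the deterministic sampler backwards to recover a noise latent that regenerates the input. `eps_model` is a placeholder noise predictor with an assumed signature, not any specific model's API.

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alphas_cumprod, text_cond, steps=50):
    """Map a clean latent x0 back towards a noise latent x_T by reversing the
    deterministic DDIM update. `eps_model(x, t, cond)` is a placeholder."""
    T = len(alphas_cumprod)
    ts = torch.linspace(0, T - 1, steps).long()
    x = x0
    for t_prev, t in zip(ts[:-1], ts[1:]):
        a_prev, a_t = alphas_cumprod[t_prev], alphas_cumprod[t]
        eps = eps_model(x, t_prev, text_cond)                 # noise estimate at t_prev
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps     # re-noise one step
    return x   # approximate x_T; inversion error is one source of degradation
```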
Chengwei Liu, Haoyin Yan, Shaofei Xue ... · arXiv
Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust inst...
Doyeop Kwak, Youngjoon Jang, Joon Son Chung · arXiv
The goal of this paper is to provide a new perspective on speech modeling by incorporating perceptual invariances such as amplitude scaling and temporal shifts. Conventional generative formulations often treat each dataset sample as a fixed representative of the target distributi...
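One simple way to make such invariances concrete is to treat gain-scaled and time-shifted copies of a target as equally valid, for example by scoring a prediction against the best of several transformed views. The sketch below shows that idea only; it is an assumption for illustration, not the paper's actual formulation.

```python
import torch

def random_invariance_transform(wave, max_shift=1600, gain_range=(0.5, 2.0)):
    """Apply a random amplitude scale and circular temporal shift (100 ms at
    16 kHz by default) to a batch of waveforms of shape (B, T)."""
    B, T = wave.shape
    gains = torch.empty(B, 1).uniform_(*gain_range)
    shifts = torch.randint(-max_shift, max_shift + 1, (B,))
    shifted = torch.stack([torch.roll(w, int(s)) for w, s in zip(wave, shifts)])
    return gains * shifted

def invariant_reconstruction_loss(pred, target, n_draws=4):
    """Score `pred` against the best of several gain/shift views of the target,
    so perceptually irrelevant mismatches are not penalised (one possible
    encoding of the invariance, assumed here)."""
    losses = torch.stack([
        (pred - random_invariance_transform(target)).abs().mean(dim=-1)
        for _ in range(n_draws)
    ])                                    # (n_draws, B)
    return losses.min(dim=0).values.mean()
```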
Monday, December 22, 2025
Fan Yu, Tao Wang, You Wu ... · arXiv
Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce Joy...
Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang ... · arXiv
We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings ac...
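The scaled contrastive recipe behind such audiovisual encoders generally follows the CLIP pattern; the sketch below shows a symmetric audio-video contrastive loss of that kind, without assuming PE-AV's exact objective, temperature, or scaling.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """CLIP-style contrastive objective over paired audio/video clips: matching
    pairs sit on the diagonal of the similarity matrix, and the audio-to-video
    and video-to-audio directions are averaged."""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                          # (B, B) similarities
    targets = torch.arange(a.size(0), device=logits.device)   # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```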
Wenyu Luo, Jinhui Chen · arXiv
Speech intelligibility assessment is essential for many speech-related applications. However, most objective intelligibility metrics are intrusive, as they require clean reference speech in addition to the degraded or processed signal for evaluation. Furthermore, existing metrics...
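The intrusive/non-intrusive distinction is easy to see in code: a non-intrusive predictor must map the degraded signal alone to a score, with no clean reference. The sketch below is a generic reference-free regressor of that kind, not the metric this paper proposes.

```python
import torch
import torch.nn as nn

class NonIntrusiveIntelligibility(nn.Module):
    """Tiny reference-free predictor: log-mel frames of the degraded signal
    alone -> a single intelligibility score in [0, 1] (generic sketch)."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, logmel):                       # (B, frames, n_mels)
        h, _ = self.rnn(logmel)
        return self.head(h.mean(dim=1)).squeeze(-1)  # utterance-level score

# Such models are typically trained to regress scores from an intrusive metric
# (computed with clean references) or from listening-test labels.
scores = NonIntrusiveIntelligibility()(torch.randn(4, 200, 64))   # (4,)
```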
Sunday, December 21, 2025
Lisan Al Amin, Vandana P. Janeja · ICDM 2025-MLC workshop
Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance heavily depends on the chosen kernel. Here...
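To make the kernel-sensitivity point concrete, the sketch below runs a cross-validated comparison of standard SVM kernels on placeholder utterance embeddings; the data and features are synthetic stand-ins, and this is not the kernel construction studied in the paper.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: fixed-size utterance embeddings (e.g. pooled spectral or SSL features),
# y: 1 = synthetic, 0 = bona fide. Random placeholder data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))
y = rng.integers(0, 2, size=200)

# The kernel choice is the critical hyper-parameter; a simple (if crude)
# strategy is to pick it by cross-validation on the small labelled set.
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{kernel:>8s}: CV accuracy = {score:.3f}")
```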
Riki Shimizu, Xilin Jiang, Nima Mesgarani · arXiv
Target speaker extraction (TSE) aims to isolate a desired speaker's voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved TSE performance, these methods typically ...
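As background on the flow-matching side, the sketch below shows a generic conditional flow-matching training step for TSE: regress the straight-line velocity from noise to the clean target, conditioned on the mixture and a speaker embedding from the reference utterance. The network interface is a placeholder, not this paper's model.

```python
import torch
import torch.nn as nn

def flow_matching_step(model, clean, mixture, spk_emb):
    """One conditional flow-matching training step (generic recipe).
    clean, mixture: (B, T) waveforms; spk_emb: (B, D) reference embedding;
    `model(x_t, t, mixture, spk_emb)` is a placeholder velocity predictor."""
    B = clean.size(0)
    t = torch.rand(B, 1)                        # random time in [0, 1]
    noise = torch.randn_like(clean)
    x_t = (1 - t) * noise + t * clean           # straight-line probability path
    target_velocity = clean - noise             # constant velocity along the path
    pred = model(x_t, t, mixture, spk_emb)
    return nn.functional.mse_loss(pred, target_velocity)
```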