Audio ML Papers

Last 7 Days (February 25 - March 04, 2026)

Subcategories: All (23) | Speech Synthesis (3) | Music Synthesis (4) | Ambient Synthesis (2) | Quality Assessment (0) | Enhancement (2) | ASR (5) | Other (7)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 83)
Zeyu Xie, Chenxing Li, Qiao Jin ... · arXiv
Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discrimi...
#2 TOP PAPER (Score: 83)
Trung Dang, Sharath Rao, Ananya Gupta ... · arXiv
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are sig...
#3 TOP PAPER (Score: 83)
Yuzhu Wang, Archontis Politis, Konstantinos Drossos ... · IEEE Transactions on Audio, Speech and Language Processing
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both featu...
Monday, March 02, 2026
Hashim Ali, Nithin Sai Adupa, Surya Subramani ... · ICASSP
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we...
Siminfar Samakoush Galougah, Pranav Pulijala, Ramani Duraiswami · arXiv
A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level a...
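The interaural level cue mentioned here is simple to compute; a minimal illustration of the generic formula (not the paper's scattering model):

```python
import numpy as np

def interaural_level_difference_db(left: np.ndarray, right: np.ndarray) -> float:
    """ILD in dB between two ear signals: the level cue that body
    scattering makes location-dependent."""
    eps = 1e-12  # guard against log of zero for silent signals
    return 10 * np.log10((np.mean(left**2) + eps) / (np.mean(right**2) + eps))

# Toy signals: the "near" ear receives twice the amplitude of the far ear.
rng = np.random.default_rng(0)
s = rng.standard_normal(16000)
print(f"{interaural_level_difference_db(2 * s, s):.1f} dB")  # ~6.0 dB
```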
Minghui Wu, Xueling Liu, Jiahuan Fan ... · 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Singapore, 2025, pp. 1104-1109
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pa...
Minghui Wu, Haitao Tang, Jiahuan Fan ... · 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Singapore, 2025, pp. 1092-1097
Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly...
Ya Jiang, Ruoyu Wang, Jingxuan Zhang ... · arXiv
This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialo...
Loan Do, Thanh Ngoc Nguyen, Thanh Pham ... · arXiv
We introduce VietSuperSpeech, a large-scale Vietnamese automatic speech recognition (ASR) dataset of 52,023 audio-text pairs totaling 267.39 hours, with a distinctive focus on casual conversational speech. Unlike existing Vietnamese ASR corpora that predominantly feature read spe...
Lixing He, Zhouxuan Chen, Mingshuai Liu ... · arXiv
We propose TQCodec, a neural audio codec designed for high-bitrate, high-fidelity music streaming. Unlike existing neural codecs that primarily target ultra-low bitrates (≤ 16 kbps), TQCodec operates at 44.1 kHz and supports bitrates from 32 kbps to 128 kbps, aligning with the st...
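For context on the bitrate arithmetic of discrete codecs, a back-of-envelope sketch; the token rates and codebook sizes below are illustrative assumptions, not TQCodec's published configuration:

```python
import math

def codec_bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a discrete codec: tokens/sec x codebooks x bits per token."""
    return frame_rate_hz * n_codebooks * math.log2(codebook_size)

# Illustrative numbers: an 86 Hz token rate with 1024-entry codebooks gives
# 860 bps per codebook, so reaching 32-128 kbps by stacking RVQ levels alone
# would take dozens of codebooks - one reason high-bitrate codecs are rare.
print(codec_bitrate_bps(86, 4, 1024) / 1000, "kbps")   # 3.44 kbps
print(codec_bitrate_bps(86, 32, 1024) / 1000, "kbps")  # 27.52 kbps
```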
Sunday, March 01, 2026
Pengfei Zhang, Tianxin Xie, Minghao Yang ... · arXiv
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is ...
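A minimal sketch of a REPA-style alignment objective, with made-up student/teacher dimensions; the paper's contribution concerns which layers to supervise, which this sketch leaves as an open choice:

```python
import torch
import torch.nn.functional as F

def repa_alignment_loss(hidden: torch.Tensor, teacher: torch.Tensor,
                        proj: torch.nn.Module) -> torch.Tensor:
    """Negative cosine similarity between projected student hidden states
    and frozen teacher features, averaged over all positions."""
    z = proj(hidden)  # (batch, seq, teacher_dim)
    return -F.cosine_similarity(z, teacher.detach(), dim=-1).mean()

# Illustrative shapes only.
proj = torch.nn.Linear(512, 768)     # student width -> teacher width
hidden = torch.randn(2, 100, 512)    # hidden states from one chosen layer
teacher = torch.randn(2, 100, 768)   # pretrained teacher features
loss = repa_alignment_loss(hidden, teacher, proj)
```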
Hongrui Wang, Fan Zhang, Zhiyuan Yu ... · ICLR 2026
Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between trac...
Yanir Marmor, Arad Zulti, David Krongauz ... · arXiv
Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 s...
Saturday, February 28, 2026
Seunghyun Oh, Malek Itani, Aseem Gauri ... · arXiv
Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We int...
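The per-source control the abstract contrasts with global suppression reduces, at its simplest, to remixing separated stems with independent gains; a generic sketch, not the paper's system:

```python
import numpy as np

def remix(sources: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """Per-source remixing: separated stems scaled independently and summed,
    in contrast to a single global noise-suppression switch."""
    return (gains[:, None] * sources).sum(axis=0)

sources = np.random.randn(3, 16000)  # e.g. speech, traffic, music stems
gains = np.array([1.0, 0.2, 0.6])    # user-chosen per-source levels
out = remix(sources, gains)
```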
Yinghao Ma, Haiwen Xia, Hewei Gao ... · arXiv
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under ...
Sen Zhang, Jianguo Wei, Wenhuan Lu ... · ICASSP 2026
The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache usage, which is pr...
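To see why the KV cache dominates decoding memory, a back-of-envelope estimate assuming Whisper-large-scale dimensions (32 decoder layers, d_model = 1280, fp16); exact figures depend on the model variant:

```python
def kv_cache_bytes(n_layers: int, seq_len: int, d_model: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size for MHA: two tensors (K and V) of shape
    (seq_len, d_model) per layer, growing linearly with decoded length."""
    return 2 * n_layers * seq_len * d_model * bytes_per_elem

# Self-attention cache after 448 decoded tokens, plus cross-attention
# cache over the fixed 1500-frame encoder output, in fp16:
self_attn = kv_cache_bytes(32, 448, 1280)
cross_attn = kv_cache_bytes(32, 1500, 1280)
print(f"{(self_attn + cross_attn) / 2**20:.1f} MiB per sequence")  # ~304 MiB
```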
Jinhan Xu, Xing Tang, Houpeng Yang ... · arXiv
Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. Though recent diffusion-based models produce high quality generations, they tend to s...
Friday, February 27, 2026
Heinrich Dinkel, Xingwei Sun, Gang Li ... · arXiv
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this ...
Keita Goto, Takashi Maekaku, Jin Sakuma ... · ICASSP 2026
Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers, learnable tokens appended to eac...
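A minimal sketch of the register-token idea, learnable vectors appended to each streaming chunk before attention; the count and placement here are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class OnlineRegisters(nn.Module):
    """Append learnable register tokens to each chunk so streaming attention
    has stand-in context where future frames are missing."""
    def __init__(self, n_registers: int, d_model: int):
        super().__init__()
        self.registers = nn.Parameter(torch.randn(n_registers, d_model) * 0.02)

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        # chunk: (batch, chunk_len, d_model)
        regs = self.registers.expand(chunk.size(0), -1, -1)
        return torch.cat([chunk, regs], dim=1)  # registers attended to, then dropped

x = torch.randn(4, 40, 256)            # one streaming chunk
extended = OnlineRegisters(4, 256)(x)  # (4, 44, 256)
```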
Thursday, February 26, 2026
Zeyu Xie, Chenxing Li, Qiao Jin ... · arXiv
Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discrimi...
Trung Dang, Sharath Rao, Ananya Gupta ... · arXiv
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are sig...
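The sequence-length mismatch the abstract refers to is easy to quantify; a sketch with illustrative token rates (not the paper's):

```python
def acoustic_tokens(duration_s: float, frame_rate_hz: float,
                    n_codebooks: int = 1) -> int:
    """Token count under fixed-frame-rate tokenization: every second of
    audio costs the same number of tokens regardless of content."""
    return int(duration_s * frame_rate_hz) * n_codebooks

# 10 s of speech at a 75 Hz single-codebook tokenizer -> 750 tokens,
# versus roughly 25-30 text tokens for the same sentence.
print(acoustic_tokens(10, 75))      # 750
print(acoustic_tokens(10, 50, 8))   # 4000 with 8 flattened codebooks
```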
Sanjid Hasan, Risalat Labib, A H M Fuad ... · arXiv
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, w...
Wednesday, February 25, 2026
Songjun Cao, Yuqi Li, Yunpeng Luo ... · arXiv
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains...
Yuzhu Wang, Archontis Politis, Konstantinos Drossos ... · IEEE Transactions on Audio, Speech and Language Processing
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both featu...
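A generic sketch of the parallel-stream alternative: two branches over the same input at different temporal resolutions, fused afterwards. The strides and operators are placeholders, not the authors' architecture:

```python
import torch
import torch.nn as nn

class ParallelStreams(nn.Module):
    """Two parallel streams at different temporal resolutions, fused - the
    general alternative to forcing one sequential stream to track both
    fast spectral and slow spatial dynamics."""
    def __init__(self, d: int):
        super().__init__()
        self.fast = nn.Conv1d(d, d, kernel_size=3, padding=1)  # frame-rate branch
        self.slow = nn.Sequential(                             # 4x downsampled branch
            nn.AvgPool1d(4), nn.Conv1d(d, d, 3, padding=1),
            nn.Upsample(scale_factor=4, mode="nearest"))
        self.fuse = nn.Conv1d(2 * d, d, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, d, time)
        return self.fuse(torch.cat([self.fast(x), self.slow(x)], dim=1))

y = ParallelStreams(64)(torch.randn(2, 64, 160))  # time must divide by 4
```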
Yuxuan Chen, Peize He, Haoyuan Xu ... · arXiv
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task tr...
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen ... · LREC 2026
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien...