Audio ML Papers

Last 7 Days (May 20 - May 27, 2026)

Subcategories: All (15) | Speech Synthesis (3) | Music Synthesis (4) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (0) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (5)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Haoyang Zhang, Jun Chen, Donghang Wu ... · arXiv
Recent advances in spoken dialogue language models have shifted from turn-based to full-duplex designs, where the model continuously listens to the user while generating responses. However, existing duplex backbones still lack a native channel for in-conversation planning and too...
#2 TOP PAPER (Score: 86)
Yifan Dai, Zhenhua Wu, Bohan Zeng ... · arXiv
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) c...
#3 TOP PAPER (Score: 86)
Bin Lin, Bo Zhao, Boyong Wu ... · arXiv
Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems acro...
Thursday, May 21, 2026
Hao Jiang, Edgar Choueiri · arXiv
Coordinate-conditioned neural networks can generate head-tracked personal sound zone (PSZ) loudspeaker filters in real time, but they are sensitive to localization uncertainty. Small fluctuations in estimated listener coordinates, caused by optical distortion, temporary occlusion...
Zhiqi Ai, Han Cheng, Shiyi Mu ... · TASLP
User-defined keyword spotting (KWS) is crucial for personalized voice interaction, yet existing methods face several challenges: (1) insufficient discriminability among confusable words, (2) performance inconsistency across speakers with varying pronunciations, and (3) high data ...
Zachary Novack, Stephen Brade, Haven Kim ... · arXiv
Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. ...
Jinhyeok Yang, Hyeongju Kim, Yechan Yu ... · arXiv
While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves a...
Diep Luong, Konstantinos Drossos, Mikko Heikkinen ... · arXiv
Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but noise for a phone call at the same location. Most current audio denoising sy...
Wednesday, May 20, 2026
Hongrui Zhang, Daiqing Wu, Yangyang Li ... · IJCAI 2026 Survey Track
Multimodal Emotion Recognition (MER) focuses on identifying and interpreting emotions from modality-compound inputs. Closely mirroring human cognitive processes in real-world environments, MER has drawn substantial attention from both academia and industry. Recently, a paradigm s...
Zhihan Guo, Wenqian Cui, Guan-Ting Lin ... · arXiv
Reasoning has become a defining capability of modern foundation models, yet its development in the audio modality remains limited. Audio poses challenges that are distinct from those of text and vision. It is continuous, temporally dense, and contains linguistic, paralinguistic, ...
Fang-Chih Hsieh, Wei-Jaw Lee, Chun-Ping Wang ... · IEEE ICME 2026 Grand Challenge Paper
This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite the rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietar...
Junyoung Koh · arXiv
Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isola...
Ilai Zaidel, Ori Engel, Bar Engel ... · arXiv
We propose a deep beamforming framework for enhancing target speaker(s) in multi-speaker environments. A deep neural network (DNN) is trained to estimate beamforming weights directly from noisy multichannel inputs while satisfying linear spatial constraints through an adaptive mu...
Semin Kim, Seungjun Chung, Taehong Moon ... · arXiv
Recent advances in text-to-speech (TTS) models show impressive speech naturalness and quality, yet the role of large-scale open data in driving this progress remains underexplored. In this work, we introduce Raon-OpenTTS, an open TTS model that performs competitively with state-o...
Shinnosuke Taksuka, Hideo Mukai · arXiv
This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as...