Audio ML Papers

Last 7 Days (May 05 - May 12, 2026)

Subcategories: All (27) | Speech Synthesis (4) | Music Synthesis (4) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (14)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 93)
You Qin, Kai Liu, Shengqiong Wu ... · arXiv
Audio-Visual Intelligence (AVI) has emerged as a central frontier in artificial intelligence, bridging auditory and visual modalities to enable machines that can perceive, generate, and interact in the multimodal real world. In the era of large foundation models, joint modeling of...
#2 TOP PAPER (Score: 91)
Davide Marincione, Michele Mancusi, Giorgio Strano ... · arXiv
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to ≈70% over the state of the art...
#3 TOP PAPER (Score: 86)
Zijun Cui, Xiulong Liu, Hao Fang ... · arXiv
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark...
Friday, May 08, 2026
Qiqi He, Dichucheng Li, Xiaoheng Sun ... · ACM International Conference on Multimedia Retrieval (ICMR 2026)
Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility...
Shilpa Chandra, Matteo Pettenò, Nicholas Evans ... · arXiv
The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER). Performance estimates dependent on the classifier and operating point provide an incomplete or even misleading characterisation...
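For context, the EER this entry critiques is the operating point where the false-acceptance and false-rejection rates coincide. A minimal pure-Python sketch of that computation (the score lists and the `equal_error_rate` helper are illustrative, not taken from the paper):

```python
def equal_error_rate(genuine, impostor):
    # Sweep thresholds over all observed scores; the EER is the error
    # rate at the threshold where FAR and FRR are closest.
    thresholds = sorted(genuine + impostor)
    best = None
    for t in thresholds:
        far = sum(s >= t for s in impostor) / len(impostor)  # false accept rate
        frr = sum(s < t for s in genuine) / len(genuine)     # false reject rate
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)
    return best[1]

# Perfectly separated scores give an EER of 0.
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # → 0.0
```

The paper's point is visible even in this sketch: the number returned depends on the score distributions a particular classifier produces, so it characterises the classifier as much as the anonymisation itself.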
Thursday, May 07, 2026
Harin Lee, Rainer Polak, Manuel Anglada-Tort ... · Proceedings of the Annual Meeting of the Cognitive Science Society
Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-i...
Yan Zhuang, Minhao Liu, Yanru Zhang ... · arXiv
Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combinations...
Weilong Huang, Le Nhat Tam Huynh, Oliver Thiergart ... · arXiv
Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural directional...
Wonwoo Jeong · arXiv
In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 co...
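The FAD this entry analyses is the Fréchet (2-Wasserstein) distance between Gaussian fits of reference and generated embedding statistics. A one-dimensional sketch of that distance, where it reduces to a squared mean gap plus a squared standard-deviation gap (the real metric uses multivariate means and covariances from a frozen audio embedding model; the function name and sample values here are illustrative):

```python
import statistics

def frechet_distance_1d(xs, ys):
    # Squared 2-Wasserstein distance between Gaussian fits of two 1-D samples.
    # In 1-D: (mu_x - mu_y)^2 + (sd_x - sd_y)^2.
    mu_x, mu_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mu_x - mu_y) ** 2 + (sd_x - sd_y) ** 2

print(frechet_distance_1d([0.0, 2.0], [1.0, 3.0]))  # → 1.0
```

This makes the abstract's two criticisms concrete: the embedding that produces `xs` and `ys` fixes the cost, and the Gaussian fit keeps only first- and second-order statistics of each sample.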
Ilya Borovik · Transactions of the International Society for Music Information Retrieval, 9(1), 144-163, 2026
Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. ...
Julius Richter, Yoshiki Masuyama, Christoph Boeddeker ... · arXiv
We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to br...
Amir Ivry · arXiv
Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply degrading answers for...
Guanrou Yang, Tian Tan, Qian Chen ... · arXiv
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned from...
Rixi Xu, Qingyu Liu, Haitao Li ... · arXiv
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified representation...
Lisan Al Amin, Rakib Hossain, Mahbubul Islam ... · arXiv
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio...
Wednesday, May 06, 2026
Cyril Allauzen, Tom Bagby, Georg Heigold ... · arXiv
The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a single...
Rajeshwar Tripathi, Sandeep Kumar, Monika Aggarwal ... · arXiv
This study presents a bio-inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). State-of-the-art methods often fail to resolve dense low-frequency harmonic structures in vessel propulsion signals under high noise conditions, whi...
Leying Zhang, Bowen Shi, Haibin Wu ... · arXiv
The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional f...
Dongheon Lee, Ashutosh Pandey, Sanjeel Parekh ... · arXiv
While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, ...
Yangchen Yu, Qian Chen, Jia Li ... · arXiv
Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-...
Xuanhao Zhang, Chang Li · arXiv
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains...
Yukun Chen, Tianrui Wang, Zhaoxi Mu ... · arXiv
High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly necessary...
Tuesday, May 05, 2026
Jaavid Aktar Husain, Dorien Herremans · arXiv
Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs ...
Ragib Amin Nihal, Benjamin Yen, Runwu Shi ... · arXiv
Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data....
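Task vector arithmetic, as used here, composes models by adding the fine-tuned-minus-base weight deltas back onto a shared base checkpoint. A toy sketch of the general recipe (the dict-of-lists weight format and the `merge_task_vectors` helper are illustrative, not the paper's implementation):

```python
def merge_task_vectors(base, finetuned_models, scale=1.0):
    # Task arithmetic: merged = base + scale * sum_i (finetuned_i - base),
    # applied elementwise per named parameter tensor.
    merged = {}
    for name, base_w in base.items():
        deltas = [
            sum(m[name][i] - base_w[i] for m in finetuned_models)
            for i in range(len(base_w))
        ]
        merged[name] = [w + scale * d for w, d in zip(base_w, deltas)]
    return merged

base = {"layer": [1.0, 1.0]}
models = [{"layer": [1.5, 1.0]}, {"layer": [1.0, 0.5]}]
print(merge_task_vectors(base, models))  # → {'layer': [1.5, 0.5]}
```

Because only the deltas are exchanged, each institution can contribute a fine-tuned encoder without sharing its raw training audio, which is the property the abstract highlights.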
Zijian Zhao, Dian Jin, Zijing Zhou ... · arXiv
Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limitations...
P. H. Hai, L. T. Minh, L. H. Son · arXiv
Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world environments...
Jingyao Gong · arXiv
MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-t...