Audio ML Papers

Last 7 Days (May 06 - May 13, 2026)

Subcategories: All (43) | Speech Synthesis (5) | Music Synthesis (7) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (4) | ASR (1) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (23)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Yeongtak Oh, Dongwook Lee, Sangkwon Park ... · arXiv
While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language; unified omnimodal benchmarking that jointly covers text, image, and audio is still limited and lacks the methodological rigor ...
#2 TOP PAPER (Score: 86)
Zijun Cui, Xiulong Liu, Hao Fang ... · arXiv
Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benc...
#3 TOP PAPER (Score: 84)
Xiaoming Ren, Ru Zhen, Chao Li ... · arXiv
Inspired by the development of OpenClaw, there is a growing demand for mobile-based personal agents capable of handling complex and intuitive interactions. In this technical report, we introduce X-OmniClaw, a unified mobile agent designed for multimodal understanding and interact...
Monday, May 11, 2026
Konstantinos Soiledis, Maximos Kaliakatsos-Papakostas, Dimos Makris ... · arXiv
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity...
Qijie You, Hao Liang, Mingrui Chen ... · arXiv
As video becomes increasingly central to information dissemination and multimodal large language models (MLLMs) continue to advance, evaluating video retrieval has become increasingly important. In realistic search scenarios, this requires matching short user queries to long-form...
Haowen Li, Tianxiang Li, Yi Yang ... · ICML 2026
The advancement of diffusion-based text-to-music generation has opened new avenues for zero-shot music editing. However, existing methods fail to achieve stem-specific timbre transfer, which requires altering specific stems while strictly preserving the background accompaniment. ...
Ege Erdem, Shoichi Koyama, Tomohiko Nakamura ... · arXiv
Reconstructing a 3D sound field from sparse microphone measurements is a fundamental yet ill-posed problem, which we address through Acoustic Transfer Function (ATF) magnitude estimation. ATF magnitude encapsulates key perceptual and acoustic properties of a physical space with a...
Dimos Makris, András Barják, Maximos Kaliakatsos-Papakostas · 2026 IEEE World Congress on Computational Intelligence, IJCNN Track
Most recent advances in audio dereverberation focus almost exclusively on speech, leaving percussive and drum signals largely unexplored despite their importance in music production. Percussive dereverberation poses distinct challenges due to sharp transients and dense temporal s...
Piotr Kawa, Kornel Howil, Piotr Borycki ... · arXiv
Explainable AI (XAI) has achieved remarkable success in image classification, yet the audio domain lacks equally mature solutions. Current methods apply vision-based attribution techniques to spectrograms, overlooking fundamental differences between visual and acoustic signals. W...
Yakun Liu, Hai Luan, Dong Liu ... · 9 pages, 5 figures, IEEE conference format
In new media art creation, the mapping between vision and hearing is often subjective. As a classic carrier of sound visualization, Chladni patterns have great potential in building audio-visual mapping mechanisms. However, existing tools face pain points: high technical barriers...
Alejandro Luebs, Mithilesh Vaidya, Ishaan Kumar ... · arXiv
The performance of audio latent diffusion models is primarily governed by generator expressivity and the modelability of the underlying latent space. While recent research has focused primarily on the former, as well as improving the reconstruction fidelity of audio codecs, we de...
Sunday, May 10, 2026
Tianrui Wang, Ziyang Ma, Yizhou Peng ... · arXiv
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-dri...
Dong Yang, Yiyi Cai, Haoyu Zhang ... · arXiv
Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Ma...
Yifan Wang, Peiwu Wang, Yunxian Chi ... · ICMR 2026 (Main Track, Long Paper)
Multimodal Intent Recognition (MIR) aims to understand complex user intentions by leveraging text, video, and audio signals. However, existing approaches face two key challenges: (1) overlooking intricate cross-modal interactions for distinguishing consistent and inconsistent cue...
Leduo Chen, Junchuan Zhao, Shengchen Li · arXiv
Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipel...
Saturday, May 09, 2026
Yuxin Kong, Peng Yang, Chongbin Yi ... · IEEE ICME 2026
Text-to-image (T2I) generation using multiple conditions enables fine-grained user control on the generated image. Yet, incorporating multi-condition inputs incurs substantial computation and communication overhead, due to additional preprocessing subtasks and control optimizatio...
Tao Yu, Yiming Ding, Shenghua Chai ... · arXiv
Current omni-modal benchmarks mainly evaluate models under settings where multiple modalities are provided simultaneously, while the ability to start from audio alone and actively search for cross-modal evidence remains underexplored. In this paper, we introduce Omni-Deep...
Zheng Wang, Xiaobin Rong, Hang Su ... · arXiv
Language model (LM)-based speech enhancement (SE) can generate natural-sounding speech, but under severe noise it often suffers from unreliable conditioning, leading to perceptually plausible yet linguistically incorrect outputs. To address this issue, we propose L3-SE, a noise-i...
Aishwarya Fursule, Shruti Kshirsagar, Anderson R. Avila · SMC 2026 conference
Audio deepfake detection systems are increasingly deployed in high-stakes security applications, yet their fairness across demographic groups remains critically underexamined. Prior work measures gender disparity but does not investigate where it comes from or how to fix it syste...
Friday, May 08, 2026
Qiqi He, Dichucheng Li, Xiaoheng Sun ... · ACM International Conference on Multimedia Retrieval (ICMR 2026)
Chord generation is an inherently constrained creative task that requires balancing stylistic diversity with music-theoretic feasibility. Existing approaches typically entangle candidate generation and constraint enforcement within a single model, making the diversity-feasibility...
Xiaomin Yu, Yijiang Li, Yuhui Zhang ... · arXiv
Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimoda...
Hamze Hammami, Nidhal Abdulaziz · arXiv
Discovering structure in biological signals without supervision is a fundamental problem in computational intelligence, yet existing bioacoustic methods assume vocal production models or predefined semantic units, leaving non-vocal species poorly served. This work introduces BeeV...
Shilpa Chandra, Matteo Pettenò, Nicholas Evans ... · arXiv
The evaluation of voice anonymisation remains challenging. Current practice relies on automatic speaker verification metrics such as the equal error rate (EER). Performance estimates dependent on the classifier and operating point provide an incomplete or even misleading characte...
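The EER the entry refers to is the operating point where the false-acceptance and false-rejection rates cross. A minimal sketch of how it is computed from raw trial scores (synthetic scores here for illustration; real systems use verification scores from an ASV classifier):

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """EER: threshold where the false-acceptance rate (impostor trials
    scoring at or above threshold) equals the false-rejection rate
    (genuine trials scoring below threshold)."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    fars = np.array([np.mean(impostor_scores >= t) for t in thresholds])
    frrs = np.array([np.mean(genuine_scores < t) for t in thresholds])
    idx = np.argmin(np.abs(fars - frrs))  # closest crossing point
    return (fars[idx] + frrs[idx]) / 2

# Synthetic, well-separated score distributions for illustration only.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)   # same-speaker trial scores
impostor = rng.normal(0.0, 1.0, 1000)  # different-speaker trial scores
eer = equal_error_rate(genuine, impostor)
```

As the abstract notes, the resulting number depends on both the classifier producing the scores and the chosen operating point, which is exactly why a single EER can be an incomplete characterisation.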
Manan Mittal, Ryan M. Corey, Diego Cuji ... · arXiv
In dynamic acoustic environments characterized by time-varying interferers and moving sources, effective beamforming requires accurately identifying stationary regions over time. Traditional Capon beamformers rely on the instantaneous ensemble covariance matrix, which is inaccess...
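For context, the Capon (MVDR) beamformer the entry contrasts against minimises output power subject to unit gain in the look direction, with weights w = R⁻¹d / (dᴴR⁻¹d). A minimal narrowband sketch, assuming a hypothetical 4-microphone uniform linear array and using the sample covariance as a stand-in for the inaccessible ensemble covariance:

```python
import numpy as np

def capon_weights(R, steering):
    """Capon / MVDR weights: w = R^-1 d / (d^H R^-1 d)."""
    num = np.linalg.inv(R) @ steering
    return num / (steering.conj() @ num)

n_mics, n_snapshots = 4, 2000
rng = np.random.default_rng(1)

def steer(theta):
    # Narrowband steering vector, half-wavelength element spacing.
    return np.exp(-1j * np.pi * np.arange(n_mics) * np.sin(theta))

target, interferer = steer(0.0), steer(0.6)
s = rng.normal(size=n_snapshots) + 1j * rng.normal(size=n_snapshots)
i = 3 * (rng.normal(size=n_snapshots) + 1j * rng.normal(size=n_snapshots))
noise = 0.1 * (rng.normal(size=(n_mics, n_snapshots))
               + 1j * rng.normal(size=(n_mics, n_snapshots)))
x = np.outer(target, s) + np.outer(interferer, i) + noise

# Sample covariance over all snapshots -- exactly the quantity that
# becomes unreliable when interferers and sources are time-varying.
R = x @ x.conj().T / n_snapshots
w = capon_weights(R, target)
y = w.conj() @ x  # beamformer output
```

The weights keep unit gain toward the target while placing a deep null on the interferer; with moving sources, no single covariance estimate describes all snapshots, which is the failure mode the paper addresses.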
Yassin Terraf, Youssef Iraqi · IEEE International Conference on Multimedia and Expo (ICME) 2026
Closed-Set speaker identification aims to assign a speech utterance to one of a predefined set of enrolled speakers and requires robust modeling of speaker-specific characteristics across multiple temporal scales. While recent deep learning approaches have achieved strong perform...
Emma Coletta, Massimiliano Todisco, Michele Panariello ... · arXiv
We introduce Latent Secret Spin (LSS), a blind speech watermarking method based on geometric operations in codec latent space. Based upon orthogonal rotations to principal components, LSS induces imperceptible but detectable covariance signatures according to a pseudo-random wate...
Thursday, May 07, 2026
Harin Lee, Rainer Polak, Manuel Anglada-Tort ... · Proceedings of the Annual Meeting of the Cognitive Science Society
Music comprises two core structural components, melody and rhythm, that vary widely across cultures. Whether these components coevolve in a coupled way or follow independent trajectories remains unclear. We introduce a novel computational pipeline to extract vocal melodic pitch-i...
Yan Zhuang, Minhao Liu, Yanru Zhang ... · arXiv
Multimodal Emotion Recognition (MER) has attracted growing attention with the rapid advancement of human-computer interaction. However, different modalities exhibit substantial discrepancies in semantics, quality, and availability, leading to highly heterogeneous modality combina...
Weilong Huang, Le Nhat Tam Huynh, Oliver Thiergart ... · arXiv
Recently, neural directional filtering (NDF) has been introduced as a flexible approach for reconstructing a virtual directional microphone (VDM) with a desired directivity pattern for spatial sound capture. Building on this idea, we propose NDF+, which enables joint neural direc...
Wonwoo Jeong · arXiv
In audio generation evaluation, Fréchet Audio Distance (FAD) is a 2-Wasserstein distance with structural constraints for both primitives: the cost is a frozen embedding pullback whose invariance set hides severe artifacts, and the coupling is a Gaussian fit that dilutes rank-1 co...
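Concretely, the Gaussian coupling the entry critiques is the closed-form 2-Wasserstein distance between Gaussians fitted to reference and generated embeddings: FAD = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^{1/2}). A minimal NumPy sketch on synthetic embeddings (a real evaluation would use embeddings from a pretrained audio encoder):

```python
import numpy as np

def sqrtm_psd(M):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def frechet_audio_distance(emb_ref, emb_gen):
    """FAD: squared 2-Wasserstein distance between Gaussian fits to
    the reference and generated embedding sets."""
    mu1, mu2 = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    s1 = np.cov(emb_ref, rowvar=False)
    s2 = np.cov(emb_gen, rowvar=False)
    s1_half = sqrtm_psd(s1)
    covmean = sqrtm_psd(s1_half @ s2 @ s1_half)  # (S1 S2)^{1/2}, stably
    diff = mu1 - mu2
    return diff @ diff + np.trace(s1 + s2 - 2.0 * covmean)

rng = np.random.default_rng(0)
ref = rng.normal(size=(4000, 8))           # "reference" embeddings
gen = rng.normal(loc=0.5, size=(4000, 8))  # mean-shifted "generated" set
fad = frechet_audio_distance(ref, gen)
```

Because only the first two moments of the embedding distributions enter the formula, any artifact invisible to the embedding (or to a Gaussian fit) leaves the score unchanged, which is the structural limitation the paper targets.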
Ilya Borovik · Transactions of the International Society for Music Information Retrieval, 9(1), 144-163, 2026
Symbolic music datasets with matched scores and performances are essential for many music information retrieval (MIR) tasks. Yet, existing resources often cover a narrow range of composers, lack performance variety, omit note-level alignments, or use inconsistent naming formats. ...
Julius Richter, Yoshiki Masuyama, Christoph Boeddeker ... · arXiv
We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to br...
Amir Ivry · arXiv
Large audio language models (LALMs) are increasingly used to reason over long audio clips, yet deployment often compresses audio before inference to reduce memory and latency. The risk is that compression can leave aggregate accuracy acceptable while sharply degrading answers for...
Guanrou Yang, Tian Tan, Qian Chen ... · arXiv
Integrating speech understanding and generation is a pivotal step toward building unified speech models. However, the different representations required for these two tasks currently pose significant compatibility challenges. Typically, semantics-oriented features are learned fro...
Rixi Xu, Qingyu Liu, Haitao Li ... · arXiv
In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified represe...
Lisan Al Amin, Rakib Hossain, Mahbubul Islam ... · arXiv
Quantum machine learning has emerged as a promising tool for pattern recognition, yet many audio-focused approaches still treat spectrograms as generic images and do not explicitly exploit their time-frequency structure. We propose Q-Patch, a quantum feature map tailored to audio...
Wednesday, May 06, 2026
Cyril Allauzen, Tom Bagby, Georg Heigold ... · arXiv
The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a sing...
Rajeshwar Tripathi, Sandeep Kumar, Monika Aggarwal ... · arXiv
This study presents a bio-inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). State-of-the-art methods often fail to resolve dense low-frequency harmonic structures in vessel propulsion signals under high noise conditions, whi...
Leying Zhang, Bowen Shi, Haibin Wu ... · arXiv
The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional f...
Dongheon Lee, Ashutosh Pandey, Sanjeel Parekh ... · arXiv
While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, ...
Yangchen Yu, Qian Chen, Jia Li ... · arXiv
Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-...
Xuanhao Zhang, Chang Li · arXiv
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remai...
Yukun Chen, Tianrui Wang, Zhaoxi Mu ... · arXiv
High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly n...