Audio ML Papers

Last 7 Days (May 01 - May 08, 2026)

Subcategories: All (38) | Speech Synthesis (6) | Music Synthesis (10) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (3) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (15)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Xiaoda Yang, Majun Zhang, Changhao Pan ... · arXiv
Unified audio-visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio-video synthesis to music-dance co-generation, the task becomes substantially harder: mu...
#2 TOP PAPER (Score: 91)
Davide Marincione, Michele Mancusi, Giorgio Strano ... · arXiv
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to $\approx 70\%$ over the state...
#3 TOP PAPER (Score: 83)
Venkata Pushpak Teja Menta · arXiv
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, H...
Wednesday, May 06, 2026
Cyril Allauzen, Tom Bagby, Georg Heigold ... · arXiv
The Massive Sound Embedding Benchmark (MSEB) has emerged as a standard for evaluating the functional breadth of audio models. While initial baselines focused on specialized encoders, the shift toward "audio-native" Large Language Models (LLMs) suggests a new paradigm where a sing...
Rajeshwar Tripathi, Sandeep Kumar, Monika Aggarwal ... · arXiv
This study presents a bio-inspired signal processing framework for robust Underwater Acoustic Target Recognition (UATR). The latest state-of-the-art methods often fail to resolve dense low-frequency harmonic structures in vessel propulsion signals under high noise conditions, whi...
Leying Zhang, Bowen Shi, Haibin Wu ... · arXiv
The rapid advancement of generative audio models has outpaced the development of robust evaluation methodologies. Existing objective metrics and general multimodal large language models (MLLMs) often struggle with domain generalization, zero-shot capabilities, and instructional f...
Dongheon Lee, Ashutosh Pandey, Sanjeel Parekh ... · arXiv
While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, ...
Yangchen Yu, Qian Chen, Jia Li ... · arXiv
Multimodal emotion recognition (MER) benefits from combining text, audio, and vision, yet standard fusion often fails when modalities conflict. Crucially, conflicts differ in resolvability: benign conflicts stem from missing, weak, or ambiguous cues and can be mitigated by cross-...
Xuanhao Zhang, Chang Li · arXiv
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remai...
Yukun Chen, Tianrui Wang, Zhaoxi Mu ... · arXiv
High-quality singing annotations are fundamental to modern Singing Voice Synthesis (SVS) systems. However, obtaining these annotations at scale through manual labeling is unrealistic due to the substantial labor and musical expertise required, making automatic annotation highly n...
Tuesday, May 05, 2026
Jaavid Aktar Husain, Dorien Herremans · arXiv
Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs ...
Ragib Amin Nihal, Benjamin Yen, Runwu Shi ... · arXiv
Training data for bioacoustics is scattered across taxa, regions, and institutions. Centralizing it all is often infeasible. We show that independently fine-tuned BEATs encoders can be composed into a unified 661-species classifier via task vector arithmetic without sharing data....
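The task-vector composition this abstract describes can be sketched generically. In the task-arithmetic literature, a "task vector" is the parameter delta between a fine-tuned model and its shared pretrained base; summing scaled deltas onto the base merges tasks without exchanging training data. The function names, the uniform scaling coefficient, and the toy scalar weights below are illustrative assumptions, not the paper's exact recipe:

```python
# Generic sketch of task-vector arithmetic for composing independently
# fine-tuned encoders (illustrative; not the paper's exact procedure).

def task_vector(base_weights, finetuned_weights):
    """Per-parameter delta introduced by fine-tuning on one task."""
    return {k: finetuned_weights[k] - base_weights[k] for k in base_weights}

def merge(base_weights, task_vectors, alpha=0.5):
    """Add the scaled sum of task vectors back onto the base model."""
    merged = dict(base_weights)
    for tv in task_vectors:
        for k, delta in tv.items():
            merged[k] = merged[k] + alpha * delta
    return merged

# Toy example with a single scalar "weight":
base = {"w": 1.0}
ft_a = {"w": 3.0}   # fine-tuned on one data silo
ft_b = {"w": 0.0}   # fine-tuned on another, never co-located with the first
merged = merge(base, [task_vector(base, ft_a), task_vector(base, ft_b)])
# merged["w"] = 1.0 + 0.5*(+2.0) + 0.5*(-1.0) = 1.5
```

Because only parameter deltas are combined, each institution can keep its raw recordings private and ship weights alone.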
Zijian Zhao, Dian Jin, Zijing Zhou ... · arXiv
Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limita...
P. H. Hai, L. T. Minh, L. H. Son · arXiv
Sound Event Detection (SED) plays a vital role in audio understanding, with applications in surveillance, smart cities, healthcare, and multimedia indexing. However, conventional SED systems operate under a closed-world assumption, limiting their effectiveness in real-world envir...
Jingyao Gong · arXiv
MiniMind-O is an open 0.1B-scale omni model built on the MiniMind language model. It accepts text, speech, and image inputs, and returns both text and streaming speech. The release includes model code, checkpoints, and the main Parquet training datasets for text-to-audio, image-t...
Monday, May 04, 2026
Sandra Arcos-Holzinger, Sarah M. Erfani, James Bailey ... · arXiv
Self-supervised speech models (S3Ms) achieve strong downstream performance, yet their learned representations remain poorly understood under natural and adversarial perturbations. Prior studies rely on representation similarity or global dimensionality, offering limited visibilit...
Jim O'Regan, Jens Edlund · Odyssey 2026
Speech encodes multiple simultaneous attributes--linguistic content, speaker identity, dialect, gender--that conventional single-vector embeddings conflate. We present a factor-partitioned embedding framework that maps each utterance into a single vector whose subspaces corresp...
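A factor-partitioned embedding of the kind described above can be illustrated by reserving contiguous slices of one vector for different attributes, so each factor is read back by slicing rather than by a separate embedding. The attribute names are taken from the abstract, but the subspace sizes and total dimension below are assumptions for the example, not the paper's configuration:

```python
import numpy as np

# Illustrative factor-partitioned embedding: one vector whose contiguous
# slices are assigned to different speech attributes. Subspace boundaries
# here are assumed for the example, not taken from the paper.
PARTITION = {
    "content": (0, 128),    # linguistic content
    "speaker": (128, 192),  # speaker identity
    "dialect": (192, 224),  # dialect
    "gender":  (224, 232),  # gender
}
DIM = 232

def read_factor(embedding, factor):
    """Slice out the subspace assigned to one attribute."""
    lo, hi = PARTITION[factor]
    return embedding[lo:hi]

rng = np.random.default_rng(0)
utterance = rng.standard_normal(DIM)
speaker_part = read_factor(utterance, "speaker")   # shape (64,)
```

Comparing only the relevant slice (e.g. cosine similarity over `speaker_part`) then queries one attribute without the others interfering, which is the conflation that single undifferentiated vectors suffer from.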
Ahsan Jamal Cheema · arXiv
Vocal hyperfunction (VH) is a prevalent voice disorder whose ambulatory detection remains challenging despite extensive daily voice data. Prior approaches capture week-long neck-surface accelerometer recordings but collapse them into fixed-length subject-level feature vectors, di...
Vamshi Nallaguntla, Shruti Kshirsagar, Anderson R. Avila · arXiv
Recent advances in emotional voice conversion (EVC) have enabled the generation of expressive synthetic speech, raising new concerns in audio deepfake detection. Existing approaches treat speech as a homogeneous signal and largely overlook its internal phonetic structure, limitin...
Jiaxu He, Chao Wang, Jie Lian ... · arXiv
Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-mod...
Yaxuan Wang, Tianxin Li, Enji Liang ... · ICME 2026 Workshop
Periodic patterns are fundamental cues in multimedia signals and systems, including repetitive motion in video (e.g., gait cycles), rhythmic and pitch-related structure in audio, and recurring textures in image sequences. When such user-generated streams are collected from edge d...
Yadi Wen, Tianxin Li, Enji Liang ... · ICME 2026 Workshop
We study example-level private supervised speech classification under a practical release constraint: training may access privileged side information, but the released model must be audio-only. This setting is important because speech systems can often exploit richer side informa...
Justice Owusu Agyemang, Jerry John Kponyo, Kwame Opuni-Boachie Obour Agyekum ... · arXiv
We present the Streaming Reservoir Convergence Theorem (SRCT), a novel mathematical framework for multi-provider adaptive bitrate streaming that addresses three fundamental structural weaknesses in current systems: linear provider probing, reactive failover, and cold standby tran...
Tung Vu, Yen Nguyen, Hai Nguyen ... · arXiv
Recent advances in voice cloning and text-to-speech synthesis have made partial speech manipulation - where an adversary replaces a few words within an utterance to alter its meaning while preserving the speaker's identity - an increasingly realistic threat. Existing audio deepfa...
Sunday, May 03, 2026
Xinmeng Xu, Haoran Xie, S. Joe Qin ... · arXiv
Stage-wise audio-visual encoders propagate fused intermediate states across layers, making the formation of later representations depend on the readiness of earlier fusion states. Strong local audio-visual agreement provides useful correspondence evidence, yet a fused state also ...
Jiafeng Liu, Yuanliang Dong, Hongjia Liu ... · arXiv
A common design pattern in high-quality music generation is to handle structure and fidelity in different representation spaces: a generator first models high-level structure, followed by diffusion-based or neural decoding stages that reconstruct fine details. In this work, we ex...
Saturday, May 02, 2026
Yutong Jin, Qi Li, Lingshuang Liu ... · ACISP 2026
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the g...
Ke Qiu, Yawen Qin, Tianzhi Jia ... · arXiv
Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-m...
Yimeng Zhang, Yueru Sun, Haoyu Gu · arXiv
Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on st...
Yi-Cheng Lin, Yun-Shao Tsai, Kuan-Yu Chen ... · arXiv
Speech technologies are deployed in high-stakes settings, yet fairness concerns remain fragmented across tasks and disciplines. Existing surveys either adopt a general machine-learning perspective that overlooks speech-specific properties or focus on a single task, missing failur...
Mayesha Maliha R. Mithila, Mylene C. Q. Farias · ICIP 2026
Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA m...
Friday, May 01, 2026
Kuan-Po Huang, Bo-Ru Lu, Byeonggeun Kim ... · arXiv
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce high-latency issues. To address this bottleneck, we propose a one-step sampling framework that combines ...
Zuyao You, Zhesong Yu, Mingyu Liu ... · arXiv
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and langu...
Kazuya Tateishi, Akira Takahashi, Atsuo Hiroe ... · CVPR 2026 Sight and Sound Workshop
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straight...
Harshit Rajgarhia, Shuubham Ojha, Asif Shaik ... · ICML 2026
We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from domain exp...
Beining Wu, Zihao Ding, Jun Huang · arXiv
While current federated multimodal continual learning over mixture-of-experts low-rank adaptation (MoE-LoRA) is built on the unverified assumption that routing isolates task-specific knowledge into disjoint experts, we argue that routing operates per-sample, while forgetting accu...
Yawen Qin, Ke Qiu, Qin Zhang · arXiv
Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for users to search with choreographic inte...
Ziyi Yang, Zhengding Luo, Yisong Zou ... · INTER-NOISE 2026
To address the limitations of existing Generative Fixed-Filter Active Noise Control (GFANC) methods, which rely on filter decomposition and recombination and require supervised learning with labeled data, this paper proposes a Transformer-based End-to-End Control-Filter Generatio...