Audio ML Papers

Last 7 Days (May 30 - June 06, 2026)

Subcategories: All (24) | Speech Synthesis (3) | Music Synthesis (3) | Ambient Synthesis (1) | Quality Evaluation (0) | Enhancement (2) | ASR (2) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (11)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 93)
Ziyang Ma, Ruiqi Yan, Ruiyang Xu ... · arXiv
We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visu...
#2 TOP PAPER (Score: 91)
Samiul Alam, Shakhrul Iman Siam, Michael J. Proulx ... · arXiv
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over ...
#3 TOP PAPER (Score: 88)
Chen Yang, Chufan Yu, Hanfu Chen ... · arXiv
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality...
Tuesday, June 02, 2026
Marco Pasini, Javier Nistal, Mathias Rose Bjare ... · arXiv
We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using ad...
Monday, June 01, 2026
Wenze Ren, Ke-Han Lu, Kai-Wei Chang ... · arXiv
Deep learning has advanced pathological voice detection rapidly, yet rare laryngeal diseases remain underexplored due to data scarcity. Recurrent Respiratory Papillomatosis (RRP) exemplifies this gap: an HPV-induced disease of the larynx in which patients oscillate between recurr...
Yufei Shi, Qian Chen, Wen Wang ... · ACL 2026
We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis - a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-...
Jiashuo Yu, Yao Yao, Boyu Chen ... · arXiv
We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular a...
Ding Ma, Jinyi Mi, Fengji Li ... · IEEE Transactions on Biomedical Engineering, Early Access, 2026
Objective: Laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligi...
Ziqi Ma, Mengyu Han, Anteng Cai ... · arXiv
Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional...
Amirmohammad Mohammadi, Joshua Peeples, Alexandra Van Dine · arXiv
Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this doma...
Yuhang Dai, Haopeng Lin, Zhennan Lin ... · arXiv
Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-tak...
Hanlin Zhang, Daxin Tan, Dehua Tao ... · arXiv
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing ben...
Nishchay Nilabh, Neeraj Kumar Sharma · arXiv
Speakers in dialogue continuously adapt their communicative behavior across acoustic, lexical, and semantic dimensions, a phenomenon known as conversational entrainment. Modeling this process requires representations that capture the global structure of interaction, yet prior app...
SooHwan Eom, Mark Hasegawa-Johnson, and Chang D. Yoo · Interspeech 2025
Self-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context in...
Seonghyeon Go, Yumin Kim · arXiv
As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse ...
Jagabandhu Mishra, Tomi H. Kinnunen · arXiv
Kinship verification (KV) from voice, the task of determining whether two speakers are biologically related, has received little attention. Our work establishes a foundational basis for this emerging frontier, contributing to both performance evaluation and detection methodo...
Christian H. Kasess, Wolfgang Kreuzer, Holger Waubke · arXiv
The localization of moving sound sources using a microphone array is typically based on modifying the signal to compensate for the Doppler effect. In the time domain this compensation is done on a sample-by-sample basis. In the frequency domain short time segments need to be used...
Matthew Maciejewski, Samuele Cornell · arXiv
Speech denoising is an often necessary step not only for human listening, but also for downstream processing by systems lacking robustness to noisy, real-world acoustic conditions. Unfortunately, denoising is a problem where conventional in-domain supervised training is not trivi...
Sunday, May 31, 2026
Théo Charlot, Tarek Kunze, Kaveri K. Sheth ... · arXiv
Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address thes...
Augusto Camargo, Marcelo Finger · arXiv
Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally hetero...
Michael Taenzer · IEEE 28th International Workshop on Multimedia Signal Processing (MMSP)
Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered...
Saturday, May 30, 2026
Sukru Samet Dindar, Riki Shimizu, Xilin Jiang ... · arXiv
Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inf...
Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen ... · arXiv
We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the densit...
Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams ... · arXiv
Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP),...