Audio ML Papers

Last 7 Days (May 26 - June 02, 2026)

Subcategories: All (52) | Speech Synthesis (11) | Music Synthesis (9) | Ambient Synthesis (3) | Quality Evaluation (0) | Enhancement (4) | ASR (3) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (21)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 92)
Zhaoqing Li, Haoning Xu, Jingran Su ... · arXiv
We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. A single model handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing...
#2 TOP PAPER (Score: 88)
Chen Yang, Chufan Yu, Hanfu Chen ... · arXiv
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality...
#3 TOP PAPER (Score: 88)
Yufei Shi, Qian Chen, Wen Wang ... · ACL 2026
We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis - a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-...
Monday, June 01, 2026
Wenze Ren, Ke-Han Lu, Kai-Wei Chang ... · arXiv
Deep learning has advanced pathological voice detection rapidly, yet rare laryngeal diseases remain underexplored due to data scarcity. Recurrent Respiratory Papillomatosis (RRP) exemplifies this gap: an HPV-induced disease of the larynx in which patients oscillate between recurr...
Jiashuo Yu, Yao Yao, Boyu Chen ... · arXiv
We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular a...
Ding Ma, Jinyi Mi, Fengji Li ... · IEEE Transactions on Biomedical Engineering, Early Access, 2026
Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligi...
Ziqi Ma, Mengyu Han, Anteng Cai ... · arXiv
Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional...
Amirmohammad Mohammadi, Joshua Peeples, Alexandra Van Dine · arXiv
Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this doma...
Yuhang Dai, Haopeng Lin, Zhennan Lin ... · arXiv
Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-tak...
Hanlin Zhang, Daxin Tan, Dehua Tao ... · arXiv
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing ben...
Nishchay Nilabh, Neeraj Kumar Sharma · arXiv
Speakers in dialogue continuously adapt their communicative behavior across acoustic, lexical, and semantic dimensions, a phenomenon known as conversational entrainment. Modeling this process requires representations that capture the global structure of interaction, yet prior app...
SooHwan Eom, Mark Hasegawa-Johnson, and Chang D. Yoo · Interspeech 2025
Self-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context in...
Seonghyeon Go, Yumin Kim · arXiv
As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse ...
Jagabandhu Mishra, Tomi H. Kinnunen · arXiv
Kinship verification (KV) from voice, the task of determining whether two speakers are biologically related, has received little attention. Our work establishes a foundational basis for this emerging frontier, contributing to both performance evaluation and detection methodo...
Christian H. Kasess, Wolfgang Kreuzer, Holger Waubke · arXiv
The localization of moving sound sources using a microphone array is typically based on modifying the signal to compensate for the Doppler effect. In the time domain this compensation is done on a sample-by-sample basis. In the frequency domain short time segments need to be used...
Matthew Maciejewski, Samuele Cornell · arXiv
Speech denoising is an often necessary step not only for human listening, but also for downstream processing by systems lacking robustness to noisy, real-world acoustic conditions. Unfortunately, denoising is a problem where conventional in-domain supervised training is not trivi...
Sunday, May 31, 2026
Théo Charlot, Tarek Kunze, Kaveri K. Sheth ... · arXiv
Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address thes...
Augusto Camargo, Marcelo Finger · arXiv
Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally hetero...
Michael Taenzer · IEEE 28th International Workshop on Multimedia Signal Processing (MMSP)
Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered...
Saturday, May 30, 2026
Sukru Samet Dindar, Riki Shimizu, Xilin Jiang ... · arXiv
Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inf...
Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen ... · arXiv
We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the densit...
Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams ... · arXiv
Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP),...
Friday, May 29, 2026
Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas ... · EUSIPCO 2026 (34th European Signal Processing Conference)
Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of th...
Guangyin Bao, Taiping Zeng, Jianfeng Feng ... · arXiv
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive ...
Kevin Everson, Mari Ostendorf · ACL 2026
Speech representations that capture prosodic information can be useful for both understanding and generation. However, speaker characteristics are reflected in acoustic-prosodic features (e.g., pitch). To address privacy concerns from the leakage of identity information, we propo...
Thursday, May 28, 2026
Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu ... · arXiv
Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies ...
Tiantian Feng, Anfeng Xu, Xuan Shi ... · arXiv
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, ...
Yonggang Zhu, Liting Gao, Aidong Men ... · arXiv
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Exis...
Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha ... · arXiv
While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper an...
Pedro H. L. Leite, Pedro Benevenuto Valadares, Luiz W. P. Biscainho · XLIV Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2026)
Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reli...
Bohan Li, Shi Lian, Hankun Wang ... · arXiv
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architect...
Sung-Lin Yeh, Wei Zhou, Gil Keren ... · arXiv
Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce ...
Heejoon Koo, Yoon Tae Kim, Miika Toikkanen ... · arXiv
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced de...
Zhaoyan Pan, Xiangdong Li, Wenke Wu ... · arXiv
Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonver...
Xiangyu Zhang, Yuxin Li, Haoyang Zhang ... · arXiv
The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the definitive proxy for representation quality....
S. Sutharya, Remya K. Sasi · arXiv
Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguis...
Wallace Abreu, Luiz W. P. Biscainho · arXiv
Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMa...
Wednesday, May 27, 2026
Jiahao Mei, Heinrich Dinkel, Yadong Niu ... · arXiv
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audi...
Jashin Ye, Dongxiao Wang, Yixuan Ye ... · arXiv
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenate...
Changhao Pan, Rui Yang, Han Wang ... · ACL 2026 (Findings)
Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test...
Chong Jing, Zitong Lan, Junan Zhang ... · arXiv
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR pr...
Zhisheng Zhang, Xiang Li, Yixuan Zhou ... · arXiv
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which incr...
Yuyue Wang, Xihua Wang, Xin Cheng ... · arXiv
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs an...
Ryan Fosdick · arXiv
We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoisin...
Yucheng Wang, Jing Peng, Hanqi Li ... · arXiv
Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence ac...
Tuesday, May 26, 2026
Jingping Fang, Lin Chen, Chenyang Xu ... · arXiv
Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditi...
Jiacheng Pang, Ashutosh Chaubey, Mohammad Soleymani · ICML 2026
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 v...
Xiao-Hang Jiang, Yang Ai, Hui-Peng Du ... · IEEE Transactions on Audio, Speech and Language Processing
High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neu...
Xudong Lu, Xueying Li, Annan Wang ... · arXiv
We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual s...
Bowen Li, Shaotong Guo, Zhen Wang ... · arXiv
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregr...
Yang Xiao, Siyi Wang, Han Yin ... · arXiv
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, ...
Abhinaba Roy, Junyi Liang, Dorien Herremans · arXiv
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework fo...