Audio ML Papers

Last 7 Days (June 10 - June 17, 2026)

Subcategories: All (72) | Speech Synthesis (14) | Music Synthesis (8) | Ambient Synthesis (0) | Quality Evaluation (0) | Enhancement (4) | ASR (8) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (35)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 88)
Junlong Tong, Wenqi Xu, Yingqi Fan ... · arXiv
Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a con...
#2 TOP PAPER (Score: 84)
Yonghyun Kim, Junwon Lee, Haiwen Xia ... · arXiv
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) vote...
#3 TOP PAPER (Score: 84)
Salman Hussain Ali, Umberto Cappellazzo, Mirco Ravanelli · Interspeech 2026
Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Mea...
Tuesday, June 16, 2026
Alexander Polok, Samuele Cornell, Sathvik Udupa ... · Interspeech 2026
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization ...
Monday, June 15, 2026
Hyung Kyu Kim, Byungchan Hwang, Hak Gu Kim · Interspeech 2026
Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose ArtBoost, a novel data augmentation strategy that leverages large-scale speech-mesh da...
Zeqian Hu, Fuliang Weng, Shu Shang ... · Interspeech 2026
Zero-shot cross-lingual phoneme recognition is often hindered by the fragility of direct acoustic-to-symbol mapping, which is susceptible to language-specific variations. Echoing joint-embedding predictive architecture (JEPA) work in vision, we propose ArtNet, a framework that ex...
Yan Han, Zhibin Wen, Yuan Wang ... · arXiv
The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sof...
Chengxi Deng, Xurong Xie, Shujie Hu ... · Interspeech 2026
This paper proposes a novel confidence score guided incremental and speaker adaptive pseudo-labeling approach for semi-supervised elderly speech recognition. It facilitates higher-quality pseudo-label selection and progressive refinement, while also mitigating speaker heterogenei...
Dong Yang, Yuki Saito, Wataru Nakata ... · arXiv
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level ...
Chengxi Deng, Xurong Xie, Shujie Hu ... · Interspeech 2026
This paper proposes a novel cross-utterance audio-textual prompts based speaker adaptation approach for elderly speech recognition. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current and a few preceding utteran...
Zhuodong Liu, Hugen Lv, Xiangyu Li ... · Interspeech 2026
Audio deepfake detectors often fail to generalize across speakers, as they learn speaker-identity features rather than synthesis artifacts, known as implicit identity leakage. Existing methods address this but incur architectural complexity or training instability. This paper pro...
Ruchi Pandey, Jaime Garcia-Martinez, Pablo Cabanas-Molero ... · arXiv
Microphone bleed is a persistent challenge in small ensembles and orchestral recordings, where close microphones intended for individual instruments also capture leakage from nearby sources. This overlap degrades track isolation and complicates mixing. This paper addresses the bl...
Haotian Qi, Gabriel Skantze · arXiv
Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic ...
Alex Gichamba, Moise Busogi · Interspeech 2026
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate d...
Haixin Zhao, Nilesh Madhu · Interspeech 2026
This work investigates modelling strategies in continuous and discrete latent spaces in the vector quantisation (VQ)-based neural audio codec (NAC) speech enhancement (SE), along with the role of VQ regularisation. We propose cNAC-SE and dNAC-SE frameworks that predict continuous...
Haocheng Dong, Yuheng Lu, Cheng Gong ... · arXiv
With the growing focus on audio in multimedia applications, numerous advanced works on audio generation have emerged. Existing studies typically treat text-to-audio (TTA) and other related audio generation tasks, such as instruction-based audio editing, as independent challenges,...
Xintong Wang, Ye Wang · arXiv
Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches ...
Zhiqi Ai, Han Cheng, Shiyi Mu ... · Interspeech 2026
Short-duration speaker verification (SDSV) is crucial for personalized keyword spotting, where test utterances are typically shorter than three seconds. Limited speech duration results in unstable speaker representations and increased sensitivity to noise and phoneme variations, ...
Sunday, June 14, 2026
Hyebin Cho, Jaehyuk Jang, Changick Kim ... · INTERSPEECH 2026
Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance focus on learning optimal text prompts. However, previous approaches focus on the text encoder, lea...
Dabin Kim, Junwon Lee, Juhan Nam · Interspeech 2026
This paper addresses timbral ambiguity in instrument timbre transfer under fine-grained structural conditions. We argue this issue stems from instrument-specific expressive details in these conditions, which conflict with the target timbral properties. For example, imposing a vio...
Pengfei Zhang, Hoang H Nguyen, Yutong Song ... · arXiv
Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings...
Orchid Chetia Phukan, Girish, Mohd Mujtaba Akhtar ... · IJCAI-ECAI 2026
Codecfakes (CFs) are a type of speech deepfakes generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, caus...
Changda Chen, Yichen Yang, Wei Liu ... · Interspeech 2026
This paper proposes a geometrically constrained decentralized independent vector analysis (GC-Dec-IVA) method for distributed microphone arrays. The recently proposed Dec-IVA method enables source separation by exchanging only power-related statistics to exploit cross-array informati...
Jialong Mai, Jinxin Ji, Xiaofen Xing ... · arXiv
Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears ...
Hangling Xie · arXiv
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently sub...
Saturday, June 13, 2026
Liming Wang, Cody Karjadi, Rhoda Au ... · arXiv
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held ...
Zhongyuan Fu · arXiv
We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new pro...
Zhenwei Mou, Weili Jiang, Liping Chen ... · INTERSPEECH 2026
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking...
Zhenwei Mou, Liping Chen, Yajun Hu ... · INTERSPEECH 2026
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient ...
Yuxuan Jiang, Mingyang Han, Yusheng Dai ... · Interspeech 2026
Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. However, existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a train...
Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen · arXiv
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phonem...
Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee · arXiv
The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however,...
Farnaz Sedaghati, Yuxi Wang, Zicheng Weng ... · Interspeech 2026
With the rapid deployment of speech generation systems in open environments, providing verifiable source attribution and copyright accountability for audio content has become critical. A gap in current research is the lack of a unified benchmark that systematically compares diffe...
Yu Liu, Zhiwei Yang, Wenxiao Zhang ... · arXiv
A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures wh...
Siyuan Zhang, Jian Zong, Junyu Wang ... · Interspeech 2026
While LALMs show promise on audio question answering, they fail to focus on question-relevant segments of audio and provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better r...
Friday, June 12, 2026
Hui Geng, Yi Su, Han Yin ... · arXiv
Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quali...
Youjun Chen, Xurong Xie, Mengzhe Geng ... · Interspeech 2026
Explainable and trustworthy speech emotion recognition (SER) remains a challenging task to date, largely due to the scarcity of SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits. This paper presents a confidence score and ...
Shiyao Wang, Xijuan Zeng, Hui Wang ... · INTERSPEECH 2026
We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control bu...
Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang ... · arXiv
Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing sev...
Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida ... · arXiv
Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage...
Yuxuan Chen, Haoyuan Yu, Peize He · INTERSPEECH 2026
Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference to evaluate thi...
Minjae Lee, Hee-Soo Heo, Youngki Kwon ... · arXiv
We present target speaker tagging (TST), a task that integrates speaker diarization, verification, and identification into a unified workflow for multi-speaker conversations. Given long recordings and pre-enrolled speakers, TST detects and labels speech segments of known speakers...
Hugo Daumain, Driss Matrouf, Khaled Khelif ... · Odyssey 2026 (The Speaker and Language Recognition Workshop)
Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we tr...
Alef Iury Siqueira Ferreira, Lucas Rafael Stefanel Gris, Luiz Fernando de Araújo Vidal ... · arXiv
Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discret...
Ayoub Elkhouzari, Youssef Iraqi, Loubna Mekouar · arXiv
We present MaskedFOP, a system for closed-set polyglot speaker identification under two simultaneous challenges: the face modality is entirely absent at test time, and speech comes from Urdu, a language unseen during face-supervised training. The system integrates three complemen...
Chen Ying Claude, Zhihan Luo · arXiv
We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, disson...
Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski · ICML 2026 Workshop on Machine Learning for Audio
This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to dec...
Thursday, June 11, 2026
Soumyajit Mitra, Prabhat Pandey, Abhinav Jain ... · Interspeech 2026
Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explic...
Naijun Zheng, Yuke Lin, Sanli Tian ... · Interspeech 2026
Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require larg...
Yihang Lin, Li Zhou, Congwei Cao ... · IJCAI 2026
Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity contro...
Sathvik Udupa, Shinji Watanabe, Petr Schwarz ... · Interspeech 2026
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based...
Fengrui Liu, Ruiyang Huang, Qijian Zheng ... · ACM ICMR 2026
Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural syn...
Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo ... · Proceedings of IEEE ICASSP 2026
Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly ...
Xiang Li, Yixuan Zhou, Jingran Xie ... · ICML 2026
Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream ...
Nithya Shikarpur, Victor Arul, Anna Huang · NIME music track 2026
Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone into motion that, throughout the performance, gains inc...
Wednesday, June 10, 2026
Zeyue Tian, Lei Ke, Zhaoyang Liu ... · arXiv
Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step dif...
Damien Martins Gomes, François Capman · arXiv
Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-...
Jun Xu, Zhengxue Cheng, Fengxi Zhang ... · arXiv
Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning f...
Eungbeom Kim, Kyogu Lee · Interspeech 2026
Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of a student model. However, the training efficiency of SFM ...
Haiyun Li, Shuhai Peng, Zhisheng Zhang ... · ICME 2026
Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction mod...
Hemansh Shridhar, Miika Toikkanen, June-Woo Kim · Interspeech 2026
Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce s...
Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang ... · arXiv
Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original vid...
Bowen Zheng, Andrew H. Yang, Jiaqi Ruan ... · RTAS 2026
Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synth...
Peijie Chen, Wenhao Guan, Weijie Wu ... · Interspeech 2026
Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens...
Abhirup Saha, Hans-Ulrich Berendes, Meinard Müller ... · International Computer Music Conference (ICMC) 2026
Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level sc...
Shota Horiguchi, Marc Delcroix, Naohiro Tawara ... · Interspeech 2026 (Long Paper Track)
Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on s...
Zhen Ye, Xu Tan, Yiming Li ... · Interspeech 2026 long paper
Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text und...
Peng Jia, Li Dai, Jia Li ... · arXiv
Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model s...
Yoon Tae Kim, Heejoon Koo, Miika Toikkanen ... · Interspeech 2026
We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and ro...
Qingfeng Zhang, Yuanxiong Guo, Yanmin Gong · IEEE International Conference on Healthcare Informatics, 2026
Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents Neur...
Purnima Kamath, Adrian S. Roman, Koichi Saito ... · Interspeech 2026
Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spa...
Haoning Xu, Zhaoqing Li, Huimeng Wang ... · Interspeech 2026
This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. More fine-grained, mixed sparsity pruning by layer-level varying number of parameter clusters is also explored. Experiments conducte...