Audio ML Papers

Week of June 07 - June 14, 2026

Subcategories: All (105) | Speech Synthesis (20) | Music Synthesis (9) | Ambient Synthesis (2) | Quality Evaluation (0) | Enhancement (6) | ASR (14) | LLM Audio (0) | MIDI Generation (0) | Generative Conditioning (0) | Other (49)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 91)
Wenhao Guan, Yifan Duan, Junxi Liu ... · arXiv
Video dubbing is a cornerstone of multimedia content creation, aiming to synthesize synchronized acoustic sequences for visual streams. While Text-to-Speech (TTS) and Text-to-Audio (TTA) generation have each achieved remarkable progress, existing dubbing systems remain confined t...
#2 TOP PAPER (Score: 88)
Junlong Tong, Wenqi Xu, Yingqi Fan ... · arXiv
Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a con...
#3 TOP PAPER (Score: 85)
Nikita Koriagin, Georgii Aparin, Nikita Balagansky ... · arXiv
Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of Cos...
Saturday, June 13, 2026
Liming Wang, Cody Karjadi, Rhoda Au ... · arXiv
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held ...
Zhongyuan Fu · arXiv
We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new pro...
Zhenwei Mou, Weili Jiang, Liping Chen ... · INTERSPEECH 2026
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking...
Zhenwei Mou, Liping Chen, Yajun Hu ... · INTERSPEECH 2026
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient ...
Yuxuan Jiang, Mingyang Han, Yusheng Dai ... · Interspeech 2026
Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. However, existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a train...
Manasi Chhibber, Jagabandhu Mishra, Tomi H. Kinnunen · arXiv
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phonem...
Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee · arXiv
The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however,...
Farnaz Sedaghati, Yuxi Wang, Zicheng Weng ... · Interspeech 2026
With the rapid deployment of speech generation systems in open environments, providing verifiable source attribution and copyright accountability for audio content has become critical. A gap in current research is the lack of a unified benchmark that systematically compares diffe...
Yu Liu, Zhiwei Yang, Wenxiao Zhang ... · arXiv
A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures wh...
Siyuan Zhang, Jian Zong, Junyu Wang ... · Interspeech 2026
While LALMs show promise on audio question answering, they fail to focus on question-relevant segments of audio and provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better r...
Friday, June 12, 2026
Hui Geng, Yi Su, Han Yin ... · arXiv
Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quali...
Youjun Chen, Xurong Xie, Mengzhe Geng ... · Interspeech 2026
Explainable and trustworthy speech emotion recognition (SER) remains a challenging task to date, largely due to the scarcity of SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits. This paper presents a confidence score and ...
Shiyao Wang, Xijuan Zeng, Hui Wang ... · INTERSPEECH 2026
We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control bu...
Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang ... · arXiv
Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing sev...
Oh Hyun-Bin, Kazuki Shimada, Yuhta Takida ... · arXiv
Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage...
Yuxuan Chen, Haoyuan Yu, Peize He · INTERSPEECH 2026
Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference to evaluate thi...
Minjae Lee, Hee-Soo Heo, Youngki Kwon ... · arXiv
We present target speaker tagging (TST), a task that integrates speaker diarization, verification, and identification into a unified workflow for multi-speaker conversations. Given long recordings and pre-enrolled speakers, TST detects and labels speech segments of known speakers...
Hugo Daumain, Driss Matrouf, Khaled Khelif ... · Odyssey 2026 (The Speaker and Language Recognition Workshop)
Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we tr...
Alef Iury Siqueira Ferreira, Lucas Rafael Stefanel Gris, Luiz Fernando de Araújo Vidal ... · arXiv
Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discret...
Ayoub Elkhouzari, Youssef Iraqi, Loubna Mekouar · arXiv
We present MaskedFOP, a system for closed-set polyglot speaker identification under two simultaneous challenges: the face modality is entirely absent at test time, and speech comes from Urdu, a language unseen during face-supervised training. The system integrates three complemen...
Chen Ying Claude, Zhihan Luo · arXiv
We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures, not by analogy but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, disson...
Piotr Kitłowski, Dominik Wiącek, Mateusz Modrzejewski · ICML 2026 Workshop on Machine Learning for Audio
This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to dec...
Thursday, June 11, 2026
Soumyajit Mitra, Prabhat Pandey, Abhinav Jain ... · Interspeech 2026
Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explic...
Naijun Zheng, Yuke Lin, Sanli Tian ... · Interspeech 2026
Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require larg...
Yihang Lin, Li Zhou, Congwei Cao ... · IJCAI 2026
Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity contro...
Sathvik Udupa, Shinji Watanabe, Petr Schwarz ... · Interspeech 2026
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based...
Fengrui Liu, Ruiyang Huang, Qijian Zheng ... · ACM ICMR 2026
Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural syn...
Tomohiro Nakatani, Rintaro Ikeshita, Naoyuki Kamo ... · Proceedings of IEEE ICASSP 2026
Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly ...
Xiang Li, Yixuan Zhou, Jingran Xie ... · ICML 2026
Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream ...
Nithya Shikarpur, Victor Arul, Anna Huang · NIME music track 2026
Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone into motion that, throughout the performance, gains inc...
Wednesday, June 10, 2026
Zeyue Tian, Lei Ke, Zhaoyang Liu ... · arXiv
Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step dif...
Damien Martins Gomes, François Capman · arXiv
Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-...
Jun Xu, Zhengxue Cheng, Fengxi Zhang ... · arXiv
Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning f...
Eungbeom Kim, Kyogu Lee · Interspeech 2026
Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires an additional student model training. However, the training efficiency of SFM ...
Haiyun Li, Shuhai Peng, Zhisheng Zhang ... · ICME 2026
Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction mod...
Hemansh Shridhar, Miika Toikkanen, June-Woo Kim · Interspeech 2026
Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce s...
Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang ... · arXiv
Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original vid...
Bowen Zheng, Andrew H. Yang, Jiaqi Ruan ... · RTAS 2026
Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synth...
Peijie Chen, Wenhao Guan, Weijie Wu ... · Interspeech 2026
Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens...
Abhirup Saha, Hans-Ulrich Berendes, Meinard Müller ... · International Computer Music Conference (ICMC) 2026
Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level sc...
Shota Horiguchi, Marc Delcroix, Naohiro Tawara ... · Interspeech 2026 (Long Paper Track)
Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on s...
Zhen Ye, Xu Tan, Yiming Li ... · Interspeech 2026 long paper
Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text und...
Peng Jia, Li Dai, Jia Li ... · arXiv
Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model s...
Yoon Tae Kim, Heejoon Koo, Miika Toikkanen ... · Interspeech 2026
We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and ro...
Qingfeng Zhang, Yuanxiong Guo, Yanmin Gong · IEEE International Conference on Healthcare Informatics, 2026
Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents Neur...
Purnima Kamath, Adrian S. Roman, Koichi Saito ... · Interspeech 2026
Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spa...
Haoning Xu, Zhaoqing Li, Huimeng Wang ... · Interspeech 2026
This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. More fine-grained, mixed sparsity pruning by layer-level varying number of parameter clusters is also explored. Experiments conducte...
Tuesday, June 09, 2026
Chengbin Liang, Wenqi Guo, Hao Cao ... · Interspeech 2026
Neural speech codecs enable low-bitrate speech communication, yet at ultra-low bitrates (< 1000 bps) preserving perceptual quality and intelligibility is challenging. Existing designs often prioritize acoustic details, leaving limited capacity for the core linguistic message unde...
Tomoya Tanabu, Hiroshi Nishijima, Daisuke Saito ... · Interspeech 2026
We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear tra...
Cristian-Teodor Neamtu, Serban Mihalache, Stefan Smeu ... · 34th European Signal Processing Conference (EUSIPCO 2026)
The proliferation of text-to-speech (TTS) systems capable of generating realistic synthetic speech poses growing challenges for audio forensics. While binary deepfake detection has received considerable attention, source tracing (i.e., identifying which TTS system produced a give...
Mohan Shi, Kaiyuan Zhang, Zilai Wang ... · Interspeech 2026
While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously....
Natarajan Balaji Shankar, Zilai Wang, Kaiyuan Zhang ... · Interspeech 2026
Transformer-based Speech Foundation Models excel in most Automatic Speech Recognition tasks but often suffer performance degradation when applied to domains with mismatched acoustic characteristics. While Parameter Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation...
Zilai Wang, Natarajan Balaji Shankar, Mohan Shi ... · Interspeech 2026
Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables sel...
Jin Li, Wenbin Jiang, Ji Hu · Interspeech 2026
User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal f...
Xueping Zhang, Han Yin, Yang Xiao ... · ICME 2026 workshop
The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. After the chall...
Jakob Poncelet, Hugo Van hamme · EUSIPCO 2026
Recent research has explored integrating Large Language Models (LLMs) with speech encoders to create speech-augmented LLMs capable of contextualized speech recognition. The main challenge lies in aligning the semantic embeddings of LLMs with the acoustic representations of speech...
Hongyu Jin, Siyi Wang, Yang Xiao ... · arXiv
Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current ev...
Vojtěch Staněk, Anton Firc, Jakub Reš ... · Interspeech 2026
We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves d...
Zhiyuan Zhu, Yixuan Chen, Yiwen Shao ... · arXiv
Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that...
Jakob Poncelet, Hugo Van hamme · Interspeech 2026
Speech-aware large language models (LLMs) can incorporate speech through pre-trained acoustic encoders that project speech features into the LLM embedding space. While the choice of the speech encoder critically influences performance, different encoders often exhibit complementa...
Tsung-En Lin, Hung-Yi Lee · arXiv
Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while k...
Simen Hexeberg, Fanghui Tong, Hari Vishnu ... · arXiv
Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization d...
Jakob Poncelet, Hugo Van hamme · Interspeech 2026
Speech recognition often fails on rare, domain-specific terms and context-related named entities. Existing contextualization techniques typically bias decoding with keywords or phrase lists, which does not scale well or exploit deeper knowledge. We propose a training method that ...
Khanh Le, Kiet Anh Hoang, Bao Nguyen ... · arXiv
We present ViP-VL, an efficient Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning. To bridge the gap between high-resolution audio and efficient processing, ViP-VL incorporates Acoustic Stacking and Receptive Field Alignment to enable a s...
Vojtěch Staněk, Veronika Jirmusová, Anton Firc ... · Interspeech 2026
Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supe...
Leonor Barreiros, Raul Monteiro, Afonso Mendes ... · Interspeech 2026
Automatic speech recognition systems have been shown to under-perform when it comes to transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. H...
Danel Slabbert, Simon Malan, Herman Kamper · arXiv
Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach, K-means, produces a more...
Youcef Soufiane Gheffari, Samiya Silarbi · arXiv
Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diver...
Xuanchen Li, Tianrui Wang, Yuheng Lu ... · arXiv
Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we pro...
Guodong Lin, Ziqi Chen, Yuxiang Fu ... · ICASSP (2026), 18807-18811
The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting ...
Monday, June 08, 2026
Dohwan Kim, Jung-Woo Choi · Interspeech 2026
While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velo...
Agneedh Basu, Pavan Kumar J, Sujith P ... · arXiv
Spoken language identification (LID) for Indian languages is a challenging problem due to the large number of languages, significant phonetic overlap among related varieties, and the scarcity of labeled data for many low-resource languages. In this work, we present a systematic c...
Zhuoyan Tao, Jiatong Shi, Hye-jin Shim ... · Interspeech 2026
While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulati...
Shiyu Li, Zhiyuan Hu, Yifan Wang ... · arXiv
Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Cona...
Eder del Blanco, David Gimeno-Gómez, Eva Navas ... · arXiv
Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide com...
Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang ... · IEEE Signal Processing Letters
Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives d...
Changfeng Gao, Yong Ren, Jun Yuan ... · arXiv
Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this pap...
Hanke Xie, Xiaming Ren, Dake Guo ... · Interspeech 2026
Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-...
Yuxuan Chen, Haoyuan Xu, Peize He · arXiv
Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioni...
Ashley R. Keaton, Zahra Khanjani, Christine Mallinson ... · arXiv
Maliciously-created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet, most detection models are trained to make inference on frame-level audio features alone without leveragin...
Guobin Ma, Yuxuan Xia, Yuepeng Jiang ... · Interspeech 2026
Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the...
George Theodosiou, Loukas Ilias, Dimitris Askounis · arXiv
Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a pro...
Steven Vander Eeckt, Hugo Van hamme · Interspeech 2026
Speech foundation models enable strong general-purpose ASR and are attractive for downstream adaptation. However, their size and the catastrophic forgetting induced by sequential fine-tuning demand parameter-efficient and regularized training methods, motivating parameter-efficie...
Yijie Li, Jiahao Xu, Ching-Chih Tsao ... · arXiv
Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs ne...
Björn Þór Jónsson, Çağrı Erdem, Stefano Fasciani ... · This is an extended version of the previously published conference paper "Towards Sound Innovation Engines Using Pattern-Producing Networks and Audio Graphs": https://doi.org/10.1007/978-3-031-56992-0_14
This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for ...
Shakhrul Iman Siam, Tiantian Feng, Jiankun Zhang ... · ACL 2026 Main Conference
Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approa...
Zhu Li, Shekhar Nayak, Matt Coler · Interspeech 2026
Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains chall...
Sina Khanagha, Timo Gerkmann · Interspeech 2026
In this work, we analyze the ability of NCSN++ U-Net based audio dereverberation models to capture global room characteristics in their intermediate representations. Through an empirical study of both a state-of-the-art diffusion-based model and a discriminative counterpart, we s...
Wei Fan, Chao-Hong Tan, Qian Chen ... · arXiv
Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthes...
Awais Khan, Kutub Uddin, Khalid Malik · arXiv
Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-...
Yejin Lee, Junwon Moon, Hyoeun Kim ... · arXiv
Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: spe...
Ruchao Fan, Yiming Wang, Yuxuan Hu ... · arXiv
Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantically rich representations consumable by an LLM. In this work, we instead ask: can an LLM learn to read Mel spectrograms directly without a dedicated speech enco...
Thomas Rolland, Carlos Carvalho, Alberto Abad · arXiv
Transformer-based architectures have led to significant improvements in Automatic Speech Recognition (ASR), often at the cost of substantially increased model sizes. A promising approach to address this issue is layer sharing through depth recursion, commonly referred to as the R...
Sunday, June 07, 2026
Xiangyu Zhao, Junyu Yan, Yaling Shen ... · arXiv
Large audio-language models (LALMs) increasingly use explicit reasoning traces for complex audio understanding, yet the evaluation of reasoning quality remains underexplored. Although process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in ...
Matteo Spanio, Mohammad Torabi, Andrea Poltronieri ... · Ital-IA 2026
Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs...
Yike Zhu, Ziqian Wang, Zikai Liu ... · Interspeech 2026
Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framewor...
Joonyong Park, Jungwoo Kim, Junyoung Koh ... · ICML 2026 ML4Audio workshop
AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that r...
Haoyu Zhang, Yuta Oshima, Xingjian Du ... · arXiv
Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignmen...
Moshe Mandel, Shlomo E. Chazan · arXiv
We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic input...
Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier ... · arXiv
Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be...
Vinh-Thuan Ly · Interspeech 2026
Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruc...
Hayato Komaba, Gen Sato, Ken Kurata ... · arXiv
Numerous machine learning-based sound field interpolation methods have been proposed. In particular, physics-informed neural networks (PINNs) can accurately interpolate sound fields from a small number of microphones. However, their high computational cost and long training time ...