Video dubbing is a cornerstone of multimedia content creation, aiming to synthesize synchronized acoustic sequences for visual streams. While Text-to-Speech (TTS) and Text-to-Audio (TTA) generation have each achieved remarkable progress, existing dubbing systems remain confined to isolated speech synthesis without incorporating sound effects and ambient audio, forcing practitioners to rely on fragmented workflows and laborious manual post-mixing. To address this limitation, we present HoliDubber, a holistic video dubbing framework that moves beyond speech-only generation by enabling the joint synthesis of speech and sound effects from a single text prompt. Specifically, HoliDubber adopts a patch-based autoregressive diffusion transformer architecture, where a causal language model autoregressively models aggregated patch embeddings to capture global temporal structure, and a Diffusion Transformer decoder generates high-fidelity continuous tokens within each patch, following a divide-and-conquer strategy. To achieve cross-modal alignment, visual features are encoded into patch-level representations and fused with audio patches via cross-attention, enabling the model to ground speech generation in the speaker's visual articulation dynamics. In addition, we introduce HoliDub-Bench, a benchmark curated from established datasets with synchronized video-text-audio triplets designed for holistic dubbing evaluation. Extensive experiments demonstrate that HoliDubber significantly outperforms existing methods across multiple benchmarks in speech quality, synchronization, and speaker similarity. Furthermore, results on HoliDub-Bench validate the effectiveness of joint speech-and-sound generation, establishing a new paradigm for holistic video dubbing in complex acoustic scenes. The project demo page is https://holidubber.github.io
Primary: Xiamen University
All Institutions: Shanghai Innovation Institute, Joy Future Academy, Shanghai Jiao Tong University, Xiamen University
HoliDubber represents a significant advancement in the field of audio synthesis by introducing a holistic framework for video dubbing that integrates speech and sound effects generation from a single text prompt. This comprehensive analysis highlights the model's innovative methodology, robust experimental validation, and potential impact on multimedia content creation, marking it as a noteworthy contribution to machine learning research in audio processing.
The methodology presented in HoliDubber is innovative, utilizing a patch-based autoregressive diffusion transformer architecture that integrates visual features with audio synthesis to achieve holistic dubbing. The approach of jointly generating speech and sound effects from a single text prompt is a significant advancement over traditional speech-only systems. The use of cross-attention for audio-visual fusion and a multi-stage training strategy enhances the model's ability to generate coherent audio that aligns with visual articulation dynamics. The introduction of HoliDub-Bench as a benchmark for evaluating holistic dubbing is a noteworthy contribution, providing a structured way to assess the model's performance in complex acoustic scenes.
The experiments conducted are extensive and rigorous, demonstrating HoliDubber's superior performance across multiple benchmarks in terms of speech quality, synchronization, and speaker similarity. The use of both objective metrics (like WER, UTMOS, and EMO-SIM) and subjective evaluations (like MOS) provides a comprehensive assessment of the model's capabilities. The results on HoliDub-Bench further validate the effectiveness of the proposed framework, showcasing its ability to handle diverse acoustic environments and maintain high-quality audio generation.
The paper describes its implementation in detail, including the training data, model architecture, and evaluation metrics, which are essential for reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for other researchers.
One limitation noted is the potential trade-off between speech quality and lip-sync accuracy, where the model may prioritize one over the other depending on the inference mode. Additionally, the reliance on specific datasets may limit the generalizability of the model to other contexts or languages not represented in the training data.
The implications of HoliDubber extend to various fields, including film production, game development, and digital media localization, where high-quality dubbing is essential. By providing a unified framework for generating both speech and sound effects, this research could streamline workflows in multimedia content creation and enhance user experience in interactive applications. The model's ability to generate audio that aligns closely with visual cues also opens up possibilities for more immersive storytelling and user engagement.
Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives continuously and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency than a supervised fine-tuning baseline. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
Primary: Eastern Institute of Technology
All Institutions: Eastern Institute of Technology, Shanghai Jiao Tong University, The Hong Kong Polytechnic University, Southeast University, Xi'an Jiaotong-Liverpool University
The main contribution of this paper is the introduction of AdaSR, an innovative framework for adaptive streaming reasoning that optimizes the reasoning process in dynamic environments through hierarchical policy optimization and adaptive rewards. This work represents a significant advancement in the field of machine learning, particularly in the context of real-time reasoning and decision-making under uncertainty.
The paper introduces AdaSR, a novel adaptive streaming reasoning framework that leverages reinforcement learning to optimize reasoning during dynamic input streams. The methodology is robust, incorporating Hierarchical Relative Policy Optimization (HRPO) to address the temporal credit assignment problem inherent in streaming reasoning. By decomposing the policy optimization into distinct phases, AdaSR allows for more nuanced advantage assignment, which is a significant improvement over traditional methods that apply uniform advantages across all tokens. The integration of adaptive rewards further enhances the model's ability to balance reasoning accuracy and computational efficiency.
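To make the contrast with a single sequence-level advantage concrete, the following is a minimal sketch of phase-wise, group-relative advantage assignment. It is not the authors' HRPO implementation: the per-phase scalar rewards, the field names (`stream_reward`, `deep_reward`, `n_stream_tokens`, `n_deep_tokens`), and the mean/standard-deviation normalization are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """Normalize rewards within a group of rollouts sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def phase_wise_advantages(rollouts):
    """Assign one advantage per phase instead of one per whole sequence.

    Assumption: each rollout carries a separate scalar reward for its
    streaming phase and for its final deep-reasoning phase.
    """
    stream_adv = group_relative([r["stream_reward"] for r in rollouts])
    deep_adv = group_relative([r["deep_reward"] for r in rollouts])
    per_token = []
    for r, a_s, a_d in zip(rollouts, stream_adv, deep_adv):
        # Streaming tokens share the streaming advantage; deep tokens share theirs.
        per_token.append([a_s] * r["n_stream_tokens"] + [a_d] * r["n_deep_tokens"])
    return per_token


# Hypothetical group of two rollouts for the same prompt.
rollouts = [
    {"stream_reward": 1.0, "deep_reward": 0.0, "n_stream_tokens": 2, "n_deep_tokens": 3},
    {"stream_reward": 0.0, "deep_reward": 1.0, "n_stream_tokens": 1, "n_deep_tokens": 2},
]
token_advantages = phase_wise_advantages(rollouts)
```

In this toy group, the first rollout's streaming tokens receive a positive advantage while its deep-reasoning tokens receive a negative one, which a single sequence-level advantage could not express.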
The experiments are comprehensive, evaluating AdaSR against multiple benchmarks in reasoning tasks, including mathematical reasoning and context-based question answering. The results demonstrate significant improvements in accuracy and efficiency compared to baseline models, indicating the effectiveness of the proposed approach. The paper provides detailed metrics on accuracy, token lengths, and latency, which are critical for assessing the performance of streaming reasoning models.
The authors have released their code, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics that would facilitate easier replication of the experiments, such as hyperparameter settings and training configurations.
The paper acknowledges that AdaSR is primarily focused on text streams with verifiable answers, which may limit its applicability to more complex scenarios involving continuous audio or video streams. Additionally, the reliance on reinforcement learning may introduce challenges in training stability and convergence, which are not thoroughly addressed.
The proposed framework has the potential to significantly enhance real-time reasoning capabilities in various applications, including interactive AI systems, real-time translation, and autonomous agents. By enabling models to adaptively allocate computation based on input dynamics, AdaSR could lead to more responsive and efficient AI systems in real-world scenarios.
Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
Primary: T-Tech
All Institutions: T-Tech, AI Foundation and Algorithm Lab
The methodology is a well-executed adaptation of established mechanistic interpretability techniques (Sparse Autoencoders, LLM-based auto-interpretation) to a novel and challenging domain: the residual stream of a generative Text-to-Speech (TTS) Language Model. The core innovation lies in the "modality-aware auto-interp pipeline." This involves:
1. **SAE Training**: Training BatchTopK SAEs on the Qwen2.5-0.5B backbone of CosyVoice3, a standard and robust approach for decomposing activations. The choice of dictionary size and sparsity (d=16384, k=50) is reasonable for a model of this size.
2. **Evidence Extraction**: A crucial step that correctly identifies whether an activation occurs in the text-prefix or speech-token segment. This allows for modality-specific evidence (marked text window vs. 1-second audio clip).
3. **Feature Modality Tagging**: A simple yet effective heuristic (speech fraction > 0.8 for audio-modal, < 0.2 for text-modal, otherwise mixed) to categorize features. This is essential for guiding the auto-interpretation process.
4. **Automatic Labeling**: Using Gemini 3.0 Pro with modality-aware prompts is a clever way to leverage powerful LLMs for interpretation. The prompts are carefully designed to elicit specific descriptions based on the evidence type.
5. **Detection-Style Evaluation**: Adapting the protocol from text-only to mixed text/audio evidence, using rank-held-out examples and AUROC/balanced accuracy, provides a quantitative measure of label quality.
6. **SAE Feature Steering**: This is the most impactful methodological contribution. Instead of directly adding residual vectors, the intervention occurs in the SAE latent space, modifying selected feature activations and then decoding back to the residual stream. This demonstrates the *causal* role of the identified features, moving beyond mere correlation. The intervention mechanism is clearly described, ensuring locality to the SAE feature subspace.
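The modality-tagging heuristic and the latent-space intervention described above can be sketched in a few lines. This is an illustrative toy, not the paper's code: the function names and the tiny weight matrices are hypothetical, per-sample top-k stands in for BatchTopK (which selects activations across a batch), and shifting the residual along the feature's decoder direction is one assumed way to keep the edit local to that feature.

```python
def tag_modality(n_speech, n_text):
    """Tag a feature by the fraction of its activations on speech tokens."""
    total = n_speech + n_text
    if total == 0:
        raise ValueError("feature never fires")
    frac = n_speech / total
    if frac > 0.8:
        return "audio"
    if frac < 0.2:
        return "text"
    return "mixed"


def encode(x, w_enc, b_enc, k):
    """ReLU encoder followed by per-sample top-k (a simplification of BatchTopK)."""
    pre = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
           for row, b in zip(w_enc, b_enc)]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]


def steer(x, w_enc, b_enc, w_dec, k, feature, target):
    """Set one latent to `target` and map the change back to the residual stream.

    Assumption: only the edited feature's decoder direction is applied, so the
    SAE reconstruction error of the original activation is left untouched.
    """
    z = encode(x, w_enc, b_enc, k)
    delta = target - z[feature]
    return [xi + delta * di for xi, di in zip(x, w_dec[feature])]


# Toy 2-dimensional residual stream with a 2-feature dictionary.
w_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
w_dec = [[1.0, 0.0], [0.0, 1.0]]  # one decoder direction per feature
steered = steer([1.0, 0.0], w_enc, b_enc, w_dec, k=1, feature=1, target=2.0)
```

Here feature 1 is inactive in the original activation, so steering it to 2.0 adds twice its decoder direction and leaves the first coordinate unchanged.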
The experimental evaluation is comprehensive and well-structured, providing both quantitative and qualitative insights.
1. **Layer Sweep Analysis**: Figure 1 effectively illustrates the shift in feature modality composition and explained variance across layers. This provides valuable insights into the LM's internal processing, showing a transition from mixed/text-heavy to audio-heavy features in middle layers, and a surprising reversion to text-modal in the final hidden state. This analysis is novel for TTS LMs.
2. **Auto-Interp Quality**: Figure 2 provides quantitative evaluation of the auto-interpretation quality, showing text-modal labels are most verifiable (AUROC 0.921), followed by audio-modal (0.653), and mixed (0.558). While the mixed feature scores are lower, they are still above chance, and the qualitative examples (Table 1) show that even some mixed features are clean.
3. **Qualitative Examples**: Table 1 presents compelling examples of interpretable features, spanning phonemes, laughter, accent prompts, speaker gender, words, and even sub-lexical patterns. These examples strongly support the claim of interpretability.
4. **Feature Steering**: This is the highlight of the results. The paper demonstrates *causal control* over three distinct speech properties:
   * Laughter probability (0.015 to 0.791).
   * Perceived speaker gender (wav2vec2 P(male) from 0.629 to 0.944 or 0.063).
   * Speech rate (voiced duration from 3.96s to 10.57s or 2.75s).

   These results are highly impactful, showing that the identified features are not just descriptive but functional. The separation of gender steering by prompt speaker's original gender (Figure 4) adds further robustness.
5. **Concept Probing Experiments**: The appendix details supervised probes for laughter, emotion, and accent, showing that these concepts are linearly decodable early in the network and that SAE-latent probes closely track raw
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the speaker's cognitive health status, are correlated with privacy variables such as speaker identity. This violates the independence assumption held by disentanglement-based approaches, causing leakage of private information and loss of information useful for downstream tasks. To tackle this challenge, we propose a general framework, DDPO-VC, for speaker de-identification through reinforcement learning-based post-training with diffusion models. Learning from reward signals combining knowledge from privacy-focused and utility-focused teachers, our method outperforms various strong de-identification methods in both privacy preservation and cognitive utility on two commonly used dementia speech benchmarks. Code is available at https://github.com/cactuswiththoughts/DDPO-VC and a demo at https://cactuswiththoughts.github.io/SpeakerDeID-Demo/.
Primary: MIT CSAIL
All Institutions: MIT CSAIL, Boston University
The main contribution of this paper is the introduction of DDPO-VC, a novel framework for speaker de-identification that balances privacy and utility through reinforcement learning and diffusion models. This work represents a significant advancement in the field, addressing critical challenges in the intersection of privacy and cognitive utility in speech processing.
The proposed DDPO-VC framework effectively integrates reinforcement learning with diffusion models to address the dual challenge of privacy and utility in speaker de-identification. The methodology is well-structured, leveraging a conditional diffusion model and a novel reward mechanism that utilizes both privacy and utility teachers. This innovative approach allows for a more nuanced optimization of the privacy-utility tradeoff, which is critical in sensitive applications such as healthcare. The use of reinforcement learning to navigate complex correlations between variables is a significant advancement over traditional disentanglement methods.
The experiments are robust, utilizing two dementia speech benchmarks that are relevant and challenging. The results demonstrate clear superiority over existing methods in both privacy preservation and cognitive utility, with well-defined metrics such as AUC and EER. The comprehensive evaluation across multiple settings (zero-shot and fine-tuned) adds credibility to the findings. However, further details on the datasets and the specific configurations used in experiments would enhance the clarity of the evaluation.
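For readers unfamiliar with the EER metric mentioned above, the following is a minimal sketch of how an equal error rate can be computed from speaker-verification scores. It is a generic illustration rather than the paper's evaluation code; the score lists are made up, and the threshold sweep over observed scores is one simple way to approximate the crossing point of the two error curves.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER for similarity scores (higher means 'same speaker').

    Sweeps every observed score as a threshold and returns the mean of the
    false-accept and false-reject rates where the two are closest.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)     # true matches rejected
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores from a speaker-verification attacker.
genuine_scores = [0.9, 0.8, 0.3]
impostor_scores = [0.7, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

In a de-identification setting, the attacker's EER on anonymized speech is typically read as a privacy measure: the harder it is to re-identify speakers, the higher the EER.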
The paper provides a GitHub repository and demo link, which is a positive aspect for reproducibility. However, the implementation details could be more explicit, particularly regarding hyperparameters and training procedures, to ensure that other researchers can replicate the results accurately.
One limitation noted is the potential for reward hacking due to the fixed nature of the privacy teacher. Additionally, the reliance on pretrained models for the privacy and utility teachers may limit the generalizability of the approach to other domains. The paper also acknowledges the need for more diverse evaluation metrics beyond naturalness and speaker similarity, indicating room for improvement in the evaluation framework.
The implications of this research are significant, particularly in fields where privacy is paramount, such as healthcare. By improving speaker de-identification methods, the framework can help protect sensitive information while still allowing for the utility of speech data in applications like dementia diagnosis and monitoring. The potential for broader applications in other audio domains and utility variables further enhances its relevance.
We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new prompt, but this inversion-style route must trade prompt adherence against preservation of rhythm, transients, timbre, and long-range musical structure. Motivated by recent inversion-free flow editing in computer vision, we develop an audio-specific direct source-to-target ordinary differential equation for one-dimensional Stable Audio 3 latents: at each flow step, we compare the target- and source-conditioned velocity fields under a shared stochastic source marginal, and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. Across sound-effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target-text CLAP similarity from 0.42 to 0.52 over the strongest baseline while reducing FAD from 65.70 to 50.37.
Primary: Nankai University
All Institutions: Nankai University
The main contribution of this paper is the introduction of AudEdit, a zero-shot text-guided audio editor that employs an inversion-free direct ODE for audio editing, significantly improving the trade-off between prompt adherence and source preservation. This work represents a meaningful advancement in the field of audio processing, addressing critical challenges in audio editing while leveraging state-of-the-art generative models.
The paper presents a novel approach to text-guided audio editing through an inversion-free method using pretrained rectified-flow audio models. The authors develop a direct source-to-target ordinary differential equation that allows for effective editing without the need for training or optimization. This methodology is innovative as it circumvents the common issues associated with inversion methods, particularly in preserving audio characteristics while adhering to new prompts. The integration of stochastic source marginals to refine the editing process is a noteworthy aspect that enhances the robustness of the approach.
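The update rule described above (difference of target- and source-conditioned velocities under a shared noisy source sample) can be sketched with a toy velocity field. This is not the AudEdit implementation: the coupling `z_tgt = z_edit + z_src - x_src` follows the inversion-free flow editing recipe from computer vision and is an assumption here, as are the time convention (t runs from 1, pure noise, to 0, data), the Euler integrator, and the `toy_velocity` stand-in for a real pretrained model.

```python
import random

# Hypothetical stand-in for a pretrained model: each prompt's data
# distribution is a point mass at a scalar mean.
PROMPT_MEANS = {"dog bark": 0.5, "cat meow": -1.0}


def toy_velocity(z, t, prompt):
    """Velocity of a rectified flow whose data distribution is a point mass."""
    mu = PROMPT_MEANS[prompt]
    return [(zi - mu) / t for zi in z]


def edit(x_src, src_prompt, tgt_prompt, velocity, n_steps=4, seed=0):
    """Inversion-free editing: integrate the velocity difference from t=1 to t=0."""
    rng = random.Random(seed)
    ts = [1.0 - i / n_steps for i in range(n_steps + 1)]
    z_edit = list(x_src)  # the edited latent starts at the source latent
    for t, t_next in zip(ts[:-1], ts[1:]):
        noise = [rng.gauss(0.0, 1.0) for _ in x_src]
        # Noisy source sample drawn from the forward marginal at time t.
        z_src = [(1 - t) * x + t * n for x, n in zip(x_src, noise)]
        # Target-side point shares the same noise (assumed coupling).
        z_tgt = [ze + zs - x for ze, zs, x in zip(z_edit, z_src, x_src)]
        v_tgt = velocity(z_tgt, t, tgt_prompt)
        v_src = velocity(z_src, t, src_prompt)
        # Euler step on the difference of the two velocity fields.
        z_edit = [ze + (t_next - t) * (vt - vs)
                  for ze, vt, vs in zip(z_edit, v_tgt, v_src)]
    return z_edit


x_src = [0.5, 1.0]
edited = edit(x_src, "dog bark", "cat meow", toy_velocity)
```

With this toy field the shared noise cancels in the velocity difference, so the edit shifts each latent entry by the gap between the two prompt means while keeping the source's own offsets, which mirrors the preservation-versus-adherence trade-off the method targets.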
The experiments are comprehensive, utilizing well-defined datasets for sound effects and music derived from established sources like FSD50K and the Song Describer Dataset. The evaluation metrics are robust, including both objective measures (like CLAP similarity and FAD) and subjective assessments (mean opinion scores). The results demonstrate clear improvements over baseline methods, indicating the effectiveness of the proposed method in achieving a balance between prompt adherence and source preservation.
The paper provides detailed implementation settings, including the configuration of the Stable Audio 3 model and the parameters used in experiments. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate validation by the research community.
The method is primarily designed for controlled edits and may struggle with broader semantic rewrites that require significant changes to the audio content. The authors acknowledge that the approach inherits limitations from the Stable Audio 3 backbone, including its reliance on specific conditioning and the lack of explicit temporal controls. Additionally, the method may introduce artifacts in cases where the target prompt demands extensive alterations.
This research has significant implications for audio editing in creative industries, such as music production and sound design, where maintaining the integrity of the original audio while allowing for meaningful edits is crucial. The inversion-free approach could streamline workflows for audio professionals, enabling more intuitive and efficient editing processes. Furthermore, the findings may inspire further research into generative audio models and their applications in various multimedia contexts.
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), rendering the watermark vulnerable to generative attacks (e.g., neural codec and vocoder). To address this, we propose DuraMark, a robust information-level watermarking framework. It utilizes syllable duration editing to achieve watermark embedding. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model to edit syllable durations during synthesis, coupled with a duration extractor to extract these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at https://muzw.github.io/duramark_demo/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Institute of Forensic Science, Ministry of Public Security, The Hong Kong Polytechnic University
The main contribution of this paper is the introduction of DuraMark, a novel generative watermarking framework that embeds watermarks into synthesized speech by editing syllable durations, significantly improving robustness against generative attacks while preserving speech quality. This work represents a meaningful advancement in the field of audio processing and watermarking, addressing critical concerns related to deepfake technologies and the integrity of synthesized speech.
The proposed DuraMark framework introduces a novel approach to watermarking in LLM-based TTS systems by embedding watermarks at the information level through syllable duration editing. This method is innovative as it leverages a duration-controllable TTS model and a duration extractor, which allows for precise control over the watermarking process while maintaining the naturalness of the synthesized speech. The integration of these components is well-structured, and the methodology is clearly articulated, allowing for a thorough understanding of the process.
The experiments conducted are robust, utilizing a substantial dataset and comparing DuraMark against established signal-level watermarking methods. The evaluation metrics include True Positive Rate (TPR) under various attack scenarios, which is a relevant measure of robustness. The results demonstrate DuraMark's superior performance, particularly against generative attacks, which is a critical aspect of the paper's claims. The use of both objective and subjective metrics to assess speech naturalness further strengthens the experimental evaluation.
The paper provides sufficient detail regarding the experimental setup, including the datasets used and the training parameters. However, the absence of a public code repository limits reproducibility. While the methodology is clearly described, access to the code would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a specific language (Mandarin Chinese) for the experiments, which may affect the generalizability of the findings to other languages or dialects. Additionally, while the paper demonstrates robustness against various attacks, it does not explore the performance of DuraMark under more extreme or novel attack scenarios that may arise in real-world applications.
The implications of this research are significant, particularly in the context of combating deepfake technologies and ensuring the integrity of synthesized speech. The DuraMark framework could be applied in various fields, including media, security, and digital forensics, where the authenticity of audio content is crucial. The potential for this technology to enhance trust in AI-generated content is noteworthy.
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient style learning and thus limiting speaker similarity in synthesized speech. To this end, we investigate the prosody learning conditioned on the synthesized speech, and propose to predict the prosody of the current syllable based on previously predicted speech. Experimental results obtained on three datasets demonstrated the efficacy of the proposed dynamic prosody prediction method in enhancing the prosody learning capability, thereby improving the speaker similarity of the generated speech. Audio samples are available at https://muzw.github.io/dynapros/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFLYTEK
The main contribution of this paper is the introduction of a dynamic prosody prediction method that enhances speaker similarity in personalized TTS systems. This innovative approach, supported by comprehensive experimental validation, addresses key limitations in existing TTS technologies and has the potential to significantly impact the field of speech synthesis.
The proposed dynamic prosody prediction method represents a significant advancement in TTS technology by allowing for syllable-level prosody prediction based on previously generated speech. This approach addresses the limitations of existing methods that typically rely on static prosody modeling. The integration of prosody prediction into the speech generation process is well-justified and demonstrates a clear understanding of the challenges in personalized TTS systems. The methodology is sound, with a clear architecture and loss function defined, although the paper could benefit from more detailed explanations of the equations presented.
The experiments are comprehensive, utilizing three diverse datasets that cover a range of emotional and stylistic variations. The results are presented clearly, showing improvements in speaker similarity and prosody modeling capabilities. The use of both objective metrics (e.g., CER, emotion similarity) and subjective evaluations (e.g., MOS, preference tests) adds robustness to the findings. However, the paper could enhance its credibility by providing more detailed statistical analyses of the results, such as confidence intervals or significance testing.
The paper provides sufficient details regarding the experimental setup, including the datasets used, model architectures, and training procedures. The availability of the CosyVoice implementation and audio samples supports reproducibility. However, the lack of specific hyperparameter settings and training configurations for the proposed model could hinder complete reproducibility.
One limitation of the study is its focus on Mandarin Chinese, which may restrict the applicability of the findings to other languages or dialects. Additionally, while the proposed method shows promise in improving speaker similarity, the paper does not address potential challenges in real-world applications, such as the computational efficiency of the model during inference.
The proposed method has significant implications for the development of personalized TTS systems, particularly in applications such as virtual assistants, audiobooks, and entertainment. By improving speaker similarity, the approach could enhance user experience and engagement in various audio-related applications. Furthermore, the findings may inspire further research into dynamic prosody modeling in other languages and contexts.
Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge. Existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: https://free-sonic.github.io/
Primary: Tsinghua University
All Institutions: Tsinghua University, Alibaba Group, Monash University, Renmin University of China, Fudan University
FreeSonic presents a training-free framework for precise audio editing that leverages advanced attention mechanisms and noise injection techniques. The paper's contributions are significant, offering a novel approach to addressing longstanding challenges in the field of audio editing while demonstrating strong experimental validation and potential for broader applications.
The methodology presented in FreeSonic is innovative, combining a training-free approach with advanced techniques such as Rectified Flow-based models and joint text-audio attention maps. The introduction of scheduled attention decoupling and task-oriented noise injection is particularly noteworthy as it allows for precise audio editing while maintaining background integrity. The paper effectively addresses the challenges of temporal consistency and background preservation in audio editing, which are critical for high-fidelity audio applications.
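The idea of confining an edit to a target region can be illustrated with a toy example. This is a minimal sketch and not FreeSonic's algorithm: it assumes a one-dimensional attention profile over time frames for the target text token, derives a binary mask by thresholding relative to the peak, and then keeps original frames outside the mask. The function names, the threshold, and the toy values are all hypothetical.

```python
def target_mask(attention, threshold=0.5):
    """Mark frames whose attention is at least `threshold` times the peak.

    `attention` is a hypothetical per-frame attention profile for the
    text token that names the sound to be edited.
    """
    peak = max(attention)
    if peak <= 0:
        return [0] * len(attention)
    return [1 if a / peak >= threshold else 0 for a in attention]


def blend(original, edited, mask):
    """Take edited frames inside the mask and original frames elsewhere."""
    return [e if m else o for o, e, m in zip(original, edited, mask)]


# Toy data: four frames, with the target sound concentrated in frames 1 and 2.
attention = [0.1, 0.8, 0.9, 0.2]
original = [1.0, 1.0, 1.0, 1.0]
edited = [5.0, 5.0, 5.0, 5.0]

mask = target_mask(attention)
result = blend(original, edited, mask)
```

In this toy setting the background frames are returned unchanged, which is the property the review refers to as background preservation.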
The experimental evaluation is robust, utilizing both quantitative metrics (FAD, KL, IS, FD, CLAP) and subjective assessments (Mean Opinion Score) to validate the effectiveness of FreeSonic. The results demonstrate superior performance compared to existing training-free and training-based methods across various editing tasks. The ablation studies further reinforce the significance of each component in the proposed framework, showcasing a thorough understanding of the model's capabilities and limitations.
The paper provides a clear description of the experimental setup, including datasets and evaluation metrics, which supports reproducibility. However, the specifics of the implementation details, such as hyperparameter settings and the exact architecture of the model, could be more explicitly detailed to enhance reproducibility further.
One limitation of the study is the reliance on the performance of a single model architecture (TangoFlux) without exploring the potential of other architectures or hybrid approaches. Additionally, while the training-free aspect is a significant advantage, it may limit the model's adaptability to more complex audio editing scenarios that could benefit from fine-tuning.
FreeSonic has the potential to significantly impact the field of audio editing and generation, particularly in applications requiring high fidelity and precision, such as music production, film editing, and interactive media. The training-free nature of the approach could democratize access to advanced audio editing tools, allowing non-experts to achieve professional-quality results.
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme-guided cross-attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior $P(\text{spoofed}\mid X, W)$, conditioned on the acoustic representation $X$ and the phonetic posteriorgram $W$. The resulting factorization can be written as $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, where $M$ denotes the number of phonetic classes, $P(\text{spoofed} \mid X, Z = z_i)$ is the spoofing probability for the $i$-th phonetic class $z_i$ conditioned on $X$, and each $w_i$ is the prevalence of phonetic class $z_i$ in the utterance. Our transformer-based architecture instantiates this through a cross-attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax-normalized pooling supplying explicit phone-presence weights. Unlike prior approaches that rely heavily on post-hoc explainability methods, our framework offers phonetic-explainability-by-design. We evaluate the framework on an LJSpeech-derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per-phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully. Stops, fricatives, affricates, nasals, and silence-boundary closures rank most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per-articulatory category breakdown of the final verdict.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland
This paper presents a novel phoneme-guided cross-attention framework for speech deepfake detection, significantly enhancing interpretability and performance. The methodology effectively integrates phonetic structures into the detection process, providing a clear basis for understanding model decisions and contributing valuable insights to the field of audio processing and explainable AI.
The proposed methodology introduces a phoneme-guided cross-attention framework that significantly enhances the interpretability of speech deepfake detection systems. By leveraging phonetic posteriorgrams (PPGs) as a structural interface, the framework allows for a detailed analysis of the contribution of each phonetic class to the detection decision. This contrasts with traditional models that produce a single score without insight into the phonetic structure. The probabilistic factorization of the spoofing posterior into per-phone contributions is a novel approach that provides a clear, interpretable mechanism for understanding model behavior, which is a significant advancement in the field of explainable AI in speech processing.
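The factorization $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$ can be illustrated numerically. The following is a minimal sketch rather than the authors' implementation: it assumes the phone-presence weights $w_i$ are obtained by averaging a frame-level posteriorgram over time, and that per-phone spoofing probabilities are already available. The function names and toy values are hypothetical.

```python
def phone_weights(ppg):
    """Average a frame-level posteriorgram (T frames x M phones) over time.

    Each row is assumed to sum to 1, so the averaged weights also sum to 1.
    This pooling rule is an illustrative assumption.
    """
    num_frames = len(ppg)
    num_phones = len(ppg[0])
    return [sum(frame[i] for frame in ppg) / num_frames for i in range(num_phones)]


def contributions(weights, per_phone_probs):
    """Per-phone terms w_i * P(spoofed | X, Z = z_i) of the mixture."""
    return [w * p for w, p in zip(weights, per_phone_probs)]


def spoof_posterior(weights, per_phone_probs):
    """Sum of the per-phone terms, i.e. P(spoofed | X, W)."""
    return sum(contributions(weights, per_phone_probs))


# Toy example with M = 3 hypothetical phone classes and T = 2 frames.
ppg = [[0.5, 0.25, 0.25],
       [0.5, 0.25, 0.25]]
per_phone = [0.8, 0.4, 0.0]  # hypothetical P(spoofed | X, Z = z_i)

w = phone_weights(ppg)
terms = contributions(w, per_phone)
score = spoof_posterior(w, per_phone)
```

Because the final score is a plain sum of the per-phone terms, each term can be read directly as that phone class's share of the verdict.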
The experimental evaluation is robust, utilizing three datasets of varying complexity, including a controlled corpus and standard benchmarks like ASVspoof 2019 LA. The results demonstrate competitive performance while also providing insights into the discriminative power of different phonetic categories. The targeted phoneme-group ablation study further validates the importance of articulatory categories, confirming the model's ability to isolate and rank the contributions of different phonetic classes effectively.
The paper lacks explicit details regarding the implementation and availability of the code or models, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a publicly accessible repository or demo limits the ability for other researchers to validate and build upon the findings.
One limitation is the reliance on the quality of the phonetic posteriorgrams, which may introduce noise or inaccuracies if the phoneme extraction process is not robust. Additionally, while the model shows promise in structured interpretability, it may still struggle with complex, real-world scenarios where the phonetic structure is less clear. The paper does not address potential biases in the datasets used for training and evaluation.
The implications of this work are significant, particularly in the context of forensic voice analysis and anti-spoofing measures in security systems. By enhancing the interpretability of deepfake detection, the framework could facilitate more reliable applications in legal and security settings, where understanding the basis of decisions is crucial. Furthermore, the integration of phonetic structures into detection systems may inspire new research avenues in both speech synthesis and recognition.
The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing the generalization ability through multiple techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a domain adaptation mechanism utilizing a teacher-student framework to enhance the model's ability to perform and generalize effectively across unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is efficient in domain adaptation, improving the model's AUC by 4.09%, 17.94%, and 0.5% on the three unseen datasets, using only a small portion of them to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.
Primary: Sharif University of Technology
All Institutions: Sharif University of Technology
The paper presents a novel deepfake detection model that effectively integrates audio-visual modalities through a teacher-student framework, demonstrating strong performance across multiple datasets and highlighting its potential for real-world applications. The comprehensive methodology and experimental validation contribute meaningfully to the ongoing efforts in combating deepfake technologies.
The proposed EAV-DFD model employs a robust teacher-student framework for domain adaptation, integrating audio, visual, and audio-visual modalities through an ensemble architecture. The methodology is well-structured, with clear delineation of the training processes for both teacher and student models, and the use of specialized loss functions enhances the model's adaptability to unseen domains. The incorporation of unimodal networks alongside the audio-visual network allows for effective handling of scenarios where one modality may be missing, which is a significant advantage in real-world applications.
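The review does not specify the loss functions used. As a generic illustration of teacher-student training, the sketch below uses the standard knowledge-distillation objective (hard-label cross-entropy plus a temperature-softened KL term). This is an assumption and may differ from the EAV-DFD losses; all names and default values are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def student_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Weighted sum of the hard-label loss and the softened teacher-matching loss."""
    hard = cross_entropy(softmax(student_logits), label)
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    # The temperature**2 factor keeps gradient scales comparable across temperatures.
    return alpha * hard + (1 - alpha) * (temperature ** 2) * soft


# Toy logits for a two-class (real vs. fake) problem; the values are made up.
agree = student_loss([2.0, 0.0], [2.0, 0.0], label=0)
disagree = student_loss([2.0, 0.0], [0.0, 2.0], label=0)
```

When the student already matches the teacher, the soft term vanishes and only the hard-label loss remains; disagreement with the teacher raises the loss.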
The experiments are comprehensive, utilizing multiple datasets (FakeAVCeleb, DFDC, Deepfake_TIMIT, PolyGlotFake) to evaluate the model's performance across different domains. The results demonstrate significant improvements in AUC metrics, particularly in cross-domain generalization, which underscores the effectiveness of the proposed approach. The ablation studies provide valuable insights into the contributions of various components of the model, further validating the methodology.
The paper provides a GitHub repository link, which is crucial for reproducibility. However, detailed implementation specifics, such as hyperparameter settings and training configurations, could be more explicitly stated to facilitate easier replication of results by other researchers.
The model's performance may degrade under challenging conditions such as poor lighting or multi-speaker scenarios, indicating that further refinements are needed to enhance robustness. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other types of deepfake detection tasks.
This research has significant implications for the field of deepfake detection, particularly in enhancing the reliability of media content in sensitive contexts such as politics and security. The model's ability to adapt to new deepfake generation methods without catastrophic forgetting is particularly relevant as generative AI technologies continue to evolve.
With the rapid deployment of speech generation systems in open environments, providing verifiable source attribution and copyright accountability for audio content has become critical. A gap in current research is the lack of a unified benchmark that systematically compares different watermark injection methods under realistic distribution shifts. To address this, we build VoxWatermark by applying 10 watermarking methods (4 neural and 6 traditional) with unified injection and annotation on multilingual, multi-source corpora, and introducing no-box, black-box, and white-box perturbations to simulate real recording and transmission conditions. Based on this benchmark, we propose AudioWMD as a robust baseline detector for large-scale, multi-method, cross-distribution settings. Results show that injection-method diversity and distribution shifts affect detection stability, while validating the effectiveness and scalability of AudioWMD. Dataset and code are publicly available.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, University of Tehran
The paper presents VoxWatermark, a large-scale benchmark for audio watermark detection, and proposes the AudioWMD framework, significantly advancing the state of research in audio watermarking and detection methodologies. The comprehensive evaluation of various watermarking methods under realistic perturbations provides valuable insights into the robustness of detection systems, paving the way for future advancements in the field.
The paper introduces a comprehensive methodology for audio watermark detection by constructing the VoxWatermark benchmark, which systematically evaluates various watermarking methods under different perturbation scenarios. The proposed AudioWMD framework employs a two-stage detection process that incorporates query-response stability analysis, enhancing robustness against adversarial attacks. The methodology is well-structured, addressing a significant gap in the existing literature regarding the evaluation of watermark detection systems.
The experiments are rigorously designed, utilizing a large-scale dataset of over 126,000 hours of audio across multiple languages and perturbation types. The authors provide a detailed comparison of their AudioWMD detector against a baseline model (WMD), demonstrating superior performance across various out-of-domain (OOD) test sets. The results highlight the effectiveness of the proposed approach in maintaining detection stability under realistic conditions, although performance does degrade under certain adversarial attacks.
The authors have made their dataset and code publicly available, which is a significant step towards ensuring reproducibility. The detailed description of the experimental setup, including data partitioning and evaluation protocols, further supports the reproducibility of the results. However, the reliance on specific hyperparameters and configurations may still pose challenges for complete replication.
One limitation noted is the vulnerability of the AudioWMD detector to certain black-box attacks, indicating that while the model improves robustness, it is not entirely immune to sophisticated adversarial strategies. Additionally, the performance under no-box perturbations approaches chance levels, suggesting that further work is needed to enhance resilience against common audio processing distortions.
This research has significant implications for the fields of audio security and copyright protection, particularly as synthetic audio generation becomes more prevalent. The development of a robust watermark detection system is crucial for ensuring the integrity and authenticity of audio content in various applications, including media production and digital forensics.
A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.
Primary: Institute of Information Engineering, CAS
All Institutions: Institute of Information Engineering, CAS, School of Cyber Security, UCAS, The University of Western Australia, Beihang University
This paper presents a significant contribution to the understanding of pathway-dependent forgetting in multimodal models, introducing a novel experimental protocol and providing compelling evidence that the route of knowledge acquisition affects retention. The rigorous methodology and comprehensive experimental evaluation enhance its relevance and potential impact in the field of machine learning, particularly in audio and music processing.
The methodology introduces the Paired Pathway Controlled Protocol (PPCP), which is a well-structured experimental framework that rigorously controls for variables affecting knowledge retention in multimodal models. The three-phase design effectively isolates the pathway as the primary variable, ensuring that the results are attributable to the acquisition route rather than confounding factors. The methodology is sound and addresses significant blind spots in existing research on forgetting in multimodal models.
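The comparison at the heart of the protocol can be sketched as a paired forgetting measurement. This is a minimal illustration under stated assumptions: each knowledge unit is scored as correct or incorrect before and after the forgetting phase, separately for each pathway, and the forgetting rate is the fraction of initially correct units that become incorrect. The metric definition, function names, and toy values are hypothetical and are not taken from the paper.

```python
def forgetting_rate(before, after):
    """Fraction of units answered correctly before adaptation but not after."""
    known = [i for i, ok in enumerate(before) if ok]
    if not known:
        return 0.0
    lost = sum(1 for i in known if not after[i])
    return lost / len(known)


def pathway_gap(text_before, text_after, audio_before, audio_after):
    """Positive values mean text-pathway knowledge was forgotten more."""
    return (forgetting_rate(text_before, text_after)
            - forgetting_rate(audio_before, audio_after))


# Toy correctness flags for the same four knowledge units (made-up values).
text_before = [True, True, True, True]
text_after = [True, False, False, True]
audio_before = [True, True, True, True]
audio_after = [True, True, False, True]

gap = pathway_gap(text_before, text_after, audio_before, audio_after)
```

Scoring the same units on both pathways is what makes the comparison paired, so a difference in forgetting cannot be attributed to differences in the knowledge pool.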
The experiments are comprehensive, involving multiple architecturally distinct audio-language models and a variety of controls to validate the findings. The results consistently show that text-pathway knowledge is forgotten more than audio-pathway knowledge, providing robust evidence for the proposed hypothesis. The statistical analyses are thorough, and the use of controlled experiments enhances the credibility of the findings.
The paper provides detailed descriptions of the experimental setup, including training configurations and evaluation metrics, which supports reproducibility. The availability of the project URL with code further facilitates replication of the study.
While the study is robust, it primarily focuses on audio-language models within the music domain, which may limit the generalizability of the findings to other multimodal systems or domains. The paper also acknowledges that further exploration is needed to determine if the observed effects hold across different architectural families.
The implications of this research extend to the design of multimodal systems, suggesting that forgetting interventions should be pathway-aware. This could influence future work in continual learning, model editing, and unlearning, as well as applications in music understanding and retrieval systems.
While LALMs show promise on audio question answering, they fail to focus on question-relevant segments of audio and to provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a planning, tool execution, evidence integration, and answer verification workflow. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.
Primary: Tianjin University
All Institutions: Tianjin University
The main contribution of this paper is the introduction of EChO-Agent, a modular framework that enhances audio reasoning through a structured pipeline for tool execution, evidence integration, and answer verification. This comprehensive analysis highlights the technical contributions and significance of the methodology in addressing existing challenges in audio question answering, establishing a foundation for future advancements in the field.
The proposed EChO-Agent framework introduces a structured four-stage pipeline that effectively addresses the limitations of existing Large Audio Language Models (LALMs) in audio reasoning tasks. By integrating tool-augmented observation, evidence integration, reasoning, and verification, the methodology emphasizes a systematic approach to audio question answering. The use of specialized audio tools for observations and a structured evidence chain for reasoning is innovative, as it allows for a more nuanced understanding of audio context and improves the model's ability to produce verifiable outputs. The framework's design is well thought out, ensuring that each component contributes to the overall goal of enhancing audio reasoning.
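The four stages described above can be sketched as a small pipeline. This is an illustrative skeleton, not the EChO-Agent code: the tools are stub functions standing in for real audio models, the reasoning step is a keyword heuristic standing in for a model call, and every name here (including the file name `clip.wav`, which is never opened) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Evidence:
    tool: str      # name of the tool that produced the finding
    finding: str   # textual observation returned by the tool


def observe(audio_path: str, plan: List[str],
            tools: Dict[str, Callable[[str], str]]) -> List[Evidence]:
    """Stage 1: run each planned tool on the audio and collect its output."""
    return [Evidence(name, tools[name](audio_path)) for name in plan if name in tools]


def integrate(evidence: List[Evidence]) -> str:
    """Stage 2: merge tool outputs into one tagged evidence chain."""
    return "\n".join(f"[{e.tool}] {e.finding}" for e in evidence)


def reason(chain: str, choices: List[str]) -> str:
    """Stage 3: placeholder for a model call; picks the choice mentioned most often."""
    return max(choices, key=lambda c: chain.lower().count(c.lower()))


def verify(answer: str, chain: str) -> bool:
    """Stage 4: accept the answer only if the evidence chain mentions it."""
    return answer.lower() in chain.lower()


# Hypothetical stub tools standing in for real ASR and sound-event models.
tools = {
    "asr": lambda path: "the speaker says the train is late",
    "sound_events": lambda path: "train horn, station announcement",
}

evidence = observe("clip.wav", ["asr", "sound_events"], tools)
chain = integrate(evidence)
answer = reason(chain, ["station", "kitchen"])
verified = verify(answer, chain)
```

Keeping the evidence chain as an explicit intermediate artifact is what makes the final answer checkable: the verification stage inspects the same text the reasoning stage consumed.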
The experiments conducted on the MMAR benchmark provide a solid evaluation of the proposed method. The results demonstrate a significant improvement in accuracy and rubric scores compared to baseline models, indicating the effectiveness of the EChO-Agent framework. The ablation studies are particularly valuable, as they quantify the contributions of each component, reinforcing the importance of evidence integration and verification in the reasoning process. However, the paper could benefit from a more extensive comparison with a broader range of existing methods to contextualize its performance further.
The paper outlines a clear methodology and experimental setup, which aids in reproducibility. However, the lack of detailed implementation specifics, such as hyperparameters and the exact configurations used for the audio tools, limits the ability of other researchers to replicate the results fully. Providing access to code or supplementary materials would enhance reproducibility.
One limitation of the study is the reliance on specific audio tools, which may not generalize across all audio reasoning tasks. Additionally, while the framework shows promise, it has yet to be tested on a wider variety of datasets beyond the MMAR benchmark. The paper also does not address potential computational costs associated with the tool-augmented approach, which could impact scalability.
The EChO-Agent framework has the potential to significantly advance the field of audio reasoning and question answering, particularly in applications such as automated transcription, audio content analysis, and interactive audio systems. By improving the reliability and verifiability of audio reasoning, this work could lead to more robust AI systems capable of understanding complex audio environments, which is increasingly relevant in areas like virtual assistants, accessibility technologies, and multimedia content creation.
Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology, Korea Advanced Institute of Science and Technology, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of AudioDER, a reasoning-oriented dataset designed to enhance the post-training of large audio-language models through a novel redundancy-aware construction pipeline. This work significantly advances the field by addressing the challenges of dataset redundancy and providing a comprehensive resource for improving audio reasoning capabilities in LALMs.
The proposed methodology is robust, focusing on a redundancy-aware data construction pipeline that effectively enhances the quality and diversity of training data for LALMs. The multi-stage process, which includes acoustic similarity-based deduplication, integration of existing annotations, and generation of CoT rationales, is well-structured and addresses key challenges in audio reasoning. The use of Qwen3-30B for rationale generation is particularly innovative, as it combines language understanding with audio processing to create a comprehensive dataset. The methodology is clearly articulated, with a logical flow from data collection to final dataset construction.
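The deduplication step can be sketched as a greedy similarity filter. This is a minimal illustration under stated assumptions: each clip is represented by a fixed-length embedding, similarity is measured by cosine, and a clip is dropped when it is too close to any clip already kept. The embedding source, the greedy strategy, and the threshold value are hypothetical and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.95):
    """Return indices of clips kept after greedy near-duplicate removal.

    A clip is kept only if its similarity to every previously kept clip
    is below `threshold`. Cost is quadratic in the number of kept clips.
    """
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Toy embeddings: clip 1 is nearly identical to clip 0, clip 2 is distinct.
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = deduplicate(embeddings)
```

A greedy pass of this kind depends on input order, and a stricter threshold removes more clips, which is the trade-off behind the concern that filtering could discard useful diversity.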
The experimental evaluation is thorough, demonstrating the effectiveness of the AudioDER dataset through extensive post-training experiments on multiple audio reasoning benchmarks. The results show consistent improvements in performance across various models, indicating that the dataset is not only well-constructed but also impactful in enhancing reasoning capabilities. The benchmarks chosen (MMAU-mini, MMSU, and MMAR) are relevant and challenging, providing a solid basis for evaluating the dataset's effectiveness.
The paper provides sufficient implementation details, including the architecture used (Qwen2-Audio-7B-Instruct), training parameters, and the experimental setup. However, the lack of a publicly available demo or interactive component limits the ease of reproducibility for external researchers. The open-source nature of the dataset is a positive aspect that encourages further exploration and validation by the community.
One limitation is the reliance on existing datasets for annotations, which may introduce biases inherent in those sources. Additionally, while the redundancy filtering process is beneficial, it may inadvertently remove samples that could contribute valuable diversity. The paper does not address potential scalability issues related to the dataset size or the computational resources required for post-training on larger models.
The AudioDER dataset has significant potential for advancing research in audio reasoning and LALMs. By providing a high-quality, structured dataset, it can facilitate the development of more capable audio understanding systems, which could have applications in various fields such as accessibility, education, and entertainment. The emphasis on reducing redundancy also highlights a critical area for improvement in dataset construction practices across machine learning.
Explainable and trustworthy speech emotion recognition (SER) remains a challenging task to date, largely due to the scarcity of SER data with reliable speech emotion descriptor (SED) labels, such as prosodic features and speaker traits. This paper presents a confidence score and reinforcement learning (RL) based on-the-fly SED rectification approach for post-training SER systems on automatically annotated SED labels. Experiments on IEMOCAP and MELD suggest that explainable SER systems incorporating the proposed confidence score and RL-based SED rectification approach consistently outperform baselines without data selection or SED rectification. The best performing system, which integrates both components, surpasses the baseline without data selection and SED rectification, achieving SER gains of 2.9% and 3.3% absolute (3.7% and 5.4% relative) on IEMOCAP and MELD benchmarks, respectively.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Chinese Academy of Sciences, National Research Council Canada, Tsinghua University
The main contribution of this paper is the introduction of a confidence score and reinforcement learning-based approach for rectifying speech emotion descriptors in SER systems, which significantly improves both performance and explainability. This work addresses critical challenges in SER, providing a robust framework that could influence future research and applications in emotion recognition and related fields.
The paper proposes a novel methodology that integrates a confidence score-based data selection method and a reinforcement learning (RL)-based SED rectification approach for improving SER systems. The confidence estimation model (CEM) is well-structured, employing a multi-layer perceptron (MLP) to evaluate the reliability of automatically annotated SED labels. The RL-based SED Controller adds a dynamic element to the training process, allowing for real-time adjustments to SED labels, which is a significant advancement over static label approaches. The methodology is clearly articulated and demonstrates a thoughtful approach to addressing the limitations of existing SER systems.
The experiments conducted on the IEMOCAP and MELD datasets are comprehensive, comparing the proposed system against various baselines and existing state-of-the-art models. The results show consistent improvements in SER performance, with statistically significant gains in accuracy. The use of t-SNE visualizations to illustrate the clustering of emotion categories adds depth to the analysis, demonstrating the effectiveness of the proposed methods in enhancing the explainability and trustworthiness of SER systems.
The paper provides sufficient detail regarding the experimental setup, including model architectures, training procedures, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. The authors should consider releasing their code and trained models to facilitate further research and validation of their findings.
One limitation is the reliance on automatically annotated SED labels, which may still introduce noise despite the proposed rectification methods. Additionally, while the paper demonstrates improvements on two datasets, the generalizability of the approach to other SER tasks or languages remains untested. The impact of varying the threshold for confidence score selection on performance is also not thoroughly explored.
The advancements in explainable and trustworthy SER systems have significant implications for human-computer interaction, particularly in applications such as virtual assistants, mental health monitoring, and customer service automation. By enhancing the interpretability of emotion recognition systems, this research could foster greater user trust and acceptance of AI technologies in sensitive areas.
We present FoleyGenEx, a unified video-to-audio (VTA) framework integrating multi-modal control, frame-level temporal alignment, and fine-grained semantics, enabling synchronized, versatile audio synthesis for diverse tasks. Existing VTA methods either have multi-modal control but weak temporal alignment or strong alignment but lack reference audio conditioning and semantic precision. FoleyGenEx fills this gap via three core innovations: a conditional injection mechanism for audio-controlled VTA and Foley extension, a multi-modal dynamic masking strategy preserving training synchronization, and an adverb-based data augmentation algorithm leveraging signal processing and large language models to enhance textual supervision with nuanced semantics. Experiments on AudioCaps, VGGSound, and Greatest Hits demonstrate its competitive controllable VTA performance against existing methods. Demo samples are available at https://foleygenex.github.io/FoleyGenEx.
Primary: Nankai University
All Institutions: Nankai University, Kuaishou Technology
FoleyGenEx presents a significant advancement in video-to-audio generation, effectively addressing key limitations of existing methods through innovative architectural and methodological contributions. The integration of multi-modal control, temporal alignment, and semantic precision positions this work as a valuable addition to the field of generative audio systems.
The methodology introduced in FoleyGenEx is robust, integrating a conditional injection mechanism and a multi-modal dynamic masking strategy to enhance temporal alignment and semantic precision in video-to-audio generation. The use of adverb-based data augmentation is particularly innovative, addressing the scarcity of nuanced training data and enabling fine-grained control over audio generation. The architecture builds upon the MMDiT framework, which is a solid choice for cross-modal tasks, and the paper clearly delineates how each component contributes to the overall performance improvements.
The experiments conducted on multiple datasets (AudioCaps, VGGSound, and Greatest Hits) are comprehensive and demonstrate the effectiveness of FoleyGenEx against existing methods. The metrics used for evaluation, including distribution matching and semantic alignment, are appropriate for the task. The inclusion of subjective evaluations (Good, Same, Bad study) adds depth to the assessment of the model's performance, particularly regarding the adverb augmentation.
The paper provides sufficient implementation details, including training configurations and dataset descriptions, which would allow for reproducibility. However, the lack of a publicly available code repository limits the ease with which other researchers can replicate the results.
One limitation is the reliance on specific datasets, which may not generalize across all types of video-to-audio tasks. Additionally, the paper does not address potential biases in the training data or the implications of using large language models for data augmentation, which could affect the model's performance in real-world scenarios.
The advancements made in FoleyGenEx have significant implications for applications in multimedia content creation, enhancing user experience through synchronized audio generation. The ability to generate audio that is semantically aligned with video content can improve accessibility and engagement in various media formats, including film and gaming.
Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.
Primary: Nanjing University
All Institutions: Nanjing University, CASIA
The paper presents OmniVideo-100K, a novel dataset and methodology for enhancing audio-visual reasoning in question-answering tasks. This work significantly contributes to the field by addressing existing limitations in audio-visual QA systems and demonstrating substantial performance improvements through innovative data generation techniques.
The proposed methodology introduces a two-stage automated data generation pipeline that enhances audio-visual QA by integrating structured scripting and clue-guided QA generation. This approach addresses significant limitations in existing methods by ensuring cross-segment referential consistency and promoting deep cross-modal reasoning, which is a notable advancement in the field.
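One way to picture the structured script is as a small data model in which every segment refers to entities through a shared global list. The sketch below is a hypothetical schema, not the paper's format: the class names, field names, and both helper methods are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start_s: float
    end_s: float
    description: str   # joint audio-visual description of the segment
    entity_ids: list   # ids of entities mentioned in this segment

@dataclass
class VideoScript:
    summary: str
    entities: dict     # entity id -> canonical description (global prior)
    segments: list = field(default_factory=list)

    def unknown_references(self):
        """Entity ids used in segments but missing from the global list."""
        known = set(self.entities)
        return sorted({e for seg in self.segments
                       for e in seg.entity_ids if e not in known})

    def cross_segment_entities(self):
        """Entities appearing in two or more segments (candidate clues)."""
        counts = {}
        for seg in self.segments:
            for e in set(seg.entity_ids):
                counts[e] = counts.get(e, 0) + 1
        return sorted(e for e, n in counts.items() if n >= 2)

script = VideoScript(
    summary="A street musician plays while a dog barks.",
    entities={"E1": "man with guitar", "E2": "brown dog"},
    segments=[
        Segment(0.0, 10.0, "E1 strums the guitar", ["E1"]),
        Segment(10.0, 20.0, "E2 barks as E1 keeps playing", ["E1", "E2"]),
        Segment(20.0, 30.0, "E3 claps off screen", ["E3"]),
    ],
)
print(script.unknown_references(), script.cross_segment_entities())
```

Here the first helper flags a reference that breaks cross-segment consistency, and the second lists entities that span segments, which is the kind of material a clue-mining stage could draw on before writing questions.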
The experiments demonstrate robust performance improvements across various benchmarks, with fine-tuned models showing gains of up to 20.59% on the human-verified test set. The comprehensive evaluation on established benchmarks further validates the effectiveness of the proposed dataset and methodology.
The paper provides sufficient details regarding the experimental setup, including model configurations and dataset construction, which supports reproducibility. However, the absence of a public demo or interactive tool limits immediate accessibility for verification.
The paper does not address potential biases in the dataset construction process or the limitations of the automated pipeline in generating high-quality QA pairs. Additionally, the reliance on LLMs may introduce inherent biases or inaccuracies in the generated outputs.
The research has significant implications for advancing audio-visual understanding in AI, particularly in applications like video analysis, interactive media, and educational tools. The structured scripts and QA pairs can serve as valuable resources for further research and development in multimodal AI systems.
Sound events are entities with semantic identities, locations, and trajectories, but current audio-language models usually reason about clips as global event content. Conversely, sound event localization models track source directions over time but offer limited semantic coverage for language reasoning. To address this gap, we introduce ST-AudioQA, a spatio-temporal audio QA dataset and benchmark built from first-order ambisonic (FOA) renderings of static and moving sound sources. Each scene provides source identity, activity, direction, distance, and motion metadata, enabling dense trajectory supervision and questions about what is sounding, where it is, how it moves, and how sources relate. We further propose ST-Audio Encoder, a time-resolved FOA audio encoder that learns event semantics together with source trajectories, and ST-AudioLM, which connects the audio tokens from the encoder to an LLM for spatio-temporal audio QA. Experiments show that this representation improves the semantic-localization tradeoff and yields stronger reasoning performance than static spatial and localization-oriented baselines.
Primary: Sony AI
All Institutions: Sony AI
The paper presents a novel approach to spatio-temporal audio language modeling, introducing a dataset and methodologies that enhance the understanding of dynamic sound sources. The comprehensive evaluation and innovative contributions position this work as a significant advancement in the field of audio processing and machine learning.
The methodology presented in this paper is robust and innovative, combining a novel spatio-temporal audio QA dataset with a specialized audio encoder and an audio-language model. The use of first-order ambisonic (FOA) renderings to create a controlled benchmark for dynamic sound sources is a significant advancement in the field. The proposed ST-Audio Encoder effectively learns event semantics alongside source trajectories, and the integration with a large language model (LLM) for audio QA is a noteworthy contribution. The structured approach to generating QA pairs from the rendered audio scenes demonstrates a clear understanding of the complexities involved in audio-language reasoning.
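To make the FOA representation concrete, the sketch below pans a mono tone along a moving azimuth trajectory and then recovers a coarse per-frame direction from the time-averaged product of the omnidirectional and dipole channels. It is a simplified illustration, not the paper's renderer or encoder: it assumes SN3D-style gains, a free-field source with no distance cues, reverberation, or Doppler shift, and all function names are hypothetical.

```python
import math

def encode_foa(signal, azimuths_deg, elevation_deg=0.0):
    """Encode a mono signal into FOA channels with a per-sample azimuth.

    Assumes SN3D-style gains (W carries the signal unscaled); azimuth is
    measured counter-clockwise from the front.
    """
    el = math.radians(elevation_deg)
    w, x, y, z = [], [], [], []
    for s, az_deg in zip(signal, azimuths_deg):
        az = math.radians(az_deg)
        w.append(s)
        x.append(s * math.cos(az) * math.cos(el))
        y.append(s * math.sin(az) * math.cos(el))
        z.append(s * math.sin(el))
    return {"W": w, "X": x, "Y": y, "Z": z}

def frame_azimuth_deg(foa, start, end):
    """Estimate azimuth in one frame from the averaged W*X and W*Y products."""
    ix = sum(foa["W"][n] * foa["X"][n] for n in range(start, end))
    iy = sum(foa["W"][n] * foa["Y"][n] for n in range(start, end))
    return math.degrees(math.atan2(iy, ix))

# A 440 Hz tone moving linearly from 0 to 90 degrees over one second.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
trajectory = [90.0 * n / (sr - 1) for n in range(sr)]
foa = encode_foa(tone, trajectory)

frame = 1000
track = [frame_azimuth_deg(foa, i, i + frame) for i in range(0, sr, frame)]
print([round(a, 1) for a in track])
```

The per-frame estimates rise steadily across the clip, which is the kind of time-resolved trajectory signal that a static, clip-level spatial embedding would average away.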
The experiments are comprehensive, comparing the proposed models against various baselines, including static and dynamic encoders. The results clearly indicate that the ST-Audio Encoder outperforms existing models in terms of semantic recognition and spatial localization. The evaluation metrics are well-defined, and the experiments cover a range of scenarios, from single-source perception to complex two-source grounding and compositional reasoning. This thorough evaluation strengthens the paper's claims regarding the effectiveness of the proposed methods.
The paper provides sufficient details on the implementation, including the architecture of the ST-Audio Encoder and the training procedures for both the encoder and the LLM. However, the lack of a publicly available demo or project URL limits the reproducibility aspect, as external researchers cannot easily verify the results or utilize the proposed methods without access to the code or data.
The primary limitation noted in the paper is the reliance on controlled synthetic rendering, which may not fully capture the complexities of real-world acoustic environments. Additionally, the benchmark simplifies dynamic scenes, potentially overlooking important acoustic phenomena such as Doppler effects and non-monotonic motion. The authors also acknowledge the need for broader real-world evaluation and the potential biases inherited from the dataset used.
The advancements in spatio-temporal audio language modeling have significant implications for various applications, including robotics, augmented reality, and immersive audio experiences. By improving the understanding and reasoning capabilities of audio-language models, this research paves the way for more sophisticated human-computer interaction systems that can interpret and respond to dynamic auditory environments.
Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming binaural specificity. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Jilin University, Hunan University, University of Electronic Science and Technology of China
This paper presents a critical evaluation of spatial audio models, revealing their limitations in phase encoding and suggesting directions for future research. The comprehensive methodology and significant findings contribute meaningfully to the field of machine learning in audio processing.
The methodology is robust, employing a psychoacoustic benchmark based on binaural masking level difference (BMLD) to evaluate various audio models. The authors effectively utilize a combination of frozen self-supervised learning models and analytical baselines to assess the encoding of interaural phase cues. The progressive physical ablation approach is particularly noteworthy as it isolates the detection mechanisms, providing clear insights into the models' reliance on spectro-temporal interference rather than genuine phase computation.
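The logic behind the benchmark can be shown with the classic pair of conditions: identical noise in both ears with the tone either in phase (N0S0) or inverted in one ear (N0Sπ). The BMLD is the difference in detection threshold, in dB, between the two. The sketch below is a toy illustration under stated assumptions, not the paper's stimuli or baseline: it uses only the cancellation step of an equalization-cancellation model, omits the internal noise that makes the real model's advantage finite, and picks the tone frequency and level arbitrarily.

```python
import math
import random

def rms(x):
    """Root-mean-square level of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def make_stimulus(noise, tone, antiphasic):
    """Return (left, right). The noise is identical in both ears (N0);
    the tone is in phase (S0) or inverted in the right ear (Spi)."""
    sign = -1.0 if antiphasic else 1.0
    left = [n + s for n, s in zip(noise, tone)]
    right = [n + sign * s for n, s in zip(noise, tone)]
    return left, right

def ec_residual(left, right):
    """Cancellation step for diotic noise: subtract the two ears."""
    return [l - r for l, r in zip(left, right)]

random.seed(0)
sr, f0, amp = 16000, 500.0, 0.05  # hypothetical stimulus parameters
noise = [random.gauss(0.0, 1.0) for _ in range(sr)]
tone = [amp * math.sin(2 * math.pi * f0 * n / sr) for n in range(sr)]

n0s0 = make_stimulus(noise, tone, antiphasic=False)
n0spi = make_stimulus(noise, tone, antiphasic=True)

# A single-channel model sees the same left ear in both conditions.
left_identical = n0s0[0] == n0spi[0]

# After cancellation the noise vanishes; only N0Spi retains the tone.
res_s0 = rms(ec_residual(*n0s0))
res_spi = rms(ec_residual(*n0spi))
print(left_identical, res_s0, res_spi, 2 * rms(tone))
```

If a model only ever receives one channel, the two conditions are indistinguishable and its BMLD must be zero, whereas any computation that compares the channels can expose the tone. This is why a nonzero BMLD is a useful probe of cross-channel phase processing.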
The experiments are comprehensive, involving nine different audio models and a variety of conditions to ensure thorough evaluation. The use of both synthetic targets and realistic speech excerpts enhances ecological validity. The results are presented clearly, with detailed comparisons against established baselines, and the statistical methods employed for significance testing are appropriate. The findings reveal critical insights into the limitations of current models in encoding phase information, which is essential for spatial audio perception.
The paper provides sufficient detail regarding the models, stimuli, and evaluation metrics, which supports reproducibility. However, the lack of publicly available code or datasets limits the ease with which others can replicate the experiments. Including a link to a GitHub repository or similar would enhance reproducibility significantly.
One limitation is the reliance on frozen models, which may not fully capture the dynamic nature of audio processing in real-world applications. Additionally, while the study identifies the shortcomings of general-purpose binaural models, it does not extensively explore potential solutions or improvements. The ecological confound in realistic speech conditions could also be further investigated to understand its implications better.
The findings have significant implications for the development of future spatial audio models, particularly in enhancing their ability to encode phase information, which is crucial for accurate sound localization. This research could influence the design of audio processing systems in various applications, including virtual reality, hearing aids, and immersive audio experiences.
We present target speaker tagging (TST), a task that integrates speaker diarization, verification, and identification into a unified workflow for multi-speaker conversations. Given long recordings and pre-enrolled speakers, TST detects and labels speech segments of known speakers while rejecting unknown ones. Despite its practical importance, research has been limited by the absence of suitable evaluation resources. To address this, we introduce TST-Bench, a large-scale synthetic benchmark with over 150 enrolled speakers, 300 sessions of 20-60 minutes, and reference annotations with global speaker labels. We define an evaluation protocol encompassing diarization and full-pipeline scenarios. Experiments on both real and synthetic data show that TST poses challenges not captured by conventional benchmarks, and that dedicated system design yields significant gains over naive integration of existing solutions. The benchmark dataset and evaluation protocols are publicly released.
Primary: NAVER Cloud Corporation
All Institutions: NAVER Cloud Corporation, NAVER Corporation
The paper presents a novel task and benchmark for target speaker tagging that integrates multiple aspects of speaker recognition, filling a critical gap in the field. The comprehensive methodology and robust experimental evaluation underscore its potential impact on real-world applications and future research directions.
The paper introduces the Target Speaker Tagging (TST) task, which is a novel integration of speaker diarization, verification, and identification. The methodology is well-structured, detailing the system's components and their interactions. The authors provide a clear definition of the TST task and articulate its significance in real-world applications. The construction of TST-Bench, a large-scale synthetic benchmark, is a significant methodological contribution, allowing for systematic evaluation of TST systems. The approach to combining speaker embeddings from multiple segments to enhance identification accuracy is particularly noteworthy.
The experimental setup is robust, utilizing both synthetic and real datasets to validate the proposed methods. The authors present thorough evaluations across different scenarios, demonstrating the effectiveness of their approach. Results indicate that the TST framework outperforms naive integrations of existing methods, highlighting the importance of dedicated system design. The use of metrics like Detection and Identification Rate (DIR) and False Alarm Rate (FAR) provides a comprehensive view of system performance.
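For readers unfamiliar with the two metrics, the sketch below computes them in the way they are commonly defined for open-set identification: DIR counts enrolled-speaker trials that are both accepted and assigned the right identity, and FAR counts unknown-speaker trials that are wrongly accepted. This is an assumption about the definitions; the paper's protocol may weight by segment duration or differ in other details, and the trial data here is invented.

```python
def dir_far(trials, threshold):
    """Compute (DIR, FAR) from open-set identification trials.

    trials: list of (true_id or None, predicted_id, score), where a
    true_id of None marks a speaker who is not enrolled.
    """
    known = [t for t in trials if t[0] is not None]
    unknown = [t for t in trials if t[0] is None]
    hits = sum(1 for true, pred, score in known
               if pred == true and score >= threshold)
    false_alarms = sum(1 for _, _, score in unknown if score >= threshold)
    dir_ = hits / len(known) if known else 0.0
    far = false_alarms / len(unknown) if unknown else 0.0
    return dir_, far

# Invented trials for illustration only.
trials = [
    ("alice", "alice", 0.82),  # correct identity, accepted
    ("bob", "alice", 0.71),    # accepted but wrong identity
    ("carol", "carol", 0.40),  # correct identity but rejected
    ("dave", "dave", 0.90),    # correct identity, accepted
    (None, "alice", 0.65),     # unknown speaker accepted: false alarm
    (None, "bob", 0.30),       # unknown speaker rejected
]
dir_value, far_value = dir_far(trials, threshold=0.6)
print(dir_value, far_value)
```

Raising the threshold lowers FAR but also lowers DIR, so the two are usually reported together or as a curve over thresholds.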
The paper provides sufficient detail regarding the implementation of the TST system, including the use of specific models and techniques for speaker diarization and identification. The authors state that the dataset and evaluation scripts are publicly released, which is a positive aspect for reproducibility. However, the lack of a publicly accessible code repository or demo for the system itself limits the reproducibility of the reported results.
The paper acknowledges limitations related to the synthetic nature of TST-Bench, particularly the differences between synthetic and real conversational dynamics. The authors also note that while synthetic data allows for controlled experiments, it may not capture all the complexities of natural speech interactions. Additionally, the reliance on a specific type of speech (read speech) may not fully represent the variability found in spontaneous conversations.
The TST framework and benchmark have the potential to significantly advance the field of speaker recognition by providing a unified evaluation approach that reflects real-world challenges. This work could lead to improved systems for applications such as meeting transcription, voice-based services, and multi-session analytics. By addressing the limitations of existing benchmarks, the authors encourage further research and development in integrated speaker recognition systems.
Recent advances in speech generation have significantly improved the naturalness of synthetic speech, making spoofing detection increasingly challenging. A key limitation of current anti-spoofing systems is their limited robustness to unseen synthesis methods. In this work, we transform a self-supervised speech representation model into a Mixture-of-Experts (MoE) architecture to improve generalization. Feed-forward blocks in selected encoder layers are replaced by multiple expert networks controlled by a layer-wise gating mechanism, allowing experts to capture complementary acoustic patterns while preserving the representations learned during self-supervised pretraining. We further analyze the architectural choices affecting the performance of this MoE conversion and investigate the activation behavior of the experts. The proposed approach is evaluated on 14 spoofing datasets and reduces the macro EER from 5.46% to 4.81%, corresponding to 11.9% relative improvement over the baseline.
Primary: affiliation=1 Mickael
All Institutions: affiliation=1 Mickael, affiliation=1 Driss, affiliation=2 Khaled
The main contribution of this paper is the introduction of a Mixture-of-Experts architecture for speech anti-spoofing, which enhances generalization capabilities compared to traditional methods. The technical contributions are significant, as they provide a new framework for improving the robustness of speech models against evolving spoofing techniques, which is crucial in an era of advanced synthetic speech technologies.
The paper presents a novel approach by converting a self-supervised speech model into a Mixture-of-Experts (MoE) architecture, which is a significant departure from existing methods that utilize low-rank adaptations. The methodology is well-structured, detailing the conversion process, gating mechanisms, and expert activation strategies. The authors also conduct a comprehensive architectural study, analyzing the effects of various design choices on performance, which adds depth to the methodology.
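The conversion described above can be sketched as replacing one feed-forward block with several experts whose outputs are mixed by a gate. The toy below is a sketch under stated assumptions rather than the paper's design: it uses dense softmax gating over all experts, initializes every expert as a copy of the pretrained block, and uses tiny hypothetical dimensions. The summary does not state the routing rule or the initialization.

```python
import math
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def ffn(params, x):
    """Two-layer feed-forward block with ReLU."""
    w1, w2 = params
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

def moe_ffn(experts, gate_w, x):
    """Mix expert outputs with softmax gate weights computed from x."""
    weights = softmax(matvec(gate_w, x))
    outputs = [ffn(p, x) for p in experts]
    return [sum(g * out[d] for g, out in zip(weights, outputs))
            for d in range(len(x))]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
dim, hidden, n_experts = 4, 8, 3  # hypothetical sizes
pretrained = (rand_matrix(hidden, dim), rand_matrix(dim, hidden))

# Assumption: each expert starts as a copy of the pretrained block.
experts = [([row[:] for row in pretrained[0]],
            [row[:] for row in pretrained[1]])
           for _ in range(n_experts)]
gate_w = rand_matrix(n_experts, dim)

x = [0.5, -1.0, 0.25, 2.0]
dense_out = ffn(pretrained, x)
moe_out = moe_ffn(experts, gate_w, x)
max_gap = max(abs(a - b) for a, b in zip(dense_out, moe_out))
print(max_gap)
```

Because the gate weights sum to one and the experts start identical, the converted layer initially reproduces the pretrained output up to rounding, which is one simple way a conversion could preserve pretrained representations before the experts diverge during fine-tuning.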
The experimental setup is robust, utilizing 14 diverse spoofing datasets to evaluate the proposed approach. The reduction in macro EER from 5.46% to 4.81% demonstrates a meaningful improvement in performance. The paper includes detailed results across different configurations, providing a clear comparison with baseline methods and LoRA-based approaches, which strengthens the validity of the findings.
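The reported relative improvement follows directly from the two macro EER values quoted above; the arithmetic is written out here only as a consistency check on those figures.

```latex
\[
\frac{\mathrm{EER}_{\text{baseline}} - \mathrm{EER}_{\text{MoE}}}{\mathrm{EER}_{\text{baseline}}}
= \frac{5.46 - 4.81}{5.46}
= \frac{0.65}{5.46}
\approx 0.119 = 11.9\%
\]
```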
The implementation details are sufficiently described, including the training protocols, datasets, and evaluation metrics. However, the paper lacks a direct link to the code or models, which could hinder reproducibility for other researchers. Providing a project URL would enhance this aspect significantly.
While the proposed MoE architecture shows improved performance, the analysis of expert specialization indicates that there is no clear routing specialization across different synthesizers. This could suggest limitations in the model's ability to adapt to various spoofing techniques. Additionally, the increased number of parameters in the MoE configuration may raise concerns about model efficiency.
The work addresses a critical challenge in the field of speech processing, particularly in the context of anti-spoofing, which has significant implications for security and trust in voice technologies. The findings could influence future research directions in robust speech recognition and synthesis, as well as applications in security systems.
Recent alignment-free non-autoregressive (NAR) text-to-speech (TTS) models formulate synthesis as a conditional infilling task, bypassing explicit duration predictors and external aligners. When speech is represented with neural codec tokens, the infilling problem becomes discrete, making Discrete Flow Matching (DFM), a Continuous-Time Markov Chain (CTMC) framework for discrete generation, a natural fit. However, inference-time control for stable low-step conditional infilling remains underexplored. We propose Mask, Sample, Revise, an inference-time CTMC stack for alignment-free DFM-TTS. The stack combines predictor-free guidance to strengthen text conditioning, prompt-matched conditional coupling to align the probability path with the acoustic prompt, and SC-ReMask, a schedule-constrained remasking mechanism that introduces token-to-mask transitions so early de-masking decisions can be revised. These components require no post-hoc fine-tuning and operate in a single tau-leaping sampler. Controlled ablations show that this stack improves intelligibility and robustness in the low-NFE prompted setting, outperforming unguided and guidance-only samplers with substantially more steps.
Primary: Federal University of Goiás
All Institutions: Federal University of Goiás, Elsa Speak
The paper presents G-DFlow-TTS, an innovative alignment-free text-to-speech system that significantly improves intelligibility and robustness through a novel inference stack. The methodology is well-articulated, and the experimental results demonstrate the effectiveness of the proposed approach, marking a meaningful contribution to the field of machine learning in audio synthesis.
The proposed methodology introduces a novel inference-time control stack for non-autoregressive text-to-speech synthesis, leveraging Continuous-Time Markov Chains (CTMC) to enhance discrete flow matching. The integration of predictor-free guidance, conditional coupling, and a remasking mechanism (SC-ReMask) represents a significant advancement in the field, allowing for revisable token generation. The approach is well-structured and addresses the limitations of existing models by focusing on inference-time controls rather than solely increasing sampling steps.
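The guidance component mentioned above can be illustrated with a minimal sketch. It assumes the common classifier-free-style form in logit space, where conditional and unconditional logits are blended with a weight `w`. The function names, the weight, and the toy logits are hypothetical and are not taken from the paper, whose CTMC formulation may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_probs(cond_logits, uncond_logits, w):
    """Blend logits as (1 + w) * cond - w * uncond, then normalize.

    w = 0 recovers the purely conditional distribution; larger w pushes
    probability mass toward tokens favored by the conditioning.
    This blending rule is an assumption, not the paper's stated method.
    """
    blended = [(1.0 + w) * c - w * u for c, u in zip(cond_logits, uncond_logits)]
    return softmax(blended)

# Toy logits over a three-token vocabulary (hypothetical values).
cond = [2.0, 0.0, 0.0]
uncond = [0.0, 0.0, 0.0]
plain = guided_probs(cond, uncond, w=0.0)
guided = guided_probs(cond, uncond, w=1.0)
```

With these toy values, the first token receives more probability mass under `guided` than under `plain`, which is the intended effect of strengthening the conditioning signal.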
The experiments are robust, utilizing both objective metrics (WER, CER) and subjective evaluations (MOS) to assess the performance of the proposed G-DFlow-TTS system against baselines. The controlled ablations provide clear insights into the contributions of each component of the proposed stack, demonstrating significant improvements in intelligibility and robustness. The use of a well-defined dataset (LibriSpeech) further strengthens the evaluation.
The paper provides detailed implementation specifics, including model architecture, training parameters, and evaluation protocols, which enhances reproducibility. However, the lack of a public code repository limits the ability for independent verification of results.
One notable limitation is the reliance on a single dataset for evaluation, which may not fully capture the generalizability of the model across diverse speech patterns and languages. Additionally, the paper acknowledges that speaker similarity remains below that of larger external systems, indicating potential areas for improvement.
The advancements in alignment-free text-to-speech synthesis have significant implications for applications in voice assistants, audiobooks, and accessibility technologies. The ability to revise token decisions during generation could lead to more natural-sounding speech synthesis, enhancing user experience in various audio applications.
We present MaskedFOP, a system for closed-set polyglot speaker identification under two simultaneous challenges: the face modality is entirely absent at test time, and speech comes from Urdu, a language unseen during face-supervised training. The system integrates three complementary mechanisms. First, a modality-dropout dual-head network built on the Fusion and Orthogonal Projection (FOP) backbone forces the audio branch to develop independent discriminative power via per-sample face masking, ensuring that the audio encoder remains capable when face is absent. Second, two MaskedFOP instances trained on Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) features with different random seeds produce complementary audio embeddings whose element-wise average yields a more robust 512-dimensional representation than any single model. Third, a two-stage cascaded inference procedure first refines multimodal labels through a fused Graph Label Propagation (GLP) pass (Stage 1), then assigns audio-only labels by cosine nearest-centroid (Stage 2), replacing the 70 sparse training prototypes with ~1,500 in-domain test-set centroids from Stage 1. Submitted to the POLY-SIM 2026 Grand Challenge, the system achieves a mean P-accuracy of 0.9989, placing first among all submissions evaluated on the challenge server. An ablation identifies cascaded seeding as the single largest gain (>8 pp on P4/P6). The code is available at https://github.com/Ayoub-Elkhouzari/POLY-SIM2026.
Primary: University Mohammed VI Polytechnic
All Institutions: University Mohammed VI Polytechnic
The paper presents MaskedFOP, a novel system for polyglot speaker identification that excels in scenarios with missing visual modalities, achieving state-of-the-art performance in a challenging evaluation setting. The integration of advanced techniques such as modality dropout, multi-seed averaging, and cascaded label propagation showcases a significant advancement in the field of speaker recognition and multimodal learning.
The methodology presented in MaskedFOP is innovative, integrating a dual-head modality-dropout network with a cascaded graph label propagation approach. The use of per-sample face masking during training effectively enhances the audio branch's robustness when the visual modality is absent. The multi-seed averaging technique for audio embeddings further improves the stability of the representations. The two-stage inference process, which refines multimodal labels before assigning audio-only labels, is a significant advancement in handling missing modalities in speaker identification.
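The multi-seed averaging and the cosine nearest-centroid assignment of Stage 2 can be sketched as follows. This is a simplified illustration under stated assumptions: the Stage 1 labels are taken as given (the graph label propagation step is not modeled), embeddings are tiny 2-dimensional toy vectors rather than 512-dimensional ones, and all function names and values are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def average_embeddings(a, b):
    """Element-wise average of two embeddings from differently seeded models."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(x * y for x, y in zip(u, v))

def build_centroids(embeddings, labels):
    """Mean embedding per label, e.g. from samples labeled in an earlier stage."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(emb))
        sums[lab] = [s + e for s, e in zip(acc, emb)]
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

def nearest_centroid(embedding, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    return max(centroids, key=lambda lab: cosine(embedding, centroids[lab]))

# Toy usage with hypothetical speakers and embeddings.
centroids = build_centroids(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    ["spk_a", "spk_a", "spk_b"],
)
query = average_embeddings([1.0, 0.2], [0.8, 0.0])
predicted = nearest_centroid(query, centroids)
```

Building centroids from confidently labeled in-domain samples, rather than from a small set of training prototypes, is the design choice the abstract credits with the largest gain; the sketch only shows the mechanics of that assignment step.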
The experimental evaluation is thorough, leveraging the POLY-SIM 2026 Grand Challenge dataset, which provides a robust benchmark for assessing the model's performance under challenging conditions. The reported mean P-accuracy of 0.9989 is impressive, particularly given the complexities of cross-lingual speaker identification and missing visual modalities. The ablation studies effectively demonstrate the contributions of each component, highlighting the importance of the cascaded seeding strategy.
The paper provides sufficient implementation details, including hyperparameters and training procedures, which are crucial for reproducibility. The availability of the code on GitHub further enhances the potential for other researchers to replicate the results. However, the reliance on fixed pre-extracted features might limit the adaptability of the approach to other datasets or modalities.
The primary limitations include the closed-set assumption, which may not generalize to open-set scenarios, and the dependence on English-trained features, which could affect performance on other languages or dialects. Additionally, the transductive nature of the inference process requires the entire unlabeled test partition at once, which may not be feasible in all applications.
The proposed system has significant implications for biometric recognition systems, particularly in multilingual and multimodal contexts. It could enhance applications in security, user authentication, and personalized services where speaker identification is critical. The methodology could also inspire future research in cross-modal learning and robust speaker recognition under varying conditions.
We show that the three movements of Beethoven's "Moonlight Sonata" (Op. 27 No. 2) instantiate three distinct machine learning architectures -- not by analogy, but by structural correspondence. Through computational analysis of the score (entropy, Jensen-Shannon divergence, dissonance, hand distributional overlap, self-similarity matrices, temporal memory decay, and contextual pitch embeddings), we establish four counterintuitive findings: (1) perceived musical "temperature" is governed by throughput, not distributional width; (2) the lightest movement carries the highest dissonance; (3) the movements implement streaming, recurrent, and periodic positional encoding memory architectures; and (4) the same pitch class acquires different contextual identities across movements, analogous to contextual vs. static embeddings in NLP -- and unsupervised clustering recovers the tonal structure without music-theoretic input. We construct a reverse sonification (decoding analytical features back into MIDI) and quantify the chirality of the encode-decode cycle: what distributions preserve and sequential ordering destroys. Prompted by a listener's observation that the decoded piece sounds like "mirror isomers that can't be superimposed," the chirality measurement reveals reconstruction loss increasing monotonically with n-gram order. Bootstrap baselines and subsample checks confirm all movements carry sequential information above noise, though raw values are confounded by sample size. Cross-domain comparison shows natural language has higher chirality than music, reflecting stronger sequential constraints.
Primary: Claude Code / Opus 4.6
All Institutions: Claude Code / Opus 4.6, API / Fable 5, Independent researcher
The main contribution of this paper is the establishment of a formal structural isomorphism between Beethoven's "Moonlight Sonata" and machine learning architectures, revealing deep connections between music and computational mechanisms. This work significantly advances the understanding of both domains and proposes a novel methodology that integrates human perception with computational analysis, offering new insights into the nature of music and its relationship with machine learning.
The paper employs a novel approach by establishing a structural isomorphism between musical compositions and machine learning architectures, utilizing various computational analyses such as Shannon entropy, Jensen-Shannon divergence, and self-similarity matrices. The methodology is rigorous, employing both quantitative and qualitative analyses, and introduces a reverse sonification process that allows for the exploration of chirality in the encode-decode cycle. This feedback loop between human perception and computational analysis is a significant methodological innovation.
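Two of the distributional measures named above, Shannon entropy and Jensen-Shannon divergence, can be sketched on pitch-class histograms. This is an illustrative sketch only: the choice of pitch classes (MIDI pitch modulo 12) as the symbol alphabet, the function names, and the toy note lists are assumptions and do not reproduce the authors' pipeline.

```python
import math
from collections import Counter

def pitch_class_distribution(midi_pitches):
    """Normalized 12-bin histogram of pitch classes (MIDI pitch modulo 12)."""
    if not midi_pitches:
        raise ValueError("need at least one pitch")
    counts = Counter(p % 12 for p in midi_pitches)
    total = sum(counts.values())
    return [counts.get(pc, 0) / total for pc in range(12)]

def shannon_entropy(p):
    """Entropy in bits; zero-probability bins contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy note lists (hypothetical, not taken from the score).
chromatic = pitch_class_distribution(list(range(60, 72)))  # all 12 classes once
only_c = pitch_class_distribution([60, 72, 48])             # pitch class 0 only
only_g = pitch_class_distribution([67, 55])                 # pitch class 7 only
```

A uniform histogram gives the maximum entropy of log2(12) bits, and two histograms with no shared pitch classes give the maximum divergence of 1 bit, which makes both measures easy to sanity-check before applying them to real movements.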
The experiments conducted are comprehensive, analyzing Beethoven's "Moonlight Sonata" across its three movements. The authors provide a detailed breakdown of the metrics used, including entropy, dissonance, and memory decay, and validate their findings through bootstrap baselines and subsampling checks. The results are presented clearly, demonstrating the structural correspondences between music and ML mechanisms, with counterintuitive findings that challenge existing assumptions about music theory and machine learning.
The paper includes a repository with all code, data, figures, and generated MIDI files, which enhances reproducibility. However, the analysis operates at a symbolic level rather than a signal level, which may limit the ability to fully reproduce the auditory experience of the original music. The authors transparently report their methods and findings, including limitations related to sample size and the metrics used.
The analysis is limited to symbolic representations of music, neglecting aspects like timbre and dynamics that are crucial for a complete understanding of musical perception. Additionally, the reverse sonification process simplifies rhythmic structures, which could lead to a loss of important musical information. The chirality measurement is also bounded by n-gram order, potentially overlooking higher-order dependencies in musical structure.
This research has the potential to influence multiple fields, including computational musicology, machine learning, and cognitive neuroscience. By establishing a formal correspondence between music and ML mechanisms, it opens avenues for interdisciplinary research and applications, such as improved music generation models and enhanced understanding of musical cognition. The findings could also inspire new methodologies in analyzing other forms of art and complex systems.
This paper investigates the fragility of post-hoc explanation methods in audio deepfake detection. While previous work on explanation manipulation focused on images using standard $L_p$ metrics, we introduce a psychoacoustic framework that optimizes inaudible perturbations to decouple model attributions from final classifications. We evaluate this vulnerability across state-of-the-art architectures under strict prediction-preserving constraints. By evaluating the manipulation cost through domain-specific perceptual audio quality metrics alongside explanation alignment criteria, our framework demonstrates that an adversary can systematically distort automated explanation heatmaps while preserving the predicted deepfake label. Full code available at: https://github.com/cncPomper/Audio-XAI
Primary: Warsaw University of Technology
All Institutions: Warsaw University of Technology
This paper provides a crucial investigation into the vulnerabilities of audio deepfake detection systems, demonstrating that attribution maps can be manipulated while preserving predictions and audio quality. The innovative psychoacoustic approach and thorough experimental evaluation contribute significantly to the understanding of explainability in audio models, marking a step forward in the field.
The paper introduces a novel psychoacoustic framework for manipulating audio model attributions while preserving predictions. This approach is innovative as it adapts adversarial attacks from the image domain to audio, incorporating perceptual metrics that are more relevant to human auditory perception. The methodology is well-structured, utilizing a combination of established XAI techniques (Grad-CAM and LRP) and new psychoacoustic constraints, which is a significant advancement in the field of audio explainability.
The experiments are comprehensive, utilizing a diverse set of architectures and a well-defined dataset (SONICS). The evaluation of the manipulation cost through perceptual audio quality metrics is particularly noteworthy, as it aligns the technical assessment with human auditory experience. The results clearly demonstrate the effectiveness of the proposed method in manipulating attribution maps while maintaining high audio fidelity, which is a crucial aspect for practical applications in audio deepfake detection.
The authors provide a GitHub repository with full code and configurations, which enhances reproducibility. However, the paper could benefit from more detailed documentation on the experimental setup and hyperparameter choices to facilitate easier replication of results by other researchers.
One limitation is the focus on specific architectures and datasets, which may not generalize across all audio models or applications. Additionally, while the psychoacoustic framework is innovative, the paper does not extensively discuss potential countermeasures against such attacks, which could be critical for real-world applications.
The findings have significant implications for the field of explainable AI, particularly in audio applications. By highlighting the vulnerabilities in current explanation methods, this research can inform the development of more robust and trustworthy audio classification systems. The work also raises ethical considerations regarding the potential misuse of adversarial techniques in manipulating model interpretations.
Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.
Primary: Amazon AGI
All Institutions: Amazon AGI, IIT Kharagpur
The main contribution of this paper is the introduction of ModeratorLM, a role-playing voice agent that enhances turn-taking in multi-party conversations through role conditioning and reasoning. This work represents a significant advancement in the field of conversational AI, addressing a critical challenge in multi-party interactions and providing a novel dataset for future research.
The proposed methodology introduces ModeratorLM, a role-playing voice agent that utilizes a speech large language model (LLM) to manage turn-taking in multi-party conversations. The approach is innovative in its use of role conditioning to influence turn-taking behavior, which is a significant advancement over traditional models that do not consider role dynamics. The integration of chain-of-thought reasoning in the ModeratorLM-Think variant adds an additional layer of sophistication, allowing the model to better interpret conversational context. The construction of the RolePlayConv dataset is also a notable contribution, as it provides a tailored resource for training and evaluating role-conditioned agents in multi-party settings. However, the reliance on synthetic data may raise questions about the generalizability of the findings.
The experiments conducted demonstrate a clear improvement in turn-taking precision and recall when using the ModeratorLM models compared to non-role-conditioned baselines. The use of both real-world meeting data and the synthetic RolePlayConv dataset strengthens the evaluation. The metrics reported, including precision, recall, F1-score, and reactive miss rate, provide a comprehensive view of the model's performance. The ablation studies further validate the importance of dynamic chunking and the role of reasoning in enhancing model performance. However, the lack of extensive human evaluations beyond the small-scale study may limit the robustness of the claims regarding role fidelity.
The paper provides a detailed description of the training and evaluation setup, including the architecture of the models, the dataset construction process, and the evaluation metrics. However, there is no mention of code or data availability, which is crucial for reproducibility in machine learning research. The absence of a demo or project URL also hinders the ability for others to replicate the work.
One significant limitation is the reliance on the synthetic RolePlayConv dataset for training, which may not fully capture the complexities of real-world multi-party conversations. Additionally, while the model shows improved performance in turn-taking, it remains conservative, missing some valid response opportunities, which could affect user experience in practical applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The development of role-conditioned voice agents has the potential to significantly enhance the usability of conversational AI in various applications, such as virtual assistants, customer service, and collaborative tools. By improving turn-taking behavior, these agents can facilitate more natural and effective interactions in multi-party settings. However, ethical considerations regarding the deployment of such technology, especially in sensitive contexts, must be carefully evaluated.
Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.
Primary: Huawei Technologies, China
All Institutions: Huawei Technologies, China
The main contribution of this paper is the development of an end-to-end model for multi-talker ASR that balances ASR and diarization tasks through innovative architecture and loss function design. This work represents a meaningful advancement in the field of speech recognition, particularly in handling overlapping speech, and demonstrates the potential of LLMs in improving speaker attribution accuracy.
The paper introduces a dual-encoder architecture that effectively extracts semantic and speaker features, employing innovative strategies such as feature interleaving and a length-aware speaker ID loss. The adaptive threshold strategy for ASR loss computation is particularly noteworthy, as it addresses the common issue of hallucinations in overlapping speech. The methodology is well-structured and demonstrates a clear understanding of the challenges in multi-talker ASR and diarization.
The experiments are comprehensive, utilizing two significant corpora (AliMeeting and Aishell4) to validate the proposed methods. The reported improvements over baseline systems are substantial, with relative gains of 18% and 24% in performance metrics. The evaluation metrics, including Character Error Rate (CER) and concatenated minimum-permutation character error rate (cpCER), are appropriate for assessing the effectiveness of the system.
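For readers unfamiliar with the metrics, the sketch below computes a plain Character Error Rate from Levenshtein distance and shows how a relative improvement figure is derived. It covers single-stream CER only (the speaker-permutation search used by cpCER is not modeled), and the strings and error rates are toy values, not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)

def relative_improvement(baseline_error, system_error):
    """Fractional error reduction relative to the baseline."""
    return (baseline_error - system_error) / baseline_error

# Toy values for illustration only.
toy_cer = cer("abcd", "abxd")                 # one substitution in four characters
toy_gain = relative_improvement(10.0, 8.2)    # hypothetical error rates in percent
```

A relative improvement is always stated against the baseline error, so an 18% relative gain from a hypothetical 10.0% baseline corresponds to an absolute reduction of 1.8 points.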
The paper provides sufficient details about the model architecture, training process, and evaluation metrics, which facilitates reproducibility. However, the absence of publicly available code or datasets limits the ability of other researchers to replicate the findings fully.
One limitation is the reliance on limited real-recorded data, which may affect the generalizability of the model. Additionally, while the adaptive loss masking strategy shows promise, its effectiveness in more diverse or challenging datasets remains to be validated.
The proposed system has significant implications for real-world applications in multi-talker environments, such as meetings and conferences, where accurate speaker attribution is crucial. The integration of ASR and diarization in a unified model could enhance various applications, including automated transcription services and interactive voice response systems.
Large language model (LLM)-based text-to-speech (TTS) systems enable prompt-conditioned emotional control but struggle with fine-grained emotion intensity due to the semantic-acoustic gap between text and speech. To address this challenge, we formulate emotion intensity control in LLM-based TTS as a learning-to-rank problem and propose Emo-LiPO, a listwise preference optimization framework that aligns prompt-conditioned speech generation with relative emotion intensity expressed in text. Emo-LiPO explicitly models global intensity ordering within each emotion under fixed transcripts, enabling more faithful and continuous emotional expression. We further construct ESD-plus, a multi-speaker dataset with explicit emotion intensity variations, to support fine-grained emotion modeling and evaluation. Experiments on ESD-plus demonstrate that Emo-LiPO significantly improves emotion accuracy and intensity controllability over both supervised- and DPO-based LLM TTS baselines, with particularly pronounced gains at high intensity levels.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Agency for Science, Technology and Research, National University of Singapore, Shenzhen Loop Area Institute, Shenzhen Research Institute of Big Data
The main contribution of this paper is the introduction of Emo-LiPO, a listwise preference optimization framework that significantly enhances fine-grained emotion intensity control in LLM-based TTS systems. This work addresses critical challenges in the field, providing a novel methodological approach and a valuable dataset for future research.
The paper proposes Emo-LiPO, a novel listwise preference optimization framework that reformulates emotion intensity control in LLM-based TTS as a learning-to-rank problem. This approach is innovative as it explicitly models global intensity ordering, addressing the semantic-acoustic gap in existing methods. The methodology is well-structured, including a comprehensive description of the problem formulation, the construction of the ESD-plus dataset, and the multi-stage optimization process. The use of a rule-based preference construction strategy for generating training data is a significant strength, as it allows for a more controlled and systematic evaluation of emotion intensity.
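To make the learning-to-rank framing concrete, the sketch below implements a generic listwise objective: the Plackett-Luce negative log-likelihood, also known as ListMLE. This is a stand-in chosen for illustration, not the loss used by Emo-LiPO, which the text above does not specify; the function name and the toy scores are hypothetical.

```python
import math

def listmle_loss(scores_in_target_order):
    """Plackett-Luce negative log-likelihood of a target ordering.

    The input holds model scores arranged so that index 0 is the item that
    should rank first (for example, the highest emotion intensity).
    The loss is small when the scores already decrease along that order.
    """
    scores = list(scores_in_target_order)
    loss = 0.0
    for i in range(len(scores)):
        tail = scores[i:]
        mx = max(tail)
        # Stable log-sum-exp over the items not yet placed in the ranking.
        log_z = mx + math.log(sum(math.exp(s - mx) for s in tail))
        loss += log_z - scores[i]
    return loss

# Toy scores for three intensity levels of one emotion (hypothetical).
well_ordered = listmle_loss([3.0, 2.0, 1.0])   # scores agree with the target order
mis_ordered = listmle_loss([1.0, 2.0, 3.0])    # scores contradict the target order
```

Unlike a pairwise objective that compares two samples at a time, a listwise loss of this kind scores the whole ordering at once, which matches the stated goal of modeling global intensity ordering within each emotion.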
The experiments are robust, utilizing both automatic and human evaluations to assess the performance of Emo-LiPO against multiple baselines. The results demonstrate significant improvements in emotion accuracy and intensity controllability, particularly at higher intensity levels. The inclusion of various metrics for evaluation, such as WER, NISQA, and human preference comparisons, adds depth to the experimental assessment. The dataset ESD-plus is well-constructed, providing a solid foundation for evaluating the proposed method.
The paper provides a link to the GitHub repository containing the code, which is a positive aspect for reproducibility. However, detailed implementation specifics, such as hyperparameters and training configurations, are not fully disclosed in the text, which could pose challenges for other researchers attempting to replicate the results.
One limitation is the reliance on a single dataset (ESD-plus) for evaluation, which may affect the generalizability of the findings. Additionally, while the method shows improvements in emotion intensity control, the paper does not extensively discuss potential biases in the dataset or the implications of the rule-based preference construction strategy.
The Emo-LiPO framework has significant implications for the development of more expressive and controllable TTS systems, enhancing applications in areas such as virtual assistants, audiobooks, and entertainment. By improving fine-grained emotion intensity control, this research could lead to more engaging and human-like interactions in various audio applications.
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Carnegie Mellon University
The paper presents a significant advancement in low-latency spoken dialogue systems through the introduction of endpoint anticipation, which allows for proactive processing of user speech. This innovative approach, combined with a robust evaluation framework, positions the work as a valuable contribution to the field of audio and machine learning.
The paper introduces a novel approach to endpoint anticipation in spoken dialogue systems, shifting from reactive to proactive detection of end-of-turn signals. The dual-stream audio representation and the use of independent binary classification tasks for different anticipation horizons are well-structured and innovative. The proposed metrics for evaluating the trade-off between latency reduction and computational redundancy are a significant contribution to the field, allowing for a more nuanced understanding of system performance. The integration with the Unmute framework demonstrates practical applicability, although the paper could benefit from clearer explanations of the model architecture and training procedures.
The evaluation is thorough, utilizing two diverse datasets (SpokenWOZ and Switchboard) to assess the model's performance across various anticipation horizons. The results show a consistent improvement over the VAP baseline, with a notable average latency reduction of 505 ms. The introduction of specific metrics like Median Realized Anticipation and Expected Redundant Computation provides valuable insights into the model's efficiency and effectiveness. However, the paper could enhance its experimental rigor by including more comprehensive ablation studies to analyze the impact of different components of the model.
The authors mention that they will open-source their implementation, which is a positive step towards reproducibility. However, the paper lacks detailed information on hyperparameter tuning, model training specifics, and the exact configurations used in experiments, which could hinder replication efforts by other researchers.
One limitation is the reliance on specific datasets, which may not capture the full variability of real-world conversational speech. Additionally, while the model shows promise in structured dialogues, its performance in more spontaneous, open-domain conversations remains uncertain. The trade-off between latency reduction and computational redundancy, while quantified, may still lead to inefficiencies in certain scenarios, especially in longer dialogues.
The proposed framework has significant implications for real-time spoken dialogue systems, particularly in applications requiring low-latency interactions, such as virtual assistants and customer service bots. By enabling speculative execution of downstream processes, the model could enhance user experience in conversational AI, making interactions feel more natural and responsive. The open-source nature of the project may also foster further research and development in this area.
Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural synthesis framework eliminating real audio recordings during pre-training. AudioPG trains a Transformer-based masked autoencoder on waveforms generated on-the-fly from basic acoustic primitives and composition rules. The encoder transfers effectively to real audio benchmarks, achieving 90.60% accuracy on ESC-50, 0.546 mAP on FSD50K, 88.17% on UrbanSound8K, and 97.03% on Speech Commands V2. Notably, pre-training completes in under 20 minutes on a single GPU. Latent space analysis reveals physical factors, including fundamental frequency and relative intensity, emerge in orthogonal subspaces, making representations linearly decodable. These results establish procedural synthesis as an efficient, interpretable pre-training signal when large-scale corpora are unavailable. Our code is available at: https://github.com/Freyliu0516/audioPG.
Primary: East China Normal University
All Institutions: East China Normal University, Fudan University, Shanghai Jiao Tong University, Southeast University
The paper presents AudioPG, a procedural synthesis framework for audio representation learning that eliminates the need for real recordings. This innovative approach not only enhances efficiency and interpretability in audio learning but also opens new avenues for research in self-supervised learning and audio synthesis.
The methodology presented in the paper is innovative, utilizing a procedural audio synthesis framework (AudioPG) that generates audio waveforms on-the-fly without relying on real-world audio recordings. This approach leverages basic acoustic primitives and composition rules, allowing for a systematic exploration of audio representation learning. The use of a Transformer-based masked autoencoder to reconstruct log-Mel spectrograms is a well-established technique, but the novelty lies in the complete detachment from real data during pre-training, which is a significant advancement in the field. The detailed description of the procedural synthesizer and its components showcases a robust understanding of sound synthesis principles, enhancing the interpretability of the learned representations.
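A minimal sketch of what on-the-fly generation from acoustic primitives and composition rules might look like is given below. It is not AudioPG's synthesizer: the primitive set (harmonic tones and noise bursts), the exponential decay envelope, the parameter ranges, and every function name are assumptions chosen only to make the idea concrete.

```python
import math
import random
from typing import List, Tuple


def harmonic_tone(f0: float, n_harmonics: int, duration: float, sr: int) -> List[float]:
    """Sum of harmonics of f0 with 1/k amplitude roll-off."""
    return [
        sum(math.sin(2 * math.pi * f0 * k * (i / sr)) / k
            for k in range(1, n_harmonics + 1))
        for i in range(int(duration * sr))
    ]


def noise_burst(duration: float, sr: int, rng: random.Random) -> List[float]:
    """Uniform white noise in [-1, 1]."""
    return [rng.uniform(-1.0, 1.0) for _ in range(int(duration * sr))]


def apply_decay(x: List[float], sr: int, tau: float) -> List[float]:
    """Exponential amplitude decay with time constant tau (seconds)."""
    return [v * math.exp(-(i / sr) / tau) for i, v in enumerate(x)]


def compose(layers: List[Tuple[float, float, List[float]]],
            total_sec: float, sr: int) -> List[float]:
    """Mix (onset_sec, gain, samples) layers into one peak-normalized clip."""
    mix = [0.0] * int(total_sec * sr)
    for onset, gain, sig in layers:
        start = int(onset * sr)
        for i, v in enumerate(sig):
            if start + i < len(mix):
                mix[start + i] += gain * v
    peak = max(abs(v) for v in mix) or 1.0  # avoid dividing by zero on silence
    return [v / peak for v in mix]


def random_clip(rng: random.Random, sr: int = 8000, total_sec: float = 1.0) -> List[float]:
    """Draw one to three random primitives and layer them at random onsets."""
    layers = []
    for _ in range(rng.randint(1, 3)):
        dur = rng.uniform(0.1, 0.5)
        if rng.random() < 0.5:
            sig = harmonic_tone(rng.uniform(80.0, 400.0), rng.randint(1, 5), dur, sr)
        else:
            sig = noise_burst(dur, sr, rng)
        sig = apply_decay(sig, sr, tau=rng.uniform(0.05, 0.3))
        layers.append((rng.uniform(0.0, total_sec - dur), rng.uniform(0.2, 1.0), sig))
    return compose(layers, total_sec, sr)


clip = random_clip(random.Random(0))
```

Since the generating parameters (fundamental frequency, gain, onset) are known for every clip, a setup of this kind makes it straightforward to probe whether such factors are linearly decodable from the learned representation, which is the type of analysis the abstract reports.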
The experimental evaluation is thorough, with performance metrics reported on multiple real-world benchmarks (ESC-50, UrbanSound8K, FSD50K, and Speech Commands V2). The results demonstrate that the AudioPG framework achieves competitive accuracy levels, indicating effective transfer learning capabilities from synthetic to real audio tasks. The paper also includes an ablation study that quantifies the contributions of various synthesizer components, providing insights into the model's performance dynamics. However, the reliance on a single GPU for pre-training may limit the generalizability of the findings to larger-scale applications.
The paper includes a link to the code repository, which is essential for reproducibility. However, the details regarding the specific configurations and hyperparameters used during training and evaluation could be more explicit to facilitate easier replication of the results by other researchers. The description of the datasets and evaluation protocols is adequate, but clearer guidelines on the setup would enhance reproducibility.
One limitation identified in the study is the semantic gap between the physical attributes captured by the procedural generator and the high-level semantic categories required for accurate classification in real-world tasks. The model struggles with fine-grained distinctions in audio classification, particularly in cases where acoustic similarities lead to misclassifications. Additionally, the lack of high-level semantic modeling in the procedural synthesis may restrict its applicability in more complex audio understanding tasks.
The potential applications of this research are significant, particularly in scenarios where large-scale audio datasets are unavailable due to privacy or resource constraints. The procedural generation approach could democratize access to audio representation learning, enabling researchers and practitioners to develop models without the burden of extensive data curation. Furthermore, the insights gained from the latent space analysis may inform future work in audio synthesis and representation learning, bridging the gap between physical sound properties and semantic understanding.
Training neural networks (NNs) for speech enhancement (SE) in distant speech-capturing scenarios requires paired distorted and clean reference speech signals. While such data are often generated through simulation, the mismatch between simulated and real recordings significantly limits SE accuracy. To address this issue, we propose Close-to-Distant microphone Projection (C2D projection), a method that generates paired data from real recordings captured by close and distant microphones. C2D projection estimates an optimal projection matrix that transforms close-microphone inputs into clean reference signals aligned with distant-microphone recordings, while simultaneously performing denoising. We show this projection can be effectively realized using a variant of the Parametric Multichannel Wiener Filter (PMWF). Experimental results demonstrate that an NN trained with C2D-projected data outperforms the state-of-the-art Guided Source Separation (GSS) on the challenging CHiME6 dinner party ASR task under oracle diarization, when using the enhanced output from GSS as an auxiliary input to the NN.
Primary: NTT, Inc.
All Institutions: NTT, Inc.
The paper presents a novel approach to generating training targets for speech enhancement in real-world scenarios, significantly improving upon existing methods. The technical contributions, particularly in the formulation of the C2D projection method and its empirical validation, highlight its potential impact on the field of audio processing and machine learning.
The proposed Close-to-Distant microphone Projection (C2D projection) method is a significant advancement in generating training targets for speech enhancement in real-world scenarios. The methodology effectively addresses the challenge of obtaining paired clean and distorted speech signals by leveraging recordings from close and distant microphones. The use of a projection matrix derived from a variant of the Parametric Multichannel Wiener Filter (PMWF) is innovative, as it allows for simultaneous denoising and alignment of signals, which is crucial for training neural networks. The paper provides a clear mathematical formulation and rationale for the method, making it accessible for replication and further research.
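For orientation, the textbook form of the Parametric Multichannel Wiener Filter is reproduced below. This is the standard formulation, not the variant derived in the paper, whose exact form is not given in this summary; the symbols are the conventional ones and are assumptions with respect to the paper's notation.

```latex
% Standard PMWF for one frequency bin (conventional notation, assumed here):
%   \Phi_x : spatial covariance matrix of the target speech
%   \Phi_v : spatial covariance matrix of the noise
%   u      : one-hot vector selecting the reference channel
%   \beta  : trade-off between noise reduction and speech distortion
\mathbf{h}_{\beta}
  = \frac{\boldsymbol{\Phi}_{v}^{-1}\boldsymbol{\Phi}_{x}}
         {\beta + \operatorname{tr}\!\left(\boldsymbol{\Phi}_{v}^{-1}\boldsymbol{\Phi}_{x}\right)}
    \,\mathbf{u},
\qquad
\hat{s} = \mathbf{h}_{\beta}^{\mathsf{H}}\,\mathbf{y}
```

Setting \beta = 0 recovers the MVDR beamformer and \beta = 1 the multichannel Wiener filter, so \beta controls how aggressively noise is suppressed relative to the distortion introduced in the target signal.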
The experimental evaluation is robust, utilizing the CHiME6 and CHiME8 datasets, which are well-regarded benchmarks in the field of speech enhancement and automatic speech recognition. The results demonstrate that the C2D projection method outperforms the state-of-the-art Guided Source Separation (GSS) approach under both matched and mismatched conditions. The use of objective metrics such as tcpWER and DNSMOS adds credibility to the findings. However, the paper could benefit from additional qualitative assessments or user studies to further validate the improvements in speech intelligibility and quality.
The paper includes sufficient detail regarding the implementation of the C2D projection method and the training of the neural network, referencing publicly available code for the model. However, the reproducibility could be enhanced by providing more explicit details on the training process, hyperparameters, and the specific configurations used for the experiments, as well as making the generated datasets available.
One limitation noted is the reliance on oracle diarization labels during training, which may not be feasible in practical applications. Additionally, while the method shows robustness against some mismatches in training and test conditions, there are scenarios where performance degradation is observed, indicating that further work is needed to enhance the method's adaptability to diverse environments.
The C2D projection method has significant implications for real-world applications in speech enhancement, particularly in environments where distant microphones are used, such as in meetings or public speaking events. The ability to generate high-quality training targets from real recordings can lead to improved performance in automatic speech recognition systems, potentially enhancing user experiences in various audio-based applications. The findings could also inspire further research into novel training techniques for other audio processing tasks.
Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.
Primary: Tsinghua University
All Institutions: Tsinghua University, Pengcheng Laboratory
The paper presents self-guidance, a novel training mechanism that enhances the fidelity of neural speech codecs by aligning decoder outputs for quantized and continuous latent representations. This contribution is significant as it addresses a critical bottleneck in audio coding, providing a practical solution that improves reconstruction quality while simplifying downstream language modeling tasks.
The proposed self-guidance mechanism introduces a novel approach to enhance the robustness of VQ-VAE-based neural speech codecs against quantization artifacts. By aligning the decoder's internal feature manifolds through a lightweight feature-mapping loss, the methodology effectively mitigates the impact of quantization error without requiring significant changes to the model architecture or inference process. This innovative approach is well-justified and supported by thorough theoretical grounding and empirical validation.
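The alignment idea can be sketched as a feature-matching term between two passes through the same decoder, one fed the quantized latent and one fed the original continuous latent. The toy below is an illustration only: the squared-error distance, the equal weighting of layers, the nearest-neighbour quantizer, and all names are assumptions, and a real implementation would operate on tensors with gradients (presumably treating the continuous branch as the target).

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def quantize(z: Vector, codebook: List[Vector]) -> Vector:
    """Nearest-neighbour lookup in a toy codebook (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(z, c)))


def decoder_features(x: Vector, layers: List[Layer]) -> List[Vector]:
    """Run the decoder layer by layer and keep every intermediate activation."""
    feats = []
    h = x
    for layer in layers:
        h = layer(h)
        feats.append(h)
    return feats


def feature_mapping_loss(z: Vector, codebook: List[Vector], layers: List[Layer]) -> float:
    """Mean over layers of the MSE between quantized-path and continuous-path features."""
    f_quant = decoder_features(quantize(z, codebook), layers)
    f_cont = decoder_features(z, layers)
    per_layer = [
        sum((a - b) ** 2 for a, b in zip(fq, fc)) / len(fq)
        for fq, fc in zip(f_quant, f_cont)
    ]
    return sum(per_layer) / len(per_layer)


# Toy decoder: scale by 2, then shift by 1.
toy_layers: List[Layer] = [lambda v: [2 * x for x in v], lambda v: [x + 1 for x in v]]
toy_codebook = [[0.0, 0.0], [1.0, 1.0]]
toy_loss = feature_mapping_loss([0.9, 1.1], toy_codebook, toy_layers)
```

The term vanishes when the latent already coincides with a codebook entry and grows with the quantization error as seen by the decoder, which is consistent with the summary's point that the method adds training overhead only and leaves inference unchanged.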
The experiments are comprehensive, utilizing the LibriSpeech dataset to evaluate reconstruction performance across various codebook sizes and quantization methods. The results demonstrate significant improvements in reconstruction metrics, establishing state-of-the-art performance for low-bitrate speech codecs. The inclusion of subjective evaluations further strengthens the findings, providing a well-rounded assessment of the proposed method's efficacy.
The paper provides sufficient implementation details, including model configurations and training procedures, which facilitate reproducibility. The use of open-source code for the baseline models enhances transparency and allows for independent verification of results.
While the self-guidance mechanism shows promise, the paper acknowledges that it does not completely eliminate quantization artifacts, indicating that some residual distortion may persist. Additionally, the validation is primarily focused on neural speech codecs, and the applicability to other audio domains remains to be explored.
The proposed method has significant implications for improving audio compression technologies, potentially enhancing accessibility in telecommunications and low-bandwidth applications. However, the potential for misuse in generating deceptive audio content must be considered, emphasizing the need for responsible deployment of such technologies.
Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone in motion; over the course of the performance, the drone gains increasing agency, transitioning from reactive to more proactive roles. The work employs four independent loopers in Max/MSP that function as 'virtual' drones. They are populated cyclically in real time as the vocalist improvises, creating an organic and evolving feedback loop between the voice and the virtual drone. This relationship further evolves melodically through pitch shifting of the loops, which introduces a dimension of sudden, explicit movement. It then changes timbrally through the integration of GaMaDHaNi, a singer-conditioned pitch-to-voice generative AI model, to resynthesize looped audio. While current music AI approaches prioritize fidelity and realism of generated content, which has sparked anxiety over job replacement in the music community, this work intentionally uses low-fidelity generative outputs that require human interpretation and situational context in order to be complete. 'The Moving Drone' positions technology and generative AI within established socio-cultural musical practices, proposing a virtual drone as an active, responsive, and co-creative musical agent.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology, Harvard University
The main contribution of this paper is the innovative exploration of agency in music through the integration of generative AI and traditional Hindustani music practices. This work significantly advances the discourse on the role of technology in artistic expression, proposing a model where AI acts as a collaborative partner rather than a mere tool, thus enriching the creative landscape in music.
The methodology employed in "The Moving Drone" is innovative in its integration of traditional Hindustani music with modern technology, specifically through the use of Max/MSP and generative AI. The paper outlines a clear structure for the performance, detailing how the drone's agency is manipulated through three distinct movements. Each movement explores different aspects of musical improvisation and interaction with technology, showcasing a thoughtful approach to blending human creativity with AI-generated outputs. The use of pitch shifting and the GaMaDHaNi model to resynthesize audio adds depth to the methodology, allowing for a dynamic interaction between the vocalist and the virtual drone.
The paper does not provide traditional experimental results or quantitative metrics commonly found in machine learning research. Instead, it focuses on a performance-based evaluation, which is appropriate given the artistic nature of the work. The description of the three movements serves as a qualitative assessment of the system's capabilities. However, the lack of formalized evaluation metrics (e.g., listener studies or comparative analysis) limits the ability to rigorously assess the technical impact of the system.
The paper includes some implementation details, such as the use of Max/MSP and the specific setup for the performance, but it lacks comprehensive documentation that would allow for full reproducibility. There are references to figures and technical sheets that are not provided in the text, which would be necessary for others to replicate the setup and results.
One significant limitation is the reliance on a single performance context, which may not generalize to other musical settings or styles. Additionally, the paper acknowledges that the theoretical framework is still a work in progress, indicating that the full potential of the proposed methods has not yet been realized. The use of low-fidelity generative outputs may also limit the appeal to audiences accustomed to high-fidelity music production.
The work has potential implications for the intersection of AI and music, particularly in how generative models can be integrated into traditional music practices without displacing human musicians. By advocating for a more nuanced understanding of AI's role in music creation, the paper contributes to ongoing discussions about the ethical and cultural ramifications of AI in the arts. It also opens avenues for further exploration of AI as a co-creative partner rather than a replacement for human artists.
Audio and music generation based on flexible multimodal control signals is a widely applicable topic, with the following key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher AudioX-Base is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis, and is then distilled into the few-step student AudioX-Turbo via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFEs) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.
Primary: The Hong Kong University of Science and Technology
All Institutions: The Hong Kong University of Science and Technology, Tsinghua University, Noiz AI, Independent Researcher
The main contribution of this work is the introduction of AudioX-Turbo, a unified framework for efficient anything-to-audio generation that significantly reduces inference costs while maintaining high-quality output across multiple modalities. This work represents a meaningful advancement in the field of audio generation, combining innovative methodologies with practical applications.
The methodology presented in this paper is robust, utilizing a teacher-student paradigm to enhance efficiency in audio generation. The integration of a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module is particularly noteworthy, as it allows for the alignment of diverse multimodal inputs, which is crucial for high-fidelity audio synthesis. The proposed Distribution Matching Distillation method is innovative and effectively reduces the inference cost associated with multi-step diffusion sampling. The two-stage data construction pipeline for creating a large-scale dataset is also a significant contribution, addressing the common issue of data scarcity in multimodal training.
The experiments are comprehensive, benchmarking AudioX-Turbo against state-of-the-art methods across various tasks, including text-to-audio and video-to-audio generation. The results demonstrate that the proposed model achieves superior performance while significantly reducing the number of function evaluations required for inference. The use of both subjective and objective evaluation metrics strengthens the credibility of the findings. However, the paper could benefit from more detailed comparisons with a broader range of existing models to fully contextualize its performance.
The paper provides a clear outline of the implementation details, including architecture specifications, training protocols, and evaluation metrics. The availability of the code and datasets is a positive aspect that enhances reproducibility. However, the paper could improve by including more specific hyperparameter settings and training configurations to facilitate easier replication of results by other researchers.
One limitation is the reliance on a large-scale dataset, which may not be readily available for all researchers. Additionally, while the model shows impressive performance, there may be edge cases or specific scenarios where the model's generalization capabilities could be further tested. The paper does not fully address potential biases in the dataset or the implications of using large-scale models in real-world applications.
The implications of this research are significant, as it opens new avenues for automated audio generation in various fields, including entertainment, gaming, and content creation. The ability to generate high-quality audio from diverse multimodal inputs can enhance user experiences and streamline production processes. Furthermore, the findings may inspire future research in multimodal AI systems and their applications in other domains.
Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and 96% STOI on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.
Primary: Thales SIX GTS
All Institutions: Thales SIX GTS
BASENet introduces a frequency-adapted speech enhancement network that effectively allocates encoder depth based on auditory principles, achieving a strong balance between performance and computational efficiency. This work represents a meaningful contribution to the field of audio processing, particularly in enhancing speech intelligibility in challenging acoustic environments.
The methodology presented in BASENet is innovative, leveraging perceptual principles from the human auditory system to inform the architecture's design. The use of Bark-scale bands for frequency adaptation and the introduction of a cross-band attention mechanism are significant advancements. The architecture's ability to dynamically allocate encoder capacity based on critical-band density is a novel approach that addresses limitations in existing models that apply uniform capacity across the frequency spectrum. The integration of causal processing for real-time applications further enhances its applicability in practical scenarios, such as hearing aids.
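How Bark-scale partitioning could translate into per-band encoder depth is sketched below. The Hz-to-Bark conversion is Traunmüller's published approximation; everything else (equal-width bands in Bark, depth proportional to Bark-per-Hz density, the band count, the maximum depth, and the function names) is an assumption for illustration and should not be read as BASENet's actual allocation rule.

```python
from typing import List


def hz_to_bark(f: float) -> float:
    """Traunmüller's approximation of the Bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53


def bark_to_hz(z: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_band_edges(f_max: float, n_bands: int) -> List[float]:
    """Band edges in Hz for bands of equal width on the Bark axis."""
    z_lo, z_hi = hz_to_bark(0.0), hz_to_bark(f_max)
    step = (z_hi - z_lo) / n_bands
    return [bark_to_hz(z_lo + i * step) for i in range(n_bands + 1)]


def allocate_depths(edges: List[float], max_depth: int) -> List[int]:
    """Hypothetical rule: depth proportional to Bark-per-Hz density, at least 1.

    All bands span the same number of Bark, so density is inversely
    proportional to each band's width in Hz.
    """
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    densities = [1.0 / w for w in widths]
    top = max(densities)
    return [max(1, round(max_depth * d / top)) for d in densities]


# Toy usage: four bands up to 8 kHz (16 kHz sampling rate assumed).
edges = bark_band_edges(f_max=8000.0, n_bands=4)
depths = allocate_depths(edges, max_depth=4)
```

Under this rule the narrow low-frequency bands receive the deepest branches and the wide high-frequency bands the shallowest, mirroring the qualitative behaviour described in the abstract.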
The experiments are well-structured, utilizing the VoiceBank+DEMAND dataset, which is a standard benchmark for speech enhancement tasks. The reported results demonstrate that BASENet achieves competitive performance metrics, specifically a PESQ score of 3.55 with significantly fewer parameters than comparable models. The ablation studies provide valuable insights into the contributions of various components of the architecture, reinforcing the importance of the proposed methods. However, the paper could benefit from additional comparisons with more recent state-of-the-art models to further contextualize its performance.
The paper provides sufficient implementation details, including architecture specifications, training procedures, and hyperparameters, which facilitate reproducibility. However, the absence of a publicly available code repository or demo limits the ability for independent verification of results. Including such resources would significantly enhance the paper's reproducibility.
One limitation is the lack of a comprehensive evaluation against a wider array of state-of-the-art models, particularly those that utilize more advanced techniques such as self-supervised learning or generative models. Additionally, while the model is lightweight, its performance in extremely noisy conditions or with diverse accents is not thoroughly explored, which could affect its generalizability.
The implications of BASENet are significant, particularly in the fields of assistive technology and real-time communication systems. By improving speech enhancement in resource-constrained environments, this work could enhance accessibility for individuals with hearing impairments and improve clarity in voice communication applications. The model's design principles could also inspire future research in audio processing, particularly in leveraging perceptual characteristics for model optimization.
Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate-distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate-distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely-used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: https://avery-xu.github.io/ECC-demo/
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The main contribution of this paper is the introduction of ECC, a novel Entropy-Constrained Codec that significantly improves low-bitrate speech compression through joint optimization of representation learning and probability modeling. The comprehensive benchmarking and innovative methodologies presented in this work mark a significant advancement in the field of neural speech codecs, addressing critical challenges in efficient speech representation and transmission.
The paper presents a comprehensive methodology for neural speech compression through the proposed Entropy-Constrained Codec (ECC). The methodology integrates scalar quantization with a learned entropy model, emphasizing the importance of joint optimization of representation learning and probability modeling. The use of hyperprior-based side information, channel-wise context modeling, and latent residual prediction demonstrates a sophisticated approach to improve the rate-distortion trade-off. The introduction of entropy skip to omit predictable residual symbols is a noteworthy innovation that enhances efficiency without additional signaling. The unified formulation and benchmarking of existing codecs provide a solid foundation for understanding the advancements in the field.
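To make the rate-estimation and entropy-skip ideas concrete, the sketch below assumes a zero-mean Gaussian likelihood over integer-quantized residuals and a fixed scale threshold for skipping. The Gaussian form, the threshold value, and every function name here are illustrative assumptions, not the paper's implementation. The point it shows is that the encoder and decoder can both derive the skip decision from the same predicted scale, so no skip mask needs to be transmitted.

```python
import math

def gaussian_cdf(x, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def symbol_bits(q, sigma):
    """Estimated bits for integer residual q: -log2 of the Gaussian mass on [q-0.5, q+0.5]."""
    p = gaussian_cdf(q + 0.5, sigma) - gaussian_cdf(q - 0.5, sigma)
    return -math.log2(max(p, 1e-9))

def encode_with_skip(latents, means, scales, skip_threshold=0.1):
    """Quantize residuals; skip symbols whose predicted scale is below the threshold.

    skip_threshold is a hypothetical hyperparameter chosen for illustration.
    """
    symbols, bits = [], 0.0
    for y, mu, sigma in zip(latents, means, scales):
        if sigma < skip_threshold:
            symbols.append(None)      # highly predictable: nothing is transmitted
            continue
        q = round(y - mu)             # scalar quantization of the residual
        symbols.append(q)
        bits += symbol_bits(q, sigma)
    return symbols, bits

def decode_with_skip(symbols, means, scales, skip_threshold=0.1):
    """Recompute the same skip decision from the scales, which the decoder also has."""
    decoded = []
    for q, mu, sigma in zip(symbols, means, scales):
        decoded.append(mu if sigma < skip_threshold else mu + q)
    return decoded
```

In a real codec the means and scales would come from the hyperprior and context models rather than being passed in directly, and the bit estimate would drive an arithmetic coder at inference time.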
The experiments conducted are extensive, involving both objective and subjective evaluations across multiple datasets, including LibriTTS and VCTK. The results indicate that ECC significantly outperforms conventional and recent neural codec baselines in terms of BD-rate reductions and perceptual quality metrics. The use of multiple evaluation metrics (ViSQOL, PESQ, STOI, etc.) strengthens the reliability of the findings. Ablation studies further validate the effectiveness of the proposed entropy modeling and architectural choices, showcasing a rigorous experimental design.
The paper provides detailed descriptions of the experimental setup, including datasets, training procedures, and evaluation metrics. However, the lack of a publicly available code repository limits reproducibility. While the methodology is well documented, the absence of a released implementation may pose challenges for other researchers attempting to replicate the results.
Some limitations include the reliance on specific datasets for training and evaluation, which may affect the generalizability of the results to other speech domains or languages. Additionally, the complexity of the ECC model may hinder its deployment in real-time applications due to computational demands. The paper could also benefit from a discussion on the trade-offs between model complexity and performance.
The advancements in neural speech compression presented in this paper have significant implications for low-bitrate communication systems, particularly in mobile and real-time applications. The proposed methods could enhance the quality of speech transmission in constrained environments, benefiting various industries, including telecommunications and streaming services. The focus on entropy-constrained coding could inspire further research into efficient coding strategies across different audio and speech processing tasks.
Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires training an additional student model. However, the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased during training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To handle this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this paper is the introduction of interleaved stacking for SFM distillation, which preserves layer-specific knowledge and enhances training efficiency. This work significantly advances the field of speech processing by providing a novel approach to knowledge distillation that addresses existing limitations and demonstrates strong empirical results.
The paper introduces a novel stacking method termed "interleaved stacking" for the distillation of speech foundation models (SFMs). This approach addresses the limitations of existing stacking methods by maintaining consistent layer positions throughout the training process, which is crucial for preserving layer-specific knowledge in SFMs. The methodology is well-structured, with a clear explanation of how interleaved stacking differs from traditional stacking methods and the rationale behind it. The integration of intermediate-level knowledge distillation losses further enhances the proposed method's effectiveness, demonstrating a thoughtful consideration of the challenges in knowledge distillation.
The experiments are robust, utilizing the SUPERB benchmark to validate the proposed method across various speech processing tasks. The results indicate that interleaved stacking outperforms existing methods significantly, showcasing improvements in performance metrics such as phoneme error rate (PER) and word error rate (WER). The paper also includes a comparative analysis against models trained without stacking, reinforcing the advantages of the proposed approach. However, the paper could benefit from additional details on the experimental setup, such as hyperparameter tuning and the specific configurations used for the training process.
The paper provides a reasonable level of detail regarding the experimental setup, including model architectures, training parameters, and evaluation metrics. However, the lack of a publicly accessible code repository or demo URL limits the reproducibility of the results. Future work should consider making the code available to facilitate validation of the findings by the research community.
One limitation of the study is the potential overfitting to the SUPERB benchmark, which may not fully represent the diversity of real-world speech processing tasks. Additionally, while the proposed method shows significant improvements, it remains to be seen how it performs in more complex scenarios or with different types of speech data. The paper also does not address the computational cost of implementing interleaved stacking compared to traditional methods, which could be a consideration for practical applications.
The proposed method has significant implications for deploying efficient speech processing models in low-resource environments, making it particularly relevant for applications in real-time speech recognition and natural language processing. By improving the training efficiency of SFMs, the research contributes to the broader goal of making advanced machine learning technologies more accessible and practical for various applications.
Audio watermarking aims to embed identifiable information into audio while remaining imperceptible. Existing methods adopt high-fidelity, low-energy designs to preserve perceptual quality, but the resulting watermarks lack robustness under suppression by speech reconstruction models. Improving robustness is challenging due to the inherent robustness-fidelity trade-off in existing designs, where increasing watermark energy improves robustness but reduces fidelity. To address this problem, we propose a feature-aligned watermarking method that aligns the watermark with the original speech feature distribution, allowing higher watermark energy to improve robustness while preserving imperceptibility. We use a pretrained speech codec to generate a pseudo-speech watermark and fuse it into the spectrogram of the input audio, with VAD loss and perceptual losses guiding embedding within voiced regions. Experiments show that our method maintains imperceptibility comparable to existing approaches while substantially improving robustness under both seen and unseen speech reconstruction models.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, Pengcheng Laboratory, Independent Researcher
The main contribution of this paper is the development of a feature-aligned watermarking method that effectively balances robustness and imperceptibility in audio watermarking. This work significantly advances the field of audio processing by addressing the critical challenges posed by speech reconstruction models, providing a robust solution that maintains audio quality and imperceptibility.
The proposed methodology introduces a feature-aligned watermarking approach that leverages a pretrained speech codec to generate a pseudo-speech watermark, which is then embedded into the audio spectrogram. The integration of voice activity detection (VAD) loss and perceptual losses is a significant enhancement, allowing the watermark to be embedded within voiced regions, thus maintaining imperceptibility while improving robustness against various speech reconstruction models. The architecture is well-structured, with clear delineation between the embedder and decoder components, and the use of a feature pyramid for watermark extraction is innovative and well-justified.
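One way to picture a VAD-style loss is as a penalty on watermark energy that lands in unvoiced frames. The sketch below is a hypothetical formulation for illustration only: the function names, the frame representation (a list of per-frame magnitude values), and the choice of mean energy as the penalty are all assumptions, and the paper's actual loss may differ.

```python
def frame_energy(frame):
    """Sum of squared values in one spectrogram frame."""
    return sum(v * v for v in frame)

def vad_loss(watermark_frames, voiced_mask):
    """Mean watermark energy over unvoiced frames (lower means better confinement).

    watermark_frames: list of frames, each a list of watermark magnitudes.
    voiced_mask: list of booleans, True where the frame is judged voiced.
    """
    unvoiced = [
        frame_energy(frame)
        for frame, voiced in zip(watermark_frames, voiced_mask)
        if not voiced
    ]
    return sum(unvoiced) / len(unvoiced) if unvoiced else 0.0
```

Minimizing a term like this pushes the watermark toward voiced regions, where speech content can mask it, without hard-gating the embedding.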
The experiments are comprehensive, utilizing multiple datasets (VCTK, LibriSpeech, LJSpeech) and a variety of speech reconstruction models to assess robustness. The evaluation metrics, including bit-wise accuracy (ACC) and false attribution rate (FAR), are appropriate for the task. The subjective ABX tests and ViSQOL MOS scores provide a solid basis for assessing perceptual quality, demonstrating that the proposed method achieves competitive results compared to existing watermarking techniques. The ablation studies further validate the contributions of specific components of the methodology.
The paper provides sufficient implementation details, including model architecture, training protocols, and loss functions, which facilitate reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility for some researchers.
One limitation is the potential degradation in fidelity when embedding higher energy watermarks, which, while addressed, may still affect certain applications where audio quality is paramount. Additionally, the method's performance under extreme distortions or aggressive compression could be further explored, as indicated by some performance drops in the experiments.
The proposed watermarking technique has significant implications for copyright protection and content attribution in modern audio applications, especially in environments where speech reconstruction is prevalent, such as voice calls and online meetings. The ability to maintain imperceptibility while enhancing robustness could lead to wider adoption of watermarking technologies in commercial applications.
Recent respiratory sound classification (RSC) studies largely rely on CLS-token driven self-attention architectures such as the Audio Spectrogram Transformer (AST). While effective at modeling global context, recent analyses suggest a low-pass filtering behavior that may reduce sensitivity to localized abnormal patterns. In this work, we investigate State Space Models (SSMs) as an alternative backbone for RSC. Using the Distilled Audio State Space model, we analyze intermediate representations through spectral response curves and observe stronger preservation of mid-to-high spatial-frequency components. Based on these observations, we introduce spectral-aware layer regularization using Gaussian convolution applied to selected layers. We further propose Dual-Axis Patch-Mix contrastive learning tailored to SSM-based audio models for robust representation learning. Experiments on the ICBHI benchmark show that our approach achieves a score of 64.48%, outperforming the AST baseline by 5%. Code is available at https://github.com/RSC-Toolkit/Lung-SRAD.
Primary: RSC LAB, MODULABS, Republic of Korea
All Institutions: RSC LAB, MODULABS, Republic of Korea; Department of Electronic Engineering, Wonkwang University, Republic of Korea; AI Convergence Research Institute, Wonkwang University, Republic of Korea
This paper introduces a novel approach to respiratory sound classification using State Space Models, addressing key limitations of existing Transformer-based methods. The technical contributions, including spectral-aware regularization and contrastive learning, are well-founded and demonstrate potential for significant impact in the field of medical audio analysis.
The paper presents a novel approach to respiratory sound classification (RSC) by leveraging State Space Models (SSMs) as an alternative to traditional Transformer architectures. The introduction of spectral-aware layer regularization and Dual-Axis Patch-Mix contrastive learning is well-motivated and addresses specific limitations of existing methods, particularly the low-pass filtering behavior of self-attention mechanisms. The methodology is clearly articulated, with a strong theoretical foundation and empirical validation of the proposed techniques.
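For readers unfamiliar with why a Gaussian convolution acts as a low-pass filter, the sketch below smooths a 1-D sequence of values with a normalized Gaussian kernel. It only illustrates the filtering operation itself. How the paper applies it to selected layers as a regularizer is not described here, and the kernel radius, sigma, and edge handling are assumptions.

```python
import math

def gaussian_kernel(radius, sigma):
    """Normalized 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_smooth(seq, radius=2, sigma=1.0):
    """Convolve seq with a Gaussian kernel, replicating the edge values as padding."""
    kernel = gaussian_kernel(radius, sigma)
    n = len(seq)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(range(-radius, radius + 1), kernel):
            j = min(max(i + k, 0), n - 1)   # clamp index to the valid range
            acc += w * seq[j]
        out.append(acc)
    return out
```

A constant input passes through unchanged, while a rapidly alternating input is strongly attenuated, which is the low-pass behavior the review refers to.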
The experiments are conducted on the ICBHI benchmark, which is a relevant dataset for RSC. The results demonstrate a clear improvement over the baseline Audio Spectrogram Transformer (AST) model, achieving a score of 64.48%. The paper provides a thorough analysis of the performance metrics, including sensitivity and specificity, which are crucial for medical applications. However, the paper could benefit from additional comparisons with more recent models in the field to further contextualize its contributions.
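The ICBHI score is conventionally reported as the mean of sensitivity (correctly classified abnormal cycles) and specificity (correctly classified normal cycles). Assuming that convention applies to the 64.48% figure, a minimal sketch of the computation looks like this. The label strings and function name are illustrative.

```python
def icbhi_score(y_true, y_pred, normal_label="normal"):
    """Return (sensitivity, specificity, score) for multi-class respiratory labels.

    Specificity: fraction of normal cycles predicted as normal.
    Sensitivity: fraction of abnormal cycles predicted with their exact abnormal class.
    """
    normal_total = sum(1 for t in y_true if t == normal_label)
    abnormal_total = len(y_true) - normal_total
    normal_correct = sum(
        1 for t, p in zip(y_true, y_pred) if t == normal_label and p == t
    )
    abnormal_correct = sum(
        1 for t, p in zip(y_true, y_pred) if t != normal_label and p == t
    )
    specificity = normal_correct / normal_total if normal_total else 0.0
    sensitivity = abnormal_correct / abnormal_total if abnormal_total else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2.0
```

Because the two terms are averaged with equal weight, a model that predicts "normal" for everything scores 50%, which is why sensitivity is reported alongside the headline number.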
The authors provide sufficient details regarding the experimental setup, including training parameters and evaluation metrics, which enhances reproducibility. The availability of the code on GitHub is a positive aspect, enabling other researchers to replicate the findings.
One limitation of the study is the reliance on a single dataset (ICBHI), which may affect the generalizability of the results. Additionally, while the proposed methods show improvements, the paper does not explore the potential trade-offs in computational efficiency or model complexity compared to existing architectures.
The proposed method has significant implications for the field of respiratory sound classification, which is critical for diagnosing various respiratory conditions. By improving the sensitivity to abnormal lung sounds, this research could enhance clinical decision-making and patient outcomes in respiratory health.
Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.
Primary: National University of Singapore
All Institutions: National University of Singapore, TikTok
The paper presents a novel framework for reproduced content identification that leverages multimodal large language models, demonstrating significant improvements in efficiency and accuracy for real-time applications in content moderation. The technical contributions, particularly in knowledge distillation and multimodal integration, are poised to impact the field of machine learning and content governance significantly.
The methodology presented in this paper is robust, employing a two-stage training framework that effectively utilizes knowledge distillation to transfer the capabilities of a high-capacity multimodal large language model (MLLM) to a lightweight model suitable for real-time applications. The joint modeling of video, audio, and text signals is a significant advancement, as it addresses the limitations of existing methods that primarily focus on visual encoders. The architecture design of both MatchLM and MatchLite is well thought out, allowing for efficient processing and accurate reproduction identification through a multimodal approach.
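A common way to distill a teacher's pairwise score into a student is to blend a hard-label loss with a soft-target loss against the teacher's probability. The sketch below shows that generic recipe for a binary "reproduced or not" score. The loss form, the weighting parameter `alpha`, and the function names are assumptions for illustration, not the training objective described in the paper.

```python
import math

def binary_cross_entropy(p, target, eps=1e-7):
    """Cross-entropy between predicted probability p and a target in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def distillation_loss(student_p, teacher_p, label, alpha=0.5):
    """Blend of hard-label loss and soft-target loss against the teacher.

    alpha = 1.0 ignores the teacher; alpha = 0.0 trains on teacher scores only.
    """
    hard = binary_cross_entropy(student_p, label)
    soft = binary_cross_entropy(student_p, teacher_p)
    return alpha * hard + (1.0 - alpha) * soft
```

The soft term lets the student learn from the teacher's graded confidence on borderline pairs, which a 0/1 label alone does not convey.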
The experimental evaluation is thorough, with extensive offline experiments and online A/B testing on a large-scale platform. The results demonstrate significant improvements in F1 scores and computational efficiency, validating the effectiveness of the proposed system in a real-world setting. The paper provides detailed comparisons with baseline models and ablation studies that highlight the contributions of different components, enhancing the credibility of the findings.
While the paper provides a comprehensive description of the models and training processes, it lacks specific URLs or repositories for code and data, which could hinder reproducibility. The absence of publicly available benchmarks or datasets also limits the ability of other researchers to replicate the results independently.
One limitation is the reliance on proprietary data and the lack of a publicly available dataset for reproduced content identification, which restricts broader validation of the proposed methods. Additionally, the paper does not address potential biases in the training data or the implications of deploying such a system at scale, which could affect fairness and accuracy in content moderation.
The proposed MatchLM2Lite framework has significant implications for content moderation in online platforms, potentially improving user experiences by reducing the prevalence of reproduced content. The integration of multimodal signals could enhance the understanding of content authenticity, benefiting creators and users alike. However, the ethical considerations surrounding automated content moderation and the potential for overreach in content filtering must be carefully managed.
Language models (LMs) have become one of the most prominent paradigms in modern generative modeling. While making them faster has been the main focus of real-time deployment, speed alone is not enough. Many real-world applications, such as synchronized translation and voice synthesis, also require precise alignment between generation and external signals, both in terms of generation content and timing. We refer to this problem as "frame-synchronous streaming inference". To address it, we present StreamMUSE, an inference system that performs LM generation in response to an external signal stream within a client-server architecture. The client continuously sends high-frequency inference requests based on the most recent inputs and receives outputs synchronized to the external clock, while the server executes model inference. We demonstrate the framework through a live music accompaniment task, showing how real-time synchronization can be achieved across different deployment environments with varying round-trip latencies. We further model the relationship between system hyperparameters and round-trip latency, and evaluate how different environments affect optimal configurations to achieve real-time performance. Experimental results show a consistent correspondence between system real-time performance and music quality, demonstrating the effectiveness of the proposed framework. The project is open source. Relevant code and the latest updates are available at https://stream-muse-webpage.vercel.app/#audio-library.
Primary: Mohamed bin Zayed University of Artificial Intelligence
All Institutions: University of Science and Technology of China, Mohamed bin Zayed University of Artificial Intelligence, University of California, San Diego, Wuhan University, New York University
The main contribution of this paper is the introduction of a novel real-time language model inference system tailored for live music accompaniment generation, demonstrating the feasibility of frame-synchronous streaming inference in a client-server architecture. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the potential for impactful advancements in interactive music systems.
The paper presents a novel approach to real-time language model inference in the context of live music accompaniment generation. The authors introduce a client-server architecture that allows for frame-synchronous streaming inference, which is essential for maintaining musical coherence and timing. The methodology is well-structured, focusing on the interplay between inference intervals and generation lengths to optimize responsiveness and quality. The use of a tick-based system for temporal resolution is particularly innovative, allowing for precise synchronization with musical elements. The mathematical modeling of round-trip latency and its impact on system performance is a significant contribution, providing a theoretical foundation for the practical implementation of their system.
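The interplay between inference interval, generation length, and round-trip latency can be reasoned about with simple tick arithmetic. The sketch below uses a simplified model that is an assumption, not the paper's formulation: a request issued at tick t returns output for ticks t onward, the response arrives after the round-trip latency, and anything covering ticks that have already elapsed is stale. All function names are hypothetical.

```python
import math

def latency_in_ticks(round_trip_s, tick_s):
    """Number of whole ticks that elapse while a request is in flight."""
    return math.ceil(round_trip_s / tick_s)

def usable_ticks(generation_len, round_trip_s, tick_s):
    """Ticks of a response that are still in the future when it arrives."""
    return max(0, generation_len - latency_in_ticks(round_trip_s, tick_s))

def min_generation_length(round_trip_s, tick_s, interval_ticks):
    """Shortest generation length that leaves no gap before the next response.

    Each response must stay usable for at least one inference interval.
    """
    return latency_in_ticks(round_trip_s, tick_s) + interval_ticks
```

Under this model, a higher-latency environment needs either longer generations or less frequent requests to stay gap-free, which matches the review's observation that optimal configurations depend on the deployment environment.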
The experimental setup is robust, with evaluations conducted across three different environments (local, local-server, and remote-server) to assess the system's performance under varying conditions. The metrics used for evaluation, including Interaction Success Rate (ISR), Staleness, and various music quality metrics (JSD, FMD, CR, UR), are comprehensive and relevant. The results demonstrate a clear correlation between system responsiveness and music quality, validating the proposed framework. However, while the experiments are thorough, additional comparisons with existing state-of-the-art systems could strengthen the claims of superiority.
The paper provides sufficient detail regarding the implementation, including the architecture, training details, and evaluation metrics. The open-source nature of the project, with a dedicated URL for accessing the code and updates, enhances reproducibility. However, the paper could benefit from more explicit instructions or a README file in the repository to facilitate easier replication of the experiments.
One limitation is the reliance on specific datasets (e.g., POP909) for training and evaluation, which may affect the generalizability of the results. Additionally, the system's performance may vary significantly with different musical genres or styles, which is not thoroughly explored in the paper. The impact of network conditions on real-time performance could also be more extensively analyzed, particularly in real-world scenarios.
The proposed system has the potential to revolutionize live music performance by enabling real-time accompaniment generation that is both musically coherent and responsive to live input. This could have significant implications for musicians, educators, and the entertainment industry, enhancing collaborative performances and interactive music experiences. The framework also opens avenues for further research in real-time generative models across various domains beyond music.
Zero-shot text-to-speech (TTS) relies on robust speech representations. However, current speech tokenizers face a fundamental trade-off: acoustic codecs preserve high-fidelity audio but lack linguistic constraints, causing content errors during generation, whereas semantic tokens from self-supervised learning (SSL) models ensure precise text alignment but discard some acoustic information. To bridge this gap, we propose SARA, a dual-stream VAE that directly fuses a frozen SSL semantic anchor with a dedicated residual acoustic encoder. This effectively mitigates the dilemma, creating an efficient and compact latent space without relying on complex regularizers. SARA achieves superior reconstruction quality over strong baselines. Furthermore, in downstream zero-shot TTS tasks, it yields highly natural and expressive synthesis quality, and maintains robust generation performance even under accelerated inference, offering a favorable trade-off between synthesis speed and computational cost.
Primary: Xiamen University
All Institutions: Xiamen University, DiDi Global Inc.
The main contribution of this paper is the introduction of SARA, a dual-stream VAE that effectively integrates semantic and acoustic representations for improved zero-shot TTS performance. This innovative approach addresses existing challenges in speech synthesis, offering a promising direction for future research in high-fidelity speech generation.
The paper introduces SARA, a dual-stream variational autoencoder (VAE) that integrates semantic and acoustic representations to address the trade-off between reconstruction fidelity and generative controllability in zero-shot text-to-speech (TTS) systems. The methodology is well-structured, leveraging a frozen self-supervised learning (SSL) model for semantic encoding and a residual acoustic encoder for capturing detailed acoustic features. This architectural innovation allows for efficient integration of both streams without the need for complex regularization, which is a notable improvement over existing methods. The use of adversarial training to enhance perceptual quality further strengthens the approach.
The experimental evaluation is robust, utilizing extensive datasets such as LibriTTS and LibriHeavy for training and testing. The authors provide a comprehensive comparison against strong baselines, demonstrating SARA's superior performance in terms of reconstruction fidelity and downstream TTS tasks. The metrics used, including PESQ, STOI, WER, and subjective evaluations like CMOS and SMOS, are appropriate for assessing both objective and perceptual quality. The results indicate significant improvements in content accuracy and speaker similarity, validating the effectiveness of the proposed framework.
The paper provides detailed implementation specifics, including training configurations, dataset descriptions, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. The authors could improve reproducibility by sharing their code and trained models.
One limitation is the reliance on a frozen SSL model, which may limit adaptability to new or diverse datasets. Additionally, while the dual-stream architecture effectively mitigates the semantic-acoustic trade-off, it introduces additional complexity that may not be necessary for all applications. The paper also does not explore the scalability of the model in multilingual settings, which could be a significant area for future research.
The advancements presented in SARA have the potential to significantly enhance the quality of TTS systems, making them more applicable in various domains such as virtual assistants, audiobooks, and accessibility technologies. The ability to generate high-fidelity speech with accurate content representation could improve user experiences and broaden the reach of TTS applications.
Precise note-level annotations are critical for training automatic music transcription (AMT) systems, in particular note-onset labels, which form a core component of many recent AMT systems. However, high-quality annotations for real-world recordings are scarce. Sequence-level score--audio alignment methods such as dynamic time warping provide only coarse correspondence, making a local refinement step necessary. This refinement step, known as snapping, adjusts aligned score onsets using peaks in a neural onset posteriorgram and often determines whether weakly aligned score--audio pairs become usable training data at all. Despite its practical importance, snapping is typically treated as a simple post-processing heuristic and implemented with greedy local decisions. We present a systematic analysis of snapping strategies for training instrument-agnostic transcribers, demonstrating that snapping is essential for learning from weakly aligned data. Building on this, we formulate snapping as a per-pitch assignment problem and solve it via bipartite graph matching, yielding context-aware onset decisions under overlapping refinement windows and uncertain initial alignments. Extensive cross-dataset experiments across piano, chamber, and orchestral recordings show improved onset alignment and transcription accuracy over greedy snapping, with gains increasing for wider snapping windows and coarser initial alignments. Qualitative examples are provided on our project page: https://abhirupsaha8.github.io
Primary: International Audio Laboratories Erlangen
All Institutions: International Audio Laboratories Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Fraunhofer Institute for Integrated Circuits IIS
The main contribution of this paper is the introduction of a graph-based approach to snapping in automatic music transcription, which improves alignment accuracy and transcription performance. This work represents a significant step forward in addressing the challenges of note-onset detection in complex musical recordings, providing a robust framework that can be applied to various instruments and ensemble types.
The paper presents a novel approach to refining note-onset alignments in automatic music transcription by framing the snapping process as a bipartite graph matching problem. This method effectively addresses the limitations of traditional greedy approaches by ensuring global consistency and robustness against overlapping refinement windows. The systematic analysis of snapping strategies and the formulation of the problem demonstrate a clear advancement in the methodology of AMT.
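The difference between greedy snapping and assignment-based snapping can be illustrated for a single pitch. The following is a minimal sketch, not the authors' implementation: the cost (absolute time distance), the penalty for leaving an onset unmatched (set equal to the window width), and all function names are assumptions, and a brute-force search over assignments stands in for a proper bipartite matching solver.

```python
from itertools import permutations

def greedy_snap(score_onsets, peaks, window):
    """Snap each score onset, in order, to its nearest unused peak in the window."""
    used, result = set(), []
    for onset in score_onsets:
        candidates = [(abs(p - onset), j) for j, p in enumerate(peaks)
                      if j not in used and abs(p - onset) <= window]
        if candidates:
            _, j = min(candidates)
            used.add(j)
            result.append(peaks[j])
        else:
            result.append(onset)  # no free peak in the window: keep the aligned onset
    return result

def matching_snap(score_onsets, peaks, window):
    """Pick the one-to-one onset-to-peak assignment with minimum total cost.

    Brute force for illustration only; an unmatched onset costs `window`
    (an assumed penalty) and keeps its original time.
    """
    n = len(score_onsets)
    slots = list(range(len(peaks))) + [None] * n  # None means "leave unmatched"
    best_cost, best = float("inf"), None
    for assign in permutations(slots, n):
        cost = 0.0
        for onset, j in zip(score_onsets, assign):
            if j is None:
                cost += window
            elif abs(peaks[j] - onset) <= window:
                cost += abs(peaks[j] - onset)
            else:
                cost = float("inf")  # peak outside the refinement window
                break
        if cost < best_cost:
            best_cost, best = cost, assign
    return [o if j is None else peaks[j] for o, j in zip(score_onsets, best)]

onsets, peaks = [1.00, 1.10], [0.90, 1.04]
print(greedy_snap(onsets, peaks, 0.15))    # [1.04, 1.1]
print(matching_snap(onsets, peaks, 0.15))  # [0.9, 1.04]
```

In this toy case the greedy pass lets the first onset take the peak at 1.04, leaving the second onset with no peak inside its window, whereas the joint assignment snaps both onsets. This is the kind of overlapping-window conflict the matching formulation is meant to resolve.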
The experiments are comprehensive, utilizing multiple datasets (MusicNet, MAESTRO, URMP, etc.) to validate the proposed method across different musical contexts. The results indicate significant improvements in transcription accuracy and onset alignment, especially under conditions of coarse initial alignments. The evaluation metrics are well-defined, focusing on note-level precision, recall, and F1 scores, which are appropriate for the task.
The paper provides sufficient detail on the methodology and experimental setup, allowing for reproducibility. However, specific implementation details, such as the exact configurations of the bipartite matching algorithms used, could be more thoroughly documented to enhance reproducibility further.
While the proposed method shows promise, it may still struggle with highly complex orchestral pieces where overlapping notes are more frequent. Additionally, the reliance on neural onset posteriorgrams could introduce variability based on the quality of the underlying models used for this task.
The advancements in this paper could significantly enhance the field of music information retrieval, particularly in automatic music transcription, by enabling more accurate and reliable systems. This could have implications for music education, music analysis, and the development of music-related applications that rely on precise note recognition.
Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.
Primary: NTT, Inc., Japan
All Institutions: NTT, Inc., Japan
The main contribution of this paper is the introduction of a novel method for generating tight boundary predictions in speaker diarization using causal-anticausal consistency, which significantly improves the performance of models trained on loosely annotated data. The comprehensive analysis of the methodology and experimental results underscores its potential to advance the field of audio processing and speaker diarization.
The proposed methodology effectively addresses the challenge of producing tight predictions from loose annotations in speaker diarization. By leveraging causal and anticausal models, the authors ingeniously isolate the functionalities of detecting speech segments and managing boundary conditions. The co-training scheme is a notable innovation that facilitates progressive refinement of labels, enhancing the model's ability to produce tighter outputs. The approach is well-structured and justified, with a clear rationale for each step in the methodology.
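One plausible reading of the causal-anticausal idea can be sketched on frame-level activity. This is an illustrative assumption rather than the paper's stated rule: a causal model cannot anticipate a speech onset, so it cannot reproduce a leading margin, while an anticausal model cannot reproduce a trailing margin; intersecting the two predictions would then give a tighter pseudo label. The function name and the toy sequences are hypothetical.

```python
def tighten(causal_pred, anticausal_pred):
    """Frame-wise AND of two binary speech-activity sequences (assumed combination rule)."""
    return [int(a and b) for a, b in zip(causal_pred, anticausal_pred)]

loose_label = [0, 1, 1, 1, 1, 1, 1, 0]  # annotation with leading and trailing margins
causal      = [0, 0, 1, 1, 1, 1, 1, 0]  # tight onset, loose offset
anticausal  = [0, 1, 1, 1, 1, 0, 0, 0]  # loose onset, tight offset

pseudo_label = tighten(causal, anticausal)
print(pseudo_label)  # [0, 0, 1, 1, 1, 0, 0, 0]
```

In a co-training loop of the kind the review describes, such a pseudo label would replace the loose label for the next round of training both models; the actual update schedule is not specified here.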
The experiments conducted are thorough and demonstrate the effectiveness of the proposed methods across various datasets. The results indicate a significant reduction in diarization error rates (DER) when using the proposed tightening methods compared to baseline models. The inclusion of multiple tightening strategies (basic, VAD, SC) and their comparative analysis adds depth to the evaluation. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of each component.
The paper provides a detailed description of the experimental setup, including model architectures, training procedures, and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing the implementation to facilitate validation and further research by the community.
The primary limitation identified is the reliance on some ideal tight labels for validation, which may not be feasible in all scenarios. Additionally, the model's performance might be constrained by the size and diversity of the datasets used, and the potential for over-tightening could negatively impact downstream applications. The authors acknowledge these limitations and suggest that future work should explore larger datasets without tight labels.
This research has significant implications for real-world applications in speaker diarization, particularly in scenarios where precise speaker segmentation is critical, such as in automated transcription services and conversational AI systems. By improving the accuracy of diarization models, this work can enhance the quality of multi-speaker audio processing, leading to better user experiences in various applications.
Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
Primary: University of Surrey
All Institutions: Hong Kong University of Science and Technology, Tencent, University of Surrey, Chinese University of Hong Kong, Hong Kong Baptist University, Hong Kong Polytechnic University, Independent Researcher
The paper makes a significant contribution by systematically investigating how speech representation design impacts text-native reasoning in LLMs, introducing innovative methodologies that enhance cross-modal understanding. The comprehensive analysis of frame rates and representation alignment provides valuable insights for future research and applications in multimodal AI systems.
The paper presents a novel approach to speech representation design by systematically exploring the impact of frame rates and representation alignment on the reasoning capabilities of frozen LLMs. The introduction of factorized finite scalar quantization (FSQ) and a lightweight non-autoregressive audio LM head is particularly innovative, addressing the information bottleneck at low frame rates. The methodology is well-structured, focusing on controlled experiments that isolate the effects of speech tokenization on performance metrics. The use of contrastive learning for representation alignment across intermediate LLM layers is a significant methodological advancement that enhances the model's ability to bridge the modality gap between speech and text.
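The information bottleneck at low frame rates follows from simple arithmetic: at a fixed information rate, the bits each frame must carry grow inversely with the frame rate. The sketch below works this through. The 600 bits/s budget and the FSQ level configuration are hypothetical values chosen for illustration; only the frame rates (50, 4.17, and 2.08 Hz) come from the abstract.

```python
import math

def bits_per_frame(info_rate_bps, frame_rate_hz):
    """Bits each frame must carry to hold the information rate constant."""
    return info_rate_bps / frame_rate_hz

def fsq_bits(levels_per_dim):
    """Capacity of a finite scalar quantizer: sum of log2(levels) over its dimensions."""
    return sum(math.log2(levels) for levels in levels_per_dim)

INFO_RATE_BPS = 600  # hypothetical budget, not taken from the paper

for rate_hz in (50, 4.17, 2.08):
    print(f"{rate_hz:>5} Hz -> {bits_per_frame(INFO_RATE_BPS, rate_hz):6.1f} bits/frame")

# A small FSQ with four 8-level dimensions holds 12 bits, enough for 50 Hz under this budget.
print(fsq_bits([8, 8, 8, 8]))
```

Under this assumed budget, the lowest frame rate needs a few hundred bits per frame, which is far beyond a single small codebook and is consistent in magnitude with the "nearly 300 bits/frame" capacity the abstract reports.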
The experiments are comprehensive, utilizing a well-defined dataset (LibriSpeech) and a progressive training pipeline that includes multiple stages (ASR, TTS, and speech QA). The results demonstrate a clear understanding of the relationship between frame rate and model performance, with empirical findings supporting the proposed hypotheses. The U-shaped performance trends observed in ASR and TTS tasks provide valuable insights into the optimal operational regimes for speech QA. However, the reliance on a single dataset may limit the generalizability of the findings.
The paper provides sufficient details regarding the architecture, training procedures, and evaluation metrics, which should facilitate reproducibility. However, the lack of publicly available code or data may hinder independent verification of results. The authors acknowledge limitations in generalizing findings beyond the specific datasets used, which is a critical consideration for reproducibility in broader contexts.
The study is limited by its focus on English read speech, which may not generalize to conversational or noisy speech scenarios. The frozen LLM approach may impose a performance ceiling, and the lack of acoustic modeling could restrict the model's applicability. Additionally, the comparison with other methods is constrained by the unavailability of baseline training data and code, making it difficult to assess relative performance comprehensively.
This research has significant implications for the development of multimodal dialogue systems, particularly those that integrate speech and text processing. The findings could inform future designs of speech tokenizers and LLMs, enhancing their reasoning capabilities in real-world applications. The insights regarding frame rate and representation alignment could lead to more efficient and effective speech processing systems, potentially benefiting various industries, including customer service, education, and accessibility technologies.
Accurate and robust multimodal speaker identification is essential for multimedia understanding and biometric authentication. However, real-world polyglot scenarios pose two key challenges: speaker-discriminative representations should generalize across languages, and the model should remain reliable when face information is unavailable. To address these challenges, we propose MRAF, a Missing-Token Prompted Reliability-Aware Fusion framework for polyglot speaker identification across complete-modality, missing-face, and cross-lingual scenarios. MRAF represents unavailable face inputs with a learnable missing token instead of fixed zero-valued features, providing a trainable representation of the missing visual state. This design reduces the distribution gap caused by missing inputs and allows subsequent reliability estimation and cross-modal fusion to operate within a unified token space. To adaptively integrate modalities with different reliability, MRAF further introduces a reliability-aware cross-attention fusion module, which estimates face and audio reliability scores, normalizes them into modality weights, and applies these weights to token representations before bidirectional cross-attention. In this way, the model can emphasize reliable modality cues while suppressing unreliable ones. During training, MRAF jointly optimizes multi-branch classification losses, audio-only knowledge distillation, and center loss to improve speaker discrimination and missing-modality robustness. Experiments on the official POLY-SIM 2026 test set demonstrate the effectiveness of the proposed framework. In the final evaluation, MRAF achieves 100% accuracy on P3 and P5, and obtains competitive results on the more challenging missing-face settings P4 and P6. The source code will be released at https://github.com/MSA-LMC/MRAF.
Primary: Hefei University of Technology
All Institutions: Hefei University of Technology, Intelligent Interconnected Systems Laboratory of Anhui Province
The paper presents MRAF, a framework that innovatively addresses the challenges of polyglot speaker identification under incomplete modality conditions. Its contributions lie in the introduction of a learnable missing token and a reliability-aware fusion mechanism, which collectively enhance the model's robustness and accuracy in real-world applications.
The proposed MRAF framework introduces a novel approach to handling missing modalities in polyglot speaker identification by employing a learnable missing token, which enhances the model's ability to generalize across different conditions. The reliability-aware cross-attention fusion module is a significant innovation, allowing the model to dynamically adjust the contribution of each modality based on their estimated reliability. This is a sophisticated method that improves robustness and performance in challenging scenarios, particularly when one modality is missing.
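The weighting step that precedes cross-attention can be sketched in a few lines. This is a minimal illustration under stated assumptions: the abstract says reliability scores are normalized into modality weights but does not name the normalization, so softmax is assumed here; the missing token is shown as a fixed vector although in the framework it is learnable; all names are hypothetical and the cross-attention itself is omitted.

```python
import math

def modality_weights(face_score, audio_score):
    """Normalize two reliability scores into weights that sum to 1 (softmax assumed)."""
    m = max(face_score, audio_score)  # subtract the max for numerical stability
    ef, ea = math.exp(face_score - m), math.exp(audio_score - m)
    return ef / (ef + ea), ea / (ef + ea)

def prepare_tokens(face_tokens, audio_tokens, missing_token, face_score, audio_score):
    """Substitute the missing token if the face is absent, then scale tokens by weight."""
    if face_tokens is None:
        face_tokens = [missing_token]  # stand-in for the learnable missing token
    wf, wa = modality_weights(face_score, audio_score)
    scaled_face = [[wf * x for x in tok] for tok in face_tokens]
    scaled_audio = [[wa * x for x in tok] for tok in audio_tokens]
    return scaled_face, scaled_audio

# Missing-face case with a low face reliability score: audio dominates the fusion input.
face, audio = prepare_tokens(None, [[1.0, 2.0]], [0.5, 0.5], face_score=-2.0, audio_score=2.0)
print(face, audio)
```

Representing the missing face as a token rather than zeros keeps both branches in the same token space, so the same scoring and weighting path applies whether or not the face is present.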
The experiments conducted on the POLY-SIM 2026 test set demonstrate the effectiveness of MRAF, achieving impressive accuracy metrics, particularly in complete-modality settings. The results are well-presented, with clear comparisons to baseline methods and ablation studies that validate the contributions of different components of the model. The use of a diverse dataset with real-world variations adds credibility to the findings.
The paper provides sufficient details regarding the experimental setup, including model architecture, training parameters, and evaluation protocols. However, the lack of a demo or interactive implementation limits the ease of reproducibility for external researchers. The authors do mention that the source code will be available, which is a positive aspect for future validation.
While the model shows strong performance, it may struggle in scenarios with significant noise or low-quality inputs, as noted in the limitations section. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other contexts or languages not represented in the training data.
The advancements in multimodal speaker identification have significant implications for applications in biometric authentication, multimedia retrieval, and human-computer interaction. The ability to effectively handle missing modalities could enhance the robustness of systems in real-world applications, making them more reliable in diverse environments.
We present a quality-adaptive angular-margin learning framework that improves feature generalization by enforcing intra-class compactness and inter-class separability. Our framework, titled QLung, introduces a no-reference audio quality margin derived from spectral entropy and root-mean-square energy, which adaptively scales angular margins based on recording quality. To this end, we propose a log-scaled angular margin that stabilizes training under severe class imbalance. We also use an angular classifier that normalizes features and class weights, ensuring margin penalties are applied consistently on the unit hypersphere. Our approach improves in-distribution performance on the ICBHI dataset by 2.46% over the cross-entropy baseline, and most significantly, achieves the strongest out-of-distribution performance on the SPRSound dataset compared to prior state-of-the-art methods. Code is available at https://github.com/RSC-Toolkit/QLung.
Primary: RSC LAB, MODULABS, Republic of Korea
All Institutions: RSC LAB, MODULABS, Republic of Korea, Department of Electronic Engineering, Wonkwang University, Republic of Korea, AI Convergence Research Institute, Wonkwang University, Republic of Korea
The paper presents QLung, a novel framework for respiratory sound classification that addresses quality variability and class imbalance through innovative angular-margin learning techniques. The comprehensive methodology and rigorous experimental validation position this work as a meaningful contribution to the field of machine learning in audio processing.
The proposed QLung framework introduces a quality-adaptive angular-margin learning approach that effectively addresses the challenges of low-quality recordings and class imbalance in respiratory sound classification. The dual-factor angular margin formulation is innovative, combining a no-reference audio quality margin with a log-scaled class imbalance margin, which enhances the model's ability to learn discriminative features. The use of an angular classifier that normalizes features and class weights is a significant methodological contribution, ensuring that the model focuses on angular similarity rather than feature magnitudes. This is particularly relevant in the context of respiratory sounds, where subtle differences are crucial for accurate classification.
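A dual-factor angular margin of this kind can be sketched as follows. The exact formulas are not given in this summary, so everything here is an assumption for illustration: the quality score is taken as a number in [0, 1] (in the paper it is derived from spectral entropy and RMS energy), the imbalance term is a generic log-scaled ratio, the margin is applied additively to the target-class angle, and the base margin and scale are hypothetical defaults.

```python
import math

def adaptive_margin(quality, class_count, max_count, base=0.2):
    """Hypothetical margin: scaled by recording quality and a log-scaled imbalance term."""
    imbalance = math.log(1.0 + max_count / class_count)  # larger for rarer classes
    return base * quality * imbalance

def target_logit(cos_theta, margin, scale=30.0):
    """Additive angular margin on the target class: scale * cos(theta + margin)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding error
    return scale * math.cos(theta + margin)

# cos_theta is the cosine between the L2-normalized feature and class weight.
m_rare = adaptive_margin(quality=0.9, class_count=50, max_count=1000)
m_common = adaptive_margin(quality=0.9, class_count=1000, max_count=1000)
print(m_rare, m_common)
print(target_logit(0.7, m_rare), target_logit(0.7, 0.0))
```

The log scaling keeps the margin for very rare classes from growing linearly with the imbalance ratio, which is one way such a term could stabilize training under severe imbalance.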
The experiments are well-structured, utilizing the ICBHI and SPRSound datasets, which are standard benchmarks in the field. The reported improvements of 2.46% over the baseline and superior out-of-distribution performance demonstrate the effectiveness of the proposed method. The ablation studies provide insights into the contributions of each component of the QLung framework, reinforcing the robustness of the findings. However, additional comparisons with more recent methods could further validate the claims.
The paper provides sufficient implementation details, including the architecture, training parameters, and data preprocessing steps, which facilitate reproducibility. The availability of the code on GitHub enhances transparency and allows other researchers to replicate the results.
While the approach shows promising results, the reliance on the quality of the audio recordings remains a potential limitation. The method may not generalize well to datasets with significantly different characteristics or noise profiles. Additionally, the performance metrics could benefit from further exploration of other evaluation criteria beyond specificity and sensitivity.
The QLung framework has significant implications for clinical applications, particularly in the diagnosis of respiratory diseases where accurate classification of lung sounds is critical. By improving model robustness against low-quality recordings and class imbalance, this research could enhance the reliability of automated diagnostic tools in healthcare settings. Moreover, the methodologies developed could inspire further research in other audio classification tasks facing similar challenges.
Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.
Primary: The University of Texas at San Antonio
All Institutions: The University of Texas at San Antonio, Texas A&M University
The main contribution of this paper is the introduction of NeurMLLM, a multimodal generative framework that effectively integrates acoustic features, transcripts, and demographic context for the fine-grained staging of neurodegenerative diseases. This innovative approach not only outperforms existing methods but also highlights the potential of multimodal LLMs in clinical applications, paving the way for future advancements in the field.
The methodology presented in this paper is innovative, particularly in its integration of multimodal data (acoustic features, transcripts, and demographic information) within a unified framework. The use of vision transformers for encoding audio data and the instruction-tuning of a large language model (LLM) through Low-Rank Adaptation (LoRA) is a notable advancement. The generative classification approach, which reformulates the task as constrained label-token generation, is a significant departure from traditional classification methods, allowing for better alignment of multimodal evidence with clinical stages.
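Constrained label-token generation can be illustrated independently of any particular LLM. The sketch below assumes the model has already produced next-token logits over its vocabulary; it then restricts the distribution to the token ids of the allowed labels. The label names, token ids, and logit values are hypothetical.

```python
import math

def constrained_label(logits, label_token_ids):
    """Renormalize next-token logits over the allowed label tokens only.

    logits: list of floats indexed by token id.
    label_token_ids: dict mapping label name -> token id.
    Returns (predicted_label, probabilities over labels).
    """
    m = max(logits[t] for t in label_token_ids.values())
    exps = {name: math.exp(logits[t] - m) for name, t in label_token_ids.items()}
    z = sum(exps.values())
    probs = {name: e / z for name, e in exps.items()}
    return max(probs, key=probs.get), probs

# Token 0 has the highest raw logit but is not a valid label, so it is ignored.
vocab_logits = [5.0, 0.1, 2.0, 1.5]
labels = {"stage_a": 2, "stage_b": 3}  # hypothetical stage labels and token ids
print(constrained_label(vocab_logits, labels))
```

Because the probabilities are defined over labels only, they can also feed threshold-free metrics such as macro-AUROC, which the review lists among the evaluation metrics.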
The experiments are comprehensive and well-structured, utilizing the Bridge2AI-Voice dataset for evaluating the proposed NeurMLLM framework. The results demonstrate a clear performance advantage over classical machine learning methods and existing LLM-based approaches, showcasing the effectiveness of the proposed multimodal architecture. The evaluation metrics, including macro-AUROC, accuracy, macro-F1, and macro-recall, are appropriate for the task and provide a robust assessment of model performance.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation protocols, which would allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the practical reproducibility of the results.
The study acknowledges limitations such as the small cohort size, which may introduce performance variance and necessitates validation on larger datasets. Additionally, while the constrained label-token generation approach shows promise, the underlying mechanisms contributing to its effectiveness require further exploration. The reliance on specific LLM backbones may also limit the generalizability of the findings.
The proposed framework has significant implications for the field of neurodegenerative disease screening, offering a scalable and non-invasive method for early detection. By leveraging voice-based biomarkers, this approach could enhance accessibility to diagnostic tools and improve patient outcomes. The integration of multimodal data also opens avenues for further research into personalized medicine and targeted interventions.
Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spatial trajectories, drawing on principles of sensitivity analysis in parametric sound synthesis. Using controlled FOA scenes with increasing scene complexity, we define three desiderata for metric behavior: Responsiveness, Smoothness, and Symmetry. We assess standard distribution-based and sample-based metrics, including Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps. Our findings show that FAD using localization-specific embeddings and acoustic maps yield high Responsiveness and robust Smoothness and Symmetry across conditions, while intensity vectors degrade with increasing scene complexity. This is the first step towards investigating the sensitivity of metrics for generative spatial audio.
Primary: New York University
All Institutions: New York University, Sony Group Corporation
This paper presents a pioneering framework for evaluating generative spatial audio metrics, addressing a critical gap in the understanding of metric sensitivity. The comprehensive methodology and experimental design contribute valuable insights into the performance of various metrics, paving the way for future advancements in the field.
The paper introduces a novel framework for sensitivity analysis of generative spatial audio metrics, focusing on three key desiderata: Responsiveness, Smoothness, and Symmetry. The methodology is well-structured, employing a systematic approach to evaluate the performance of various metrics under controlled spatial parameter changes. The use of a custom dataset with increasing scene complexity and the definition of clear metrics for evaluation are commendable. However, the reliance on synthetic data may limit the generalizability of the findings.
The experiments are comprehensive, involving a large dataset of 68,400 samples across various conditions, including clean and noisy environments. The evaluation of multiple metrics provides a robust comparison, and the results are clearly presented. The analysis of how metrics respond to different complexities and noise conditions is insightful, although the paper could benefit from more detailed statistical analysis to support the claims made.
The paper includes a GitHub repository for the project, which is a positive aspect for reproducibility. However, the specifics of the implementation details, such as the exact configurations used for the experiments and the datasets, could be more thoroughly documented to enhance reproducibility.
One significant limitation is the focus on artificially synthesized FOA data, which may not fully capture the complexities of real-world audio scenarios. Additionally, the study is limited to a small set of metrics, and future work is needed to expand the framework to include a broader range of evaluation metrics and real-world data.
The findings of this study have the potential to significantly impact the field of spatial audio generation by providing a clearer understanding of how different metrics behave under varying conditions. This could lead to improved evaluation standards and methodologies in the development of generative audio models, ultimately enhancing the quality of immersive audio experiences.
This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. A finer-grained mixed-sparsity variant, which varies the number of parameter clusters per layer, is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, National Research Council Canada
The paper presents a novel data-free and training-free compression method for speech foundation models through parameter clustering. This approach significantly enhances the efficiency of ASR systems while maintaining competitive performance, marking a meaningful advancement in the field of machine learning and speech technology.
The proposed methodology introduces a novel parameter clustering technique that diverges from traditional pruning methods, focusing on data-free and training-free compression. The use of k-means clustering to group similar parameters is innovative, particularly in the context of speech foundation models. The mixed sparsity approach, which assigns varying numbers of clusters based on layer-wise variance, adds an additional layer of sophistication that enhances the model's adaptability and performance. However, the paper could benefit from a more detailed explanation of the clustering process and its implications on model interpretability.
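Since the review notes that the clustering process is not described in detail, the following is only a generic sketch of channelwise parameter clustering, not the paper's procedure. It assumes each output channel (one row of a weight matrix) is clustered independently with 1-D k-means and each weight is replaced by its centroid; the quantile initialization, iteration count, and function names are hypothetical. No data or training is involved, only the weights themselves.

```python
def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means with deterministic quantile initialization."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iters):
        assign = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]

def cluster_channels(weight_rows, k):
    """Replace every weight in each channel by its cluster centroid."""
    compressed = []
    for row in weight_rows:
        centroids, assign = kmeans_1d(row, k)
        compressed.append([centroids[a] for a in assign])
    return compressed

weights = [[0.10, 0.12, 0.09, -0.50, -0.52, -0.48],
           [1.00, 0.98, 0.02, 0.01, 1.03, 0.00]]
for row in cluster_channels(weights, k=2):
    print([round(w, 3) for w in row])
```

After clustering, a channel can be stored as k centroid values plus a small index per weight, which is where the parameter saving would come from. A mixed-sparsity variant would choose k separately per layer, for example from a layer-wise statistic such as weight variance, as the review describes.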
The experiments conducted on the LibriSpeech dataset are robust, demonstrating significant improvements in word error rates (WER) compared to magnitude-based pruning. The results indicate that the proposed method not only maintains performance but also achieves notable reductions in WER across different sparsity levels. The fine-tuning process is well-structured, although the paper could provide more clarity on the specific configurations and hyperparameters used during fine-tuning for reproducibility.
While the paper outlines the experimental setup and the models used, it lacks specific implementation details such as code availability or a clear description of the clustering algorithm's parameters. This omission could hinder reproducibility for other researchers attempting to validate the findings or build upon the work.
One limitation of the approach is its reliance on the assumption that similar parameters can be effectively clustered without significant loss of information. This may not hold true for all model architectures or datasets. Additionally, while the results are promising, the paper does not explore the long-term effects of the proposed compression on model performance in diverse real-world applications.
The implications of this research are significant, particularly for deploying speech models in resource-constrained environments, such as mobile devices. By enabling efficient model compression without the need for extensive data or training, this work could facilitate broader accessibility and usability of advanced speech recognition technologies in everyday applications.
Neural speech codecs enable low-bitrate speech communication, yet at ultra-low bitrates (< 1000 bps) preserving perceptual quality and intelligibility is challenging. Existing designs often prioritize acoustic details, leaving limited capacity for the core linguistic message under tight bitrate constraints. To address this, we propose ContextCodec, a codec that transmits content-focused context features to explicitly guide reconstruction. ContextCodec adopts a dual-branch encoder that decouples acoustic details from content-focused context. The context branch is trained with a CLIP-style contrastive loss that aligns context features with phoneme indices, reducing paralinguistic leakage. During decoding, these features are injected at each decoding stage for explicit guidance. In addition, we introduce a lightweight autoregressive latent refinement module. Experiments show a strong quality-intelligibility trade-off down to 500 bps, with an RTF of 0.4886 on a typical mobile CPU.
Primary: Tsinghua University
All Institutions: Tsinghua University
The main contribution of this paper is the introduction of ContextCodec, a context-guided neural speech codec that prioritizes linguistic content for ultra-low bitrate speech communication, achieving a favorable balance between intelligibility and perceptual quality. This work represents a meaningful advancement in the field of audio processing, addressing critical challenges in speech coding while offering practical solutions for real-world applications.
The proposed ContextCodec introduces a dual-branch encoder architecture that effectively separates acoustic details from content-focused context, which is a significant methodological advancement in the field of neural speech coding. The use of a CLIP-style contrastive loss for phoneme alignment is innovative, as it directly addresses the challenge of paralinguistic leakage while enhancing linguistic representation. Additionally, the lightweight autoregressive latent refinement module contributes to improved reconstruction quality, showcasing a thoughtful integration of various techniques to optimize performance under stringent bitrate constraints.
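For readers unfamiliar with CLIP-style objectives, the following is a minimal sketch of a symmetric contrastive loss between context features and phoneme embeddings. The pairing of row i of `context` with row i of `phoneme`, the temperature value, and the function names are assumptions for illustration; the paper's exact formulation (for example, how repeated phonemes within a batch are handled) may differ.

```python
import numpy as np

def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(context, phoneme, temperature=0.07):
    """Symmetric InfoNCE: row i of `context` should match row i of `phoneme`."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    p = phoneme / np.linalg.norm(phoneme, axis=1, keepdims=True)
    logits = c @ p.T / temperature  # (N, N) scaled cosine similarities
    idx = np.arange(len(c))
    # Cross-entropy with the diagonal as the target, in both directions.
    loss_c2p = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_p2c = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_c2p + loss_p2c)
```

The loss is small when each context feature is closest to its own phoneme embedding and grows when features drift toward other phonemes, which is the alignment pressure the review describes.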
The experiments conducted are robust, utilizing both objective and subjective evaluation metrics to assess performance at ultra-low bitrates. The paper demonstrates a strong quality-intelligibility trade-off, particularly at 500 bps, which is critical for practical applications. The inclusion of a diverse multilingual dataset for evaluation further strengthens the findings, indicating the codec's potential for cross-lingual generalization. However, the paper could benefit from more extensive comparisons with a broader range of existing codecs to contextualize its contributions more clearly.
The paper provides a detailed description of the architecture, training procedures, and evaluation metrics, which aids reproducibility. However, the absence of a publicly accessible code repository makes it harder for other researchers to replicate the results independently. A link to a GitHub repository or similar would address this.
One limitation noted is the potential underrepresentation of rare or unseen phonemes in the training data, which could affect the codec's performance in diverse linguistic contexts. Additionally, while the paper presents strong results, the subjective evaluation is based on a limited number of listeners and utterances, which may not fully capture the codec's performance across a wider audience.
The ContextCodec has significant implications for real-time speech communication in bandwidth-constrained environments, such as mobile devices and satellite communications. By prioritizing intelligibility and content preservation, it can enhance user experiences in applications like telephony, voice assistants, and other audio communication technologies. The methodology could also inspire future research in speech coding and multimodal audio processing.
We introduce SSL-GMMVC, an interpretable voice conversion method in self-supervised speech space. The method models paired source-target features with a Gaussian mixture model and performs conversion as a posterior-weighted sum of affine transforms. This yields locally linear transformations that adapt to heterogeneous feature-space structure while remaining analytically tractable. Through objective and subjective evaluations, we show that SSL-GMMVC improves speaker similarity with comparable intelligibility and naturalness, and that even a constrained covariance variant surpasses a deep learning baseline as the number of mixture components increases. Further analyses link component selection to phonetic structure and reveal interpretable scaling and rotation in the learned transforms. These findings highlight SSL-GMMVC as an effective, analyzable framework for voice conversion.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
The paper presents SSL-GMMVC, an innovative voice conversion method that enhances speaker similarity while maintaining intelligibility and naturalness. The comprehensive evaluation of its performance and the interpretability of its transformations contribute significantly to the field of audio processing and machine learning.
The proposed SSL-GMMVC method utilizes Gaussian Mixture Models (GMMs) to perform voice conversion in a self-supervised representation space. It innovatively replaces the single global mapping of previous methods with locally linear transformations, allowing for a more nuanced adaptation to the structure of the feature space. The methodology is well-structured, with clear definitions of the model architecture, learning process, and inference mechanisms. The use of both objective and subjective evaluations strengthens the methodology, providing a comprehensive understanding of the model's performance.
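The "posterior-weighted sum of affine transforms" can be made concrete with the classical joint-GMM conversion rule, sketched below. This assumes the standard form in which component m maps x to mu_y[m] + cov_yx[m] cov_xx[m]^{-1} (x - mu_x[m]); the argument names are hypothetical, and the paper's parameterisation (including its constrained-covariance variant) may differ in detail.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    d = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def convert(x, weights, mu_x, mu_y, cov_xx, cov_yx):
    """Convert one source frame x; returns (converted frame, component posteriors)."""
    # Posterior p(m | x) from the source-side marginal of each component.
    log_post = np.array([
        np.log(w) + gaussian_logpdf(x, mx, cxx)
        for w, mx, cxx in zip(weights, mu_x, cov_xx)
    ])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    y = np.zeros(len(mu_y[0]))
    for p, mx, my, cxx, cyx in zip(post, mu_x, mu_y, cov_xx, cov_yx):
        A = cyx @ np.linalg.inv(cxx)   # linear part of the local affine map
        y += p * (my + A @ (x - mx))   # posterior-weighted affine transform
    return y, post
```

With a single component this reduces to one global affine map; with more components, each region of the feature space gets its own linear transform, blended by the posteriors.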
The experiments are rigorously designed, comparing SSL-GMMVC against both traditional and deep learning baselines across various configurations. The dataset is appropriate, comprising American English speech from the CMU ARCTIC corpus, and the evaluation metrics (EER for speaker similarity, WER for intelligibility, and MOS for naturalness) are relevant and robust. The results demonstrate that SSL-GMMVC outperforms LinearVC in specific settings and shows competitive performance against FreeVC, indicating its effectiveness in voice conversion tasks.
The paper provides sufficient implementation details, including the extraction of SSL features and the training process for the GMM. The authors have made the code publicly available on GitHub, which enhances reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup and hyperparameter choices to facilitate replication.
One limitation noted is the challenge of scaling the number of mixture components while maintaining stable parameter estimation in high-dimensional SSL space. Additionally, while the analysis of transformation matrices provides insights, it does not establish clear correspondences between rotational planes across speaker pairs, which could limit interpretability.
The findings from this research have significant implications for applications in voice conversion, including anonymization, language learning, and assistive technologies. The interpretable nature of the model could lead to advancements in understanding and manipulating speech features, potentially influencing future research in speech synthesis and transformation.
The proliferation of text-to-speech (TTS) systems capable of generating realistic synthetic speech poses growing challenges for audio forensics. While binary deepfake detection has received considerable attention, source tracing (i.e., identifying which TTS system produced a given audio sample) remains underexplored, particularly in open-set scenarios where unknown systems may be encountered. We propose a metric learning framework based on the Proxy-Anchor loss function that operates on Wav2Vec2-BERT embeddings to learn a discriminative embedding space for TTS source attribution and out-of-distribution (OOD) detection of unseen systems. We evaluate it on the MLAAD v9 dataset spanning 140 TTS systems across 51 languages, and introduce an architecture merging strategy that groups TTS system versions into unified classes, reducing inter-class confusion. Our system achieves 99.76% accuracy on 110 in-distribution classes and a False Positive Rate (FPR@95) as low as 2.04% for OOD detection. For a fair comparison against the current state of the art, we further evaluate it on the official MLAAD v5 dataset splits, where it almost doubles the OOD accuracy. These results demonstrate that Proxy-Anchor metric learning, combined with architecture-aware class design and post-hoc OOD scoring, provides an effective framework for forensic TTS source tracing in both closed-set and open-set settings.
Primary: POLITEHNICA Bucharest
All Institutions: POLITEHNICA Bucharest, Bitdefender
This paper makes a significant contribution to the field of audio forensics by proposing an innovative metric learning framework for TTS source attribution and OOD detection, demonstrating high accuracy and low false positive rates across a diverse dataset. The methodology and results provide a solid foundation for future advancements in the detection and attribution of synthetic speech.
The paper introduces a novel approach to TTS source attribution through a metric learning framework leveraging the Proxy-Anchor loss function. The methodology is well-structured, utilizing Wav2Vec2-BERT embeddings to create a discriminative embedding space. The architecture merging strategy to reduce inter-class confusion is particularly innovative, addressing a significant challenge in the field. The dual-stage inference process for OOD detection and ID attribution is methodologically sound and effectively integrates multiple scoring functions to enhance performance.
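For reference, the Proxy-Anchor loss treats one learnable proxy per class as an anchor and pulls same-class embeddings toward it while pushing others away. The sketch below is a forward-pass-only NumPy version with the commonly used defaults (alpha = 32, delta = 0.1); those values and the function name are assumptions here, not settings confirmed for this paper.

```python
import numpy as np

def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Proxy-Anchor loss over cosine similarities (forward pass only)."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    sim = e @ p.T  # (N samples, C proxies)
    pos_mask = labels[:, None] == np.arange(len(proxies))[None, :]
    # Positives are penalised when similarity falls below the margin,
    # negatives when it rises above minus the margin.
    pos_exp = np.where(pos_mask, np.exp(-alpha * (sim - delta)), 0.0)
    neg_exp = np.where(~pos_mask, np.exp(alpha * (sim + delta)), 0.0)
    has_pos = pos_mask.any(axis=0)  # proxies with at least one positive in the batch
    pos_term = np.log1p(pos_exp.sum(axis=0))[has_pos].sum() / max(has_pos.sum(), 1)
    neg_term = np.log1p(neg_exp.sum(axis=0)).sum() / len(proxies)
    return pos_term + neg_term
```

Because every proxy sees all samples in the batch, a sample from an unseen system that sits far from every proxy is a natural candidate for OOD scoring, although the paper's specific scoring functions are not shown here.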
The experiments are robust, utilizing the MLAAD v9 dataset, which is comprehensive in terms of the number of TTS systems and languages covered. The results demonstrate high accuracy in both ID classification and OOD detection, with significant improvements over existing methods. The evaluation metrics are appropriate, and the comparison against state-of-the-art methods is thorough, showcasing the effectiveness of the proposed approach.
The implementation details are clearly outlined, including the model architecture, training parameters, and dataset partitioning. The availability of code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the work. However, the paper could benefit from more detailed descriptions of the experimental setup and baseline comparisons.
While the paper presents strong results, it acknowledges the need for future work to address robustness in real-world conditions and adversarial attacks. The reliance on a specific dataset may limit generalizability, and the merging strategy, while effective, may not be universally applicable across all TTS systems.
The proposed framework has significant implications for audio forensics, particularly in combating the misuse of TTS technology in fraud and disinformation. The ability to trace the source of synthetic speech can enhance trust in audio content and support legal investigations. The approach also opens avenues for further research in OOD detection and metric learning in audio applications.
While Speech Large Language Models (Speech-LLMs) have achieved strong performance on adult Automatic Speech Recognition (ASR), their effectiveness on child speech remains under-explored, and single models often struggle to handle diverse adult and child age groups simultaneously. This paper proposes a Mixture-of-Experts (MoE) Speech-LLM for unified ASR across adult and child speech spanning diverse environments and age groups. The framework employs a Classifier-based Domain Router (C-DR) with a coarse-to-fine strategy and integrates both a Mixture-of-Projectors (MoP) and a Mixture-of-LoRAs (MoL) to model domain-specific variations. To address routing uncertainty near domain boundaries, an Entropy-Aware Routing (EAR) mechanism is introduced to dynamically incorporate a shared expert. Experiments on public child corpora demonstrate consistent improvements over baselines while preserving adult ASR performance. To our knowledge, this is the first work leveraging Speech-LLMs for unified, multi-domain ASR encompassing both children and adults.
Primary: University of California, Los Angeles
All Institutions: University of California, Los Angeles
This paper presents a novel Entropy-Aware Domain-Routed Mixture-of-Experts Speech-LLM framework that effectively tackles the challenges of ASR across diverse child and adult speech domains. The integration of advanced routing mechanisms and domain-specific modeling represents a meaningful advancement in the field, with potential applications in various educational and assistive technologies.
The proposed methodology introduces a Mixture-of-Experts (MoE) framework that integrates a Classifier-based Domain Router (C-DR) and an Entropy-Aware Routing (EAR) mechanism, which are innovative in addressing the challenges of Automatic Speech Recognition (ASR) across heterogeneous domains, specifically for child and adult speech. The use of a coarse-to-fine classification strategy for domain routing and the incorporation of both Mixture-of-Projectors (MoP) and Mixture-of-LoRAs (MoL) are particularly noteworthy, as they allow for specialized handling of acoustic and linguistic variations. This dual approach enhances the model's adaptability and robustness, making it a significant contribution to the field.
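One plausible reading of entropy-aware routing is sketched below: the router's domain posterior is summarised by its normalised entropy, and a shared expert is blended in only when that entropy exceeds a threshold. The threshold, the use of the entropy itself as the blending weight, and the function names are all hypothetical; the paper's actual EAR rule is not specified in this review.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a distribution, scaled to [0, 1] by log(K)."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def route(probs, threshold=0.5):
    """Return (index of the top domain expert, weight given to the shared expert)."""
    h = normalized_entropy(probs)
    top = max(range(len(probs)), key=probs.__getitem__)
    # Confident routing: use the domain expert alone.
    # Uncertain routing (near a domain boundary): mix in the shared expert.
    shared_weight = h if h > threshold else 0.0
    return top, shared_weight
```

Under this sketch, an utterance the classifier assigns to one domain with high confidence bypasses the shared expert entirely, while an ambiguous one leans on it heavily.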
The experiments conducted on public child corpora demonstrate consistent improvements over baseline models, validating the effectiveness of the proposed framework. The paper provides a comprehensive evaluation of the model's performance across various age groups and environments, showcasing its capability to maintain adult ASR performance while improving child ASR. The use of multiple datasets and the clear presentation of results, including comparisons with state-of-the-art methods, strengthen the experimental evaluation.
The paper outlines the training and inference settings in detail, including the architecture used, the datasets, and the training parameters. However, the lack of publicly available code or data limits reproducibility. The authors could enhance this aspect by providing a GitHub repository or similar resources for other researchers to replicate their findings.
One limitation is the reliance on specific datasets, which may not generalize to all child speech scenarios. Additionally, while the model shows improvements, it may still struggle with extreme cases of domain mismatch or highly diverse acoustic environments. The paper does not address the computational complexity introduced by the MoE architecture, which could be a concern for real-time applications.
The proposed framework has significant implications for improving ASR systems for children, which could enhance accessibility in educational technologies and speech therapy applications. By addressing the unique challenges posed by child speech, this research could lead to more inclusive and effective communication technologies.
Transformer-based Speech Foundation Models excel in most Automatic Speech Recognition tasks but often suffer performance degradation when applied to domains with mismatched acoustic characteristics. While Parameter Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), adjust global attention, they lack the local context modeling crucial for capturing domain-specific variations. We propose GC-LoRA, a novel adapter architecture that injects Conformer-style local convolutional processing into pretrained Transformer encoders. By integrating a lightweight adapter into the encoder attention output projections, our method efficiently captures local acoustic dependencies without disrupting pretrained global representations. Experiments across diverse datasets (acoustically-degraded, bandlimited, dialectal, child) demonstrate the efficacy of our approach, achieving Word Error Rate (WER) reductions of up to 10.9% compared to baselines while adding minimal trainable parameters.
Primary: University of California, Los Angeles
All Institutions: University of California, Los Angeles
The paper presents GC-LoRA, a novel adapter architecture that enhances parameter-efficient acoustic adaptation in Transformer-based ASR systems. The integration of local convolutional processing addresses critical limitations in existing methods, demonstrating substantial improvements in performance across diverse acoustic datasets.
The methodology presented in this paper introduces GC-LoRA, which integrates local convolutional processing into the low-rank adaptation framework of Transformer models. The authors effectively leverage the strengths of Conformer architectures to enhance local context modeling while maintaining parameter efficiency. The design choice to embed gated depthwise-separable convolutions within the LoRA bottleneck is innovative and addresses a critical gap in existing PEFT methods. The detailed description of the architecture and its components, including the gating mechanism and the use of Group Normalization, demonstrates a solid understanding of both theoretical and practical aspects of model adaptation.
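To make the idea of a convolution inside the LoRA bottleneck more tangible, here is a heavily simplified forward pass: down-project to rank r, apply a depthwise convolution over time, gate the result, and up-project. Everything about the layout is an assumption (sigmoid gate computed from the bottleneck, odd kernel size with zero padding, no Group Normalization or pointwise convolution, zero-initialised up-projection as in standard LoRA); it is not the authors' architecture.

```python
import numpy as np

def gated_conv_lora(x, A, B, dw_kernel, gate_w, scale=1.0):
    """x: (T, d_in), A: (d_in, r), B: (r, d_out), dw_kernel: (k, r), gate_w: (r, r)."""
    z = x @ A                                  # (T, r) low-rank down-projection
    k = dw_kernel.shape[0]                     # kernel size, assumed odd
    pad = k // 2
    zp = np.pad(z, ((pad, pad), (0, 0)))       # zero-pad along time only
    # Depthwise convolution: each bottleneck channel has its own length-k filter.
    conv = np.stack([(zp[t:t + k] * dw_kernel).sum(axis=0) for t in range(len(z))])
    gate = 1.0 / (1.0 + np.exp(-(z @ gate_w)))  # sigmoid gate in the bottleneck
    return scale * ((conv * gate) @ B)          # (T, d_out) update added to the frozen path
```

With B initialised to zero the adapter contributes nothing at the start of training, so the pretrained projection is initially unchanged; the convolution adds local temporal context at a cost of only k x r extra parameters in this sketch.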
The experiments are well-structured, utilizing diverse datasets that reflect various acoustic challenges, including environmental degradation and dialectal variations. The results indicate statistically significant improvements in Word Error Rate (WER) across all tested datasets compared to standard LoRA and other baselines. The inclusion of ablation studies adds depth to the evaluation, allowing for a clear understanding of the contributions of different components of GC-LoRA. The performance metrics are robust, and the authors provide a thorough analysis of the results, enhancing the credibility of their findings.
The paper includes sufficient detail regarding the experimental setup, including model configurations, training protocols, and datasets used. However, the reproducibility could be further strengthened by providing more explicit instructions or scripts for replicating the experiments, especially for those who may not have access to the same resources or datasets.
One limitation of the study is the reliance on specific datasets, which may not fully encompass the range of acoustic variations encountered in real-world applications. Additionally, while the results are promising, the generalizability of GC-LoRA to other domains or tasks beyond those tested remains to be explored. The paper could also benefit from a discussion on the computational overhead associated with the proposed method, particularly in deployment scenarios.
The proposed GC-LoRA architecture has significant implications for improving the robustness of ASR systems in diverse acoustic environments. By enabling efficient adaptation of Transformer models to domain-specific variations, this work could enhance accessibility and usability in applications such as voice recognition for children, dialectal speech processing, and environments with background noise. The findings could influence future research directions in parameter-efficient model adaptation and contribute to the development of more resilient speech technologies.
Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
Primary: University of California, Los Angeles
All Institutions: University of California, Los Angeles
The paper presents Gumbel-BEARD, a novel framework for automatic layer selection in speech recognition models, which significantly enhances performance in low-resource domains. The innovative methodology and rigorous experimental evaluation position this work as a meaningful contribution to the field of machine learning and audio processing.
The proposed Gumbel-BEARD framework introduces a novel approach to layer selection in domain adaptation for speech recognition models, specifically targeting low-resource domains. By leveraging a hard Gumbel-Softmax selector, the method automates the process of selecting the optimal encoder layer for self-supervised adaptation, which is a significant advancement over previous methods that relied on fixed layer selection. This end-to-end trainable approach allows for dynamic adaptation to varying acoustic characteristics, addressing a critical gap in existing frameworks. The integration of a BEST-RQ objective further enhances the model's ability to adapt without manual tuning, showcasing a robust methodological innovation.
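The hard Gumbel-Softmax selector can be illustrated with a forward-pass-only sketch: Gumbel noise is added to per-layer logits, a temperature-scaled softmax gives a relaxed distribution, and the arg-max is turned into a one-hot choice of encoder layer. The function names and tensor shapes are assumptions; in an autograd framework the one-hot vector would be used in the forward pass while gradients flow through the soft distribution (the straight-through trick), which plain NumPy cannot show.

```python
import numpy as np

def hard_gumbel_softmax(logits, tau=1.0, rng=None):
    """Return (one-hot sample, relaxed distribution) for a vector of float logits."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = (logits + rng.gumbel(size=logits.shape)) / tau
    noisy = noisy - noisy.max()  # numerical stability
    soft = np.exp(noisy) / np.exp(noisy).sum()
    hard = np.zeros_like(soft)
    hard[soft.argmax()] = 1.0
    return hard, soft

def select_layer(layer_outputs, logits, tau=1.0, rng=None):
    """layer_outputs: (L, T, D). Picks exactly one layer's features."""
    hard, _ = hard_gumbel_softmax(logits, tau, rng)
    return np.tensordot(hard, layer_outputs, axes=1), int(hard.argmax())
```

Because the selection logits are ordinary trainable parameters, the choice of layer can be learned jointly with the adaptation objective rather than fixed by hand.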
The experimental setup is comprehensive, utilizing multiple datasets (MyST, OGI Kids, and CORAAL) to evaluate the framework's performance across different domains. The results demonstrate that Gumbel-BEARD achieves state-of-the-art word error rates, significantly outperforming both supervised fine-tuning and the original BEARD framework. The statistical significance of the results is rigorously assessed, reinforcing the validity of the findings. The experiments effectively illustrate the scalability of the method to larger architectures, which is crucial for real-world applications.
The paper provides sufficient detail regarding the implementation of Gumbel-BEARD, including hyperparameter settings and experimental protocols. However, the absence of a publicly available code repository limits the reproducibility of the results. Future work should consider releasing the code to facilitate validation and further exploration by the research community.
While the approach shows promise, it is primarily evaluated on specific datasets, which may limit its generalizability to other low-resource domains not represented in the experiments. Additionally, the reliance on a single architecture (Whisper) may restrict the applicability of the findings to other speech recognition models. The potential computational overhead introduced by the dynamic layer selection process could also be a concern in resource-constrained environments.
The Gumbel-BEARD framework has significant implications for improving speech recognition systems in low-resource settings, particularly for applications involving child speech and dialectal variations. By automating layer selection, the method reduces the need for extensive labeled data, making it more accessible for researchers and practitioners working in underrepresented domains. This advancement could lead to broader adoption of speech recognition technologies in diverse linguistic and cultural contexts, ultimately enhancing communication and accessibility.
User-defined keyword spotting (KWS) enables personalized voice interaction by detecting user-specified keywords. A key challenge in this task is distinguishing target keywords from phonetically confusable alternatives. To address this challenge, we propose KFC-KWS, a multimodal framework that leverages connectionist temporal classification (CTC)-guided keyframe selection. Specifically, we exploit the peaky posterior distributions of CTC to identify high-confidence phoneme frames, enabling precise alignment across audio, phoneme, and text modalities. These keyframes are then fused with full-utterance representations through cross-attention to capture both local discriminative cues and global contextual information. On LibriPhrase, KFC-KWS achieves the best-balanced performance (98.73% AUC) and substantially outperforms advanced baselines on the challenging hard subset (97.65% AUC and 7.75% EER), demonstrating its effectiveness in discriminating between highly confusable keywords.
Primary: Hangzhou Dianzi University
All Institutions: Hangzhou Dianzi University, School of Communication Engineering, School of Electronics and Information Engineering
The main contribution of this paper is the introduction of KFC-KWS, a multimodal framework that utilizes CTC-guided keyframe selection to achieve superior performance in user-defined keyword spotting. This work represents a significant advancement in the field of audio processing and keyword detection, addressing key challenges associated with phonetically confusable keywords and enhancing the robustness of voice-interactive systems.
The proposed KFC-KWS framework introduces a novel approach to user-defined keyword spotting by leveraging connectionist temporal classification (CTC) for keyframe selection. This method effectively identifies high-confidence phoneme frames and fuses them with full-utterance representations using cross-attention mechanisms. The integration of CTC-guided keyframe selection is a significant methodological advancement, as it allows for fine-grained phoneme-level cross-modal matching, addressing a critical challenge in distinguishing phonetically confusable keywords. The use of random modality masking during training is another innovative aspect that enhances robustness.
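CTC posteriors tend to be "peaky": most frames are dominated by the blank symbol, with short spikes for each phoneme. A minimal sketch of keyframe selection built on that property is given below; the blank index, the confidence threshold, and the rule of keeping one peak frame per run of identical labels are assumptions for illustration rather than the paper's exact procedure.

```python
def select_keyframes(posteriors, blank=0, threshold=0.5):
    """posteriors: list of per-frame probability lists. Returns [(frame_index, label), ...]."""
    keyframes = []
    current = None  # (frame_index, label, confidence) of the best frame in the open run
    for t, frame in enumerate(posteriors):
        label = max(range(len(frame)), key=frame.__getitem__)
        conf = frame[label]
        confident = label != blank and conf >= threshold
        # Close the open run when the label changes or the frame is blank/uncertain.
        if current is not None and (not confident or label != current[1]):
            keyframes.append(current[:2])
            current = None
        # Within a run of the same label, keep only the most confident frame.
        if confident and (current is None or conf > current[2]):
            current = (t, label, conf)
    if current is not None:
        keyframes.append(current[:2])
    return keyframes
```

The selected frames can then be gathered from the audio encoder output and matched against the phoneme and text representations, while the full utterance supplies global context.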
The experiments conducted on the LibriPhrase dataset demonstrate the effectiveness of KFC-KWS, achieving the highest balanced performance metrics (98.73% AUC) compared to existing state-of-the-art methods. The paper provides a thorough evaluation across different subsets, including both easy and hard keywords, which is crucial for practical deployment. The ablation studies further validate the contributions of each modality, showcasing the importance of phoneme and text encoders in the overall performance.
The implementation details are well-documented, including the choice of pre-trained encoders, training parameters, and the overall architecture. However, the reproducibility could be enhanced by providing access to the complete codebase and trained models, which are not currently available.
One limitation is the reliance on the LibriPhrase dataset, which may not fully represent real-world scenarios with diverse accents and noisy environments. Additionally, while the proposed method shows significant improvements, it may still struggle with extreme phonetic variations or in highly noisy conditions, which are common in practical applications.
The KFC-KWS framework has the potential to significantly enhance user-defined keyword spotting systems, making them more adaptable and efficient for personalized voice interactions. This could lead to improved user experiences in various applications, including smart home devices, virtual assistants, and accessibility tools for individuals with speech impairments.
The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five-class component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. Following the conclusion of the challenge, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports challenge results and provides insights for future environment-aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, Korea Advanced Institute of Science and Technology, The Chinese University of Hong Kong, Johns Hopkins University, The University of Melbourne
The paper presents a comprehensive overview of the ESDD2 challenge, highlighting significant advancements in environment-aware audio deepfake detection. The methodology and results provide valuable insights for future research, emphasizing the importance of modular designs and diverse modeling approaches in tackling complex audio manipulation challenges.
The methodology presented in the paper is robust, focusing on a comprehensive challenge that evaluates various audio deepfake detection systems. The challenge design emphasizes component-level audio spoofing detection, which is a significant advancement over traditional whole-utterance approaches. The use of modular task decomposition, cross-domain self-supervised learning, and targeted data augmentation strategies demonstrates a thoughtful approach to tackling the complexities of audio manipulation. The paper effectively summarizes the design choices of top-performing systems, which can guide future research.
The experimental evaluation is thorough, with a clear presentation of results from 13 teams that participated in the challenge. The leaderboard rankings based on Macro-F1 scores provide a quantitative measure of system performance. The auxiliary EER analyses further enrich the evaluation by highlighting specific challenges in detecting environmental components. The dataset used, CompSpoofV2, is extensive and well-structured, allowing for meaningful comparisons across different systems.
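Since the leaderboard is ranked by Macro-F1, it may help to recall how that score is formed: F1 is computed per class and then averaged without weighting, so rare classes count as much as frequent ones. The following is a minimal sketch in plain Python; the label values and example predictions are made up for illustration and are not taken from the challenge.

```python
from collections import Counter

def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1 scores."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Illustrative three-class toy data (hypothetical, not challenge data)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
```

Here the per-class F1 values are 0.5, 0.8, and 2/3, so the macro average is about 0.656.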
The paper emphasizes reproducibility by making the CompSpoofV2 dataset and baseline code publicly available. This is crucial for the research community, as it allows others to replicate the experiments and build upon the findings. However, the paper could benefit from more detailed descriptions of the specific implementations used by top-performing teams to enhance reproducibility further.
One limitation noted in the paper is the persistent difficulty in detecting spoofed environmental components and generalizing to unseen generators. This indicates that while significant progress has been made, challenges remain in achieving robustness across diverse audio types and generation settings. Additionally, the reliance on specific architectures and data augmentation techniques may limit the applicability of findings to other contexts.
The findings from this research have significant implications for audio security and integrity, particularly in applications where audio authenticity is critical, such as in journalism, legal settings, and content creation. The challenge fosters innovation in deepfake detection methods and encourages the development of more robust systems that can adapt to evolving spoofing techniques.
Recent research has explored integrating Large Language Models (LLMs) with speech encoders to create speech-augmented LLMs capable of contextualized speech recognition. The main challenge lies in aligning the semantic embeddings of LLMs with the acoustic representations of speech encoders. We propose a novel approach that teaches the LLM to first predict phonemes from the speech features before generating the final transcript. By integrating a phoneme prediction step directly into the LLM, the model develops a fine-grained knowledge of pronunciation, reducing acoustic confusion and improving transcription accuracy and explainability. Our method is cheap and simple, as phoneme targets can be automatically derived from existing transcripts. Through comprehensive experiments, we show that intermediate phoneme prediction can improve speech recognition, particularly in low-resource settings, and yields outputs that are acoustically more faithful to the speech.
Primary: KU Leuven
All Institutions: KU Leuven
The paper presents a novel phoneme-first prediction method for LLM-based speech recognition, significantly enhancing transcription accuracy and explainability. The innovative integration of phoneme prediction into the speech recognition pipeline represents a meaningful contribution to the field, with potential applications across various domains.
The proposed methodology of integrating phoneme-first prediction into LLM-based speech recognition is innovative, addressing the challenge of aligning semantic and acoustic representations effectively. By focusing on phoneme prediction before word transcription, the authors provide a structured approach that enhances the model's understanding of pronunciation nuances, which is a significant advancement in the field of speech recognition. The use of existing transcripts to derive phoneme targets adds practicality to the approach, making it accessible for various applications.
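To make the idea of deriving phoneme targets from existing transcripts concrete, here is a small sketch of how a phoneme-first training target could be assembled. The toy lexicon, the `<text>` separator token, and the `<unk>` fallback are all hypothetical choices for illustration; the paper's actual phoneme inventory and target format are not specified here.

```python
# Hypothetical toy lexicon; a real setup would use a pronunciation
# dictionary or a grapheme-to-phoneme tool instead.
TOY_LEXICON = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def phoneme_first_target(transcript, lexicon=TOY_LEXICON, sep="<text>"):
    """Build a target string: phonemes first, then the word transcript."""
    words = transcript.lower().split()
    phonemes = []
    for word in words:
        # Unknown words fall back to a placeholder (an assumption of this sketch)
        phonemes.extend(lexicon.get(word, ["<unk>"]))
    return " ".join(phonemes) + f" {sep} " + " ".join(words)

target = phoneme_first_target("The cat sat")
```

With this format the model would be trained to emit `DH AH K AE T S AE T <text> the cat sat`, so the phoneme sequence is available as context before any word is generated.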
The experiments conducted on multiple datasets, including LibriSpeech and TED-LIUM, demonstrate the effectiveness of the phoneme-first approach. The results indicate substantial improvements in word error rates across different settings, particularly in low-resource environments. The comparative analysis with standard speech-to-text methods showcases the robustness of the proposed technique, although further exploration of larger datasets and diverse languages could strengthen the findings.
The paper provides detailed implementation specifics, including the architecture of the speech encoder, projection layer, and LLM, as well as the training setup. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should include sharing the model and code to facilitate validation by the research community.
One limitation noted is the potential for slower inference times due to the two-step prediction process. Additionally, while the phoneme-first method shows promise, it may not always outperform fully trained ASR models, particularly in data-rich scenarios. The reliance on the quality of phoneme labels also poses a challenge, as automatic phoneme generation may introduce errors.
This research has significant implications for the development of more accurate and explainable speech recognition systems, particularly in applications requiring high fidelity in transcription. The phoneme-first approach could enhance user interactions with voice-activated systems and improve accessibility for individuals with speech impairments. Furthermore, the methodology lays the groundwork for future research in phonetic modeling and speech assessment tasks.
Humans process rich auditory environments through tightly integrated cognitive capabilities such as audio perception, audio reasoning, and memory. Despite recent progress in large audio-language models (LALMs) across speech understanding and multimodal audio reasoning, current evaluation paradigms remain largely task- or modality-centric, focusing on end performance while overlooking underlying auditory cognitive behaviours. This reveals a fundamental gap between how auditory cognition is understood in humans and how it is evaluated in LALMs, particularly in the lack of frameworks that operationalise cognitive principles beyond task-level metrics to systematically capture model behaviour. In this work, we introduce RAIL, a human-centric evaluation paradigm grounded in the Cattell-Horn-Carroll (CHC) cognitive framework. RAIL formalises auditory cognition into five core capabilities and develops them into structured evaluation tasks that probe how models process, retain, and integrate auditory information. We further construct a cognitively grounded benchmark with principled data curation and human-aligned evaluation protocols. Evaluating 26 state-of-the-art LALMs, we find that current models exhibit highly uneven performance across cognitive abilities. RAIL establishes a new evaluation paradigm that moves beyond task-centric benchmarking toward cognitively grounded assessment of auditory intelligence.
Primary: The University of Melbourne
All Institutions: The University of Melbourne, Alexandru Ioan Cuza University of Iași, Wuhan University, The University of Hong Kong, The University of Auckland, Monash University
The main contribution of this paper is the introduction of a cognitively grounded evaluation framework for auditory intelligence in large audio-language models, which moves beyond traditional task-centric approaches and provides a structured assessment of auditory cognitive capabilities. This work is significant as it lays the groundwork for future research aimed at developing more interpretable and human-aligned audio intelligence systems.
The paper introduces a novel evaluation framework, RAIL, grounded in the Cattell-Horn-Carroll (CHC) cognitive framework, which formalizes auditory cognition into five core capabilities. This structured approach allows for a more nuanced assessment of large audio-language models (LALMs) beyond traditional task-centric evaluations. The methodology is rigorous, involving a four-stage benchmark curation pipeline that includes cognitive framework selection, task formulation, dataset curation, and quality control. The tasks developed are well-defined and target specific auditory cognitive capabilities, which is a significant advancement in the evaluation of LALMs.
The authors evaluate 26 state-of-the-art LALMs, providing a comprehensive analysis of their performance across the proposed benchmark tasks. The results reveal significant gaps in auditory perception, reasoning, and memory capabilities, highlighting the limitations of current models. The evaluation metrics used are appropriate, and the findings are well-supported by empirical data. However, the paper could benefit from more detailed statistical analysis of the results to strengthen the claims made.
The paper provides sufficient details on the benchmark construction, model evaluation protocol, and data preparation, which are crucial for reproducing the experimental results. The authors mention that additional implementation details and evaluation scripts will be provided with the released benchmark/code package, which enhances reproducibility.
The paper discusses several limitations, including potential biases in dataset coverage, assumptions made during task design, and the generalizability of the benchmark beyond the evaluated models. However, a more explicit discussion of the implications of these limitations on the results would strengthen the paper.
The proposed benchmark has the potential to significantly enhance the evaluation of audio-language models, leading to more human-aligned and cognitively grounded systems. The paper addresses both positive impacts, such as improved model evaluation, and negative societal impacts, including potential misuse and bias amplification. The authors suggest safeguards for responsible data release, which is commendable.
We introduce a spoofing countermeasure architecture conditioned on speaker-reference recordings, but observe that it converges to a solution that effectively ignores the reference during inference. Surprisingly, training with a reference channel induces invariance that improves deepfake detection, even when the reference is absent or mismatched during inference. Based on this observation, we propose a Reference-Augmented Training (RAT) strategy. RAT yields improved detection performance compared to single-utterance baselines, even when the reference recording is replaced with a zero vector at inference. Through rigorous analysis, we demonstrate that the optimization process rapidly diminishes the reference contributions, leading to inference largely independent of the reference channel. Using RAT, we achieve state-of-the-art 2.57% EER and 0.074 minDCF on the ASVspoof 5 benchmark with a single detector, surpassing even large ensemble systems.
Primary: Brno University of Technology
All Institutions: Brno University of Technology
The main contribution of this paper is the introduction of Reference-Augmented Training (RAT), which enhances anti-spoofing performance by conditioning on reference recordings while demonstrating that the model can achieve robust performance even when the reference is absent during inference. This innovative approach, combined with rigorous experimental validation, positions the work as a significant advancement in the field of audio-based machine learning.
The proposed Reference-Augmented Training (RAT) methodology is innovative in its approach to integrating reference recordings into the training of anti-spoofing models. The use of a Reference-Informed Block (RIB) that combines multi-layer perceptrons and cross-attention mechanisms is a significant advancement over traditional single-input architectures. The authors effectively demonstrate that the model can learn to rely less on the reference input over time, which is a novel insight into training dynamics in deep learning models for audio processing.
The experiments are rigorously designed, utilizing the ASVspoof 5 benchmark, which is a comprehensive dataset that includes various spoofing attacks and conditions. The results show that RAT achieves state-of-the-art performance metrics (2.57% EER and 0.074 minDCF), outperforming existing methods, including ensemble systems. The ablation studies conducted to assess the model's performance under different reference conditions provide strong evidence of the robustness and effectiveness of the proposed method.
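For readers less familiar with the headline metric, the equal error rate (EER) is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below approximates it by sweeping thresholds over the observed scores; the scores are invented for illustration, and the convention that higher scores mean "more likely bona fide" is an assumption of this example rather than something stated about the paper's detector.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER; higher score is assumed to mean 'more likely bona fide'."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed trials scored at or above the threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Hypothetical detector scores, not from ASVspoof 5
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
eer = equal_error_rate(bonafide, spoof)
```

In this toy example one of four bona fide trials and one of four spoofed trials are misclassified at the balancing threshold, giving an EER of 0.25.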
The authors have made their training and evaluation framework publicly available, which enhances the reproducibility of their results. They provide clear details about the model architecture, training procedure, and evaluation metrics, which are essential for other researchers to replicate their findings.
One limitation of the study is the reliance on a specific pretrained model (XLS-R) for feature extraction, which may not generalize to other architectures or datasets. Additionally, while the results are impressive, the paper does not explore the scalability of the RAT approach in real-world applications or its performance in scenarios with significantly different acoustic conditions.
The findings of this research have significant implications for the field of automatic speaker verification and anti-spoofing systems. By demonstrating that a model can be trained to be less dependent on reference inputs, this work opens avenues for developing more robust and flexible anti-spoofing solutions that can adapt to varying conditions in real-world applications.
Recent multimodal large language models mainly process audio as monaural signals, thereby discarding the spatial cues contained in spatial audio for sound localization, spatial relation reasoning, and spatial scene understanding. We propose Spatial-Omni, a lightweight method that implements SO-Encoder to inject First-Order Ambisonics (FOA) spatial audio into existing Omni LLMs as an independent modality, without modifying their original audio encoders. SO-Encoder provides spatial tokens with limited additional context cost and improves spatial audio understanding through efficient staged training. To support training and evaluation, we construct SO-Dataset, SO-QA, and SO-Bench from open-source data, real recordings, and simulations, containing 400K FOA spatial audio clips and 2.1M spatial question answering pairs. SO-Bench covers 16 spatial audio understanding subtasks, including basic detection and location estimation, spatial relation understanding, and complex spatial reasoning. Experiments show that Spatial-Omni outperforms existing open-source Large Audio-Language Models (LALMs) and Omni LLM models on spatial audio understanding tasks while retaining a reasonable level of general audio understanding. Code and data are available at https://github.com/dieKarotte/Spatial-Omni.
Primary: Zhejiang University
All Institutions: Zhejiang University, Tencent Hunyuan
The main contribution of this work is the development of Spatial-Omni, a framework that enhances spatial audio understanding in multimodal LLMs through the innovative integration of FOA spatial audio, supported by comprehensive datasets and benchmarks. This research represents a significant step forward in the field of audio understanding, offering a robust methodology and promising results that could influence future developments in spatial audio processing and multimodal learning.
The paper introduces Spatial-Omni, a novel framework that integrates First-Order Ambisonics (FOA) spatial audio into existing large language models (LLMs) without altering their original audio encoders. The SO-Encoder is a lightweight addition that extracts spatial cues and provides spatial tokens, enhancing the model's understanding of spatial audio. The methodology is well-structured, employing a parallel architecture that preserves the semantic capabilities of the original audio encoder while allowing for the integration of spatial audio features. The staged training approach is particularly noteworthy, as it mitigates potential interference between spatial and semantic learning.
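As background on why FOA input carries cues that a monaural signal lacks, the sketch below encodes a single sample arriving from a given direction into the four first-order channels and then recovers the azimuth from them. It assumes the AmbiX convention (ACN channel order W, Y, Z, X with SN3D normalization); the paper's own coordinate convention is not stated here, and this is not a description of how SO-Encoder works.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode one mono sample into FOA channels (W, Y, Z, X), AmbiX assumed."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample                                  # omnidirectional component
    y = sample * math.sin(az) * math.cos(el)    # left-right axis
    z = sample * math.sin(el)                   # up-down axis
    x = sample * math.cos(az) * math.cos(el)    # front-back axis
    return (w, y, z, x)

def estimate_azimuth_deg(w, y, z, x):
    """Recover azimuth from the horizontal intensity components W*Y and W*X."""
    return math.degrees(math.atan2(w * y, w * x))

w, y, z, x = encode_foa(1.0, azimuth_deg=90.0, elevation_deg=0.0)
azimuth = estimate_azimuth_deg(w, y, z, x)
```

A source placed at 90 degrees azimuth ends up almost entirely in the Y channel, and the direction can be read back from the channel ratios. Downmixing to mono discards exactly this information.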
The authors construct comprehensive datasets (SO-Dataset, SO-QA) and benchmarks (SO-Bench) that facilitate the training and evaluation of spatial audio understanding. The experiments demonstrate that Spatial-Omni significantly outperforms existing models on various spatial audio tasks, showcasing the effectiveness of the proposed approach. The evaluation metrics are appropriate and well-defined, allowing for a thorough comparison with baseline models. The results indicate strong improvements across multiple subtasks, validating the technical contributions of the paper.
The paper provides sufficient details regarding the implementation, datasets, and evaluation metrics, which enhances the reproducibility of the results. The availability of code and datasets on GitHub further supports this aspect, enabling other researchers to replicate the experiments and build upon the work.
The paper acknowledges several limitations, including the focus on FOA input under a unified coordinate convention, which may restrict the model's applicability to other spatial representations. Additionally, the reliance on track-level supervision may limit performance in complex scenes with multiple overlapping sources. There is also a noted degradation in general audio capabilities when integrating spatial features, suggesting a need for better balance in future work.
The advancements in spatial audio understanding have significant implications for various applications, including virtual reality, augmented reality, and immersive audio experiences. The integration of spatial audio into LLMs can enhance user interactions in these domains, leading to more intuitive and engaging experiences. However, ethical considerations regarding privacy and potential misuse of spatial audio data must be addressed to ensure responsible deployment.
Speech-aware large language models (LLMs) can incorporate speech through pre-trained acoustic encoders that project speech features into the LLM embedding space. While the choice of the speech encoder critically influences performance, different encoders often exhibit complementary strengths, motivating their combination. In this work, we investigate whether fusing multiple pre-trained speech encoders can enhance speech-aware LLMs for automatic speech recognition (ASR). We explore several fusion strategies beyond simple feature concatenation, including learned combinations and Transformer-based fusion architectures, and evaluate them across mono- and multilingual ASR settings as well as diarized speech recognition. Our results indicate that carefully fusing multiple parallel speech encoders improves downstream performance in all scenarios with limited computational overhead.
Primary: KU Leuven
All Institutions: KU Leuven
The main contribution of this paper is the innovative exploration of encoder fusion techniques to enhance speech-aware LLMs for automatic speech recognition. This work significantly advances the field by demonstrating that combining multiple pre-trained speech encoders can lead to improved performance across various ASR tasks, highlighting the importance of leveraging complementary strengths in machine learning models.
The paper presents a robust methodology for fusing multiple pre-trained speech encoders to enhance automatic speech recognition (ASR) capabilities in speech-aware large language models (LLMs). The authors explore various fusion strategies, including simple concatenation, learned combinations, and advanced Transformer-based architectures. This comprehensive exploration of fusion techniques, particularly in the context of multilingual and diarized speech recognition, demonstrates a thoughtful approach to leveraging the complementary strengths of different encoders. The detailed description of the encoder fusion process, including the use of gating mechanisms and multi-head attention, showcases a solid understanding of the underlying principles of neural networks and attention mechanisms.
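One of the simpler learned combinations mentioned above, a gate that mixes two encoders' features, can be sketched as follows. This is an illustrative formulation only: the scalar gate, the `gate_weights` and `gate_bias` parameters, and the assumption that both encoders already produce time-aligned frames of equal dimension are choices of this sketch, not details confirmed by the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(feat_a, feat_b, gate_weights, gate_bias=0.0):
    """Fuse two equal-length frame vectors with a scalar gate.

    The gate is computed from the concatenated features, then used to
    interpolate between the two encoders' outputs.
    """
    concat = list(feat_a) + list(feat_b)
    gate = sigmoid(sum(w * v for w, v in zip(gate_weights, concat)) + gate_bias)
    return [gate * a + (1.0 - gate) * b for a, b in zip(feat_a, feat_b)]

# Hypothetical frame features from two encoders
frame_a = [1.0, 2.0]
frame_b = [3.0, 4.0]
# Zero weights give a gate of 0.5, i.e. a plain average of both encoders
fused = gated_fusion(frame_a, frame_b, gate_weights=[0.0, 0.0, 0.0, 0.0])
```

With trained weights the gate can lean toward whichever encoder is more informative for a given frame, which is one way complementary strengths might be exploited at low computational cost.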
The experiments are well-structured, covering monolingual, multilingual, and diarized speech recognition tasks. The authors provide clear evaluations using normalized Word Error Rates (WER) across different configurations, which allows for a thorough comparison of the proposed methods against baseline models. The results indicate significant improvements in performance, particularly with the temporal transformer fusion method, which suggests that the fusion strategies are effective. However, the paper could benefit from additional quantitative metrics beyond WER to provide a more comprehensive evaluation of the models' performance.
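Because all comparisons rest on WER, a short reference implementation may be useful: WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below uses whitespace tokenization and applies no text normalization, which is a simplification relative to the normalized WER reported in the paper; the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The hypothesis contains one substitution and one deletion against six reference words, so the WER is 2/6, roughly 0.33.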
The paper includes sufficient implementation details, such as the architecture of the models, training procedures, and datasets used. However, the lack of publicly available code or a project URL limits the reproducibility of the results. Providing access to the trained models and code would greatly enhance the paper's impact and allow other researchers to validate and build upon the findings.
One limitation of the study is the focus on short-form speech recognition, which may not generalize well to longer or more complex speech tasks. Additionally, while the paper discusses the computational efficiency of the fusion methods, it does not provide detailed comparisons of the computational costs associated with each approach. The reliance on specific datasets may also limit the applicability of the findings to other languages or domains.
The research has significant implications for the development of more effective and versatile ASR systems, particularly in multilingual contexts. By improving the performance of speech-aware LLMs, the work could contribute to advancements in various applications, including voice assistants, transcription services, and accessibility technologies. The exploration of encoder fusion also opens avenues for future research in multimodal machine learning, potentially leading to more sophisticated models that can integrate multiple modalities seamlessly.
Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that, unlike standard prompting or audio-based steering, this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
Primary: National Taiwan University
All Institutions: National Taiwan University
The paper presents a novel method for manipulating attention in Large Audio-Language Models through instruction-based steering, demonstrating significant improvements in temporal localization tasks. This work contributes to the field by enhancing model interpretability and providing a framework for future research in audio understanding and multimodal learning.
The paper introduces a novel approach called instruction-based vector steering, which contrasts activations from differently instructed prompts while keeping the audio fixed. This methodology is innovative as it diverges from traditional audio-based steering methods, focusing instead on how prompts can redirect attention within the model. The systematic analysis of attention patterns across different conditions is well-structured, providing a clear understanding of how the proposed method alters the model's behavior. The mathematical formulation of the steering vector and the attention analysis is rigorous, although the paper could benefit from more detailed explanations of the equations presented.
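The core mechanics described above can be illustrated with a toy sketch: a steering vector formed as the difference between activations under two instructions, its addition to a hidden state, and a read-out of the audio position whose attention changes most. Everything here is a simplification under stated assumptions: activations are plain lists, the scaling factor `alpha` is hypothetical, and "maximal change" is taken to mean the largest increase. The paper's exact layer choice, aggregation, and read-out rule may differ.

```python
def steering_vector(acts_instructed, acts_neutral):
    """Contrast activations from two prompts that share the same audio."""
    return [a - b for a, b in zip(acts_instructed, acts_neutral)]

def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

def peak_attention_shift(attn_before, attn_after):
    """Index of the audio token whose attention increased the most."""
    deltas = [after - before for before, after in zip(attn_before, attn_after)]
    return max(range(len(deltas)), key=deltas.__getitem__)

# Hypothetical activations and attention weights over four audio tokens
vec = steering_vector([1.0, 2.0], [0.5, 0.5])
steered = apply_steering([0.0, 0.0], vec, alpha=2.0)
position = peak_attention_shift([0.25, 0.25, 0.25, 0.25],
                                [0.1, 0.2, 0.6, 0.1])
```

In the toy data, attention concentrates on the third audio token after steering, so the read-out returns index 2. Mapping that index back to a time interval would depend on the model's audio token rate.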
The experiments are comprehensive, utilizing a controlled localization benchmark that isolates the effects of the proposed steering method. The results demonstrate significant improvements in attention allocation and localization accuracy compared to baselines. The paper presents a clear and logical progression from hypothesis to experimental validation, with quantitative metrics that substantiate the claims made. However, the reliance on two specific models (Qwen2-Audio and Audio Flamingo 3) limits the generalizability of the findings. More diverse datasets and models could enhance the robustness of the conclusions.
The paper lacks sufficient implementation details that would facilitate reproducibility. While the methodology is described, specific hyperparameters, training conditions, and the exact nature of the datasets used in experiments are not fully disclosed. Providing a supplementary material or a GitHub repository with code and data would greatly enhance the reproducibility of the results.
One limitation is the focus on only two models, which may not represent the broader landscape of LALMs. Additionally, the approach's effectiveness in more complex audio scenarios with overlapping sounds or varying acoustic conditions remains untested. The paper also does not address potential biases in the datasets used for evaluation, which could affect the generalization of the results.
The findings have significant implications for the interpretability of LALMs, potentially leading to improved audio understanding systems that can be applied in various domains, including human-computer interaction, audio search engines, and assistive technologies. The ability to probe and manipulate model attention could enhance the development of more robust audio processing systems, making them more reliable in real-world applications.
Passive acoustic monitoring enables large-scale observation of wildlife, but most bioacoustic classifiers only predict species presence in a time window without localizing vocalizations precisely in time or frequency, limiting downstream analyses. We formulate bird vocalization detection as an object detection task on spectrograms and train YOLO11 models to localize bird calls in dense tropical soundscapes from Singapore. We additionally introduce an open-source browser-based annotation tool and propose Intersection over Minimum (IoMin), an evaluation metric that better handles ambiguous acoustic boundaries than standard IoU and is better suited to the problem at hand. The best YOLO model nearly doubles baseline performance on in-distribution soundscapes from Singapore (81.8% vs. 42.1% IoMin@50 F1-score) while still outperforming the baseline on unseen out-of-distribution recordings from Hawaii (58.6% vs. 48.6%). These results suggest that object detection frameworks are a promising approach to time-frequency localization of animal vocalizations in complex soundscapes.
Primary: National University of Singapore
All Institutions: National University of Singapore, Tropical Marine Science Institute, Dept. of Electrical and Computer Engineering, School of Computing
The paper presents a YOLO-based method for detecting and localizing bird vocalizations in dense soundscapes, significantly advancing the field of bioacoustic monitoring. The technical contributions, including a new evaluation metric and an open-source annotation tool, enhance the practical applicability of machine learning in ecological research.
The paper proposes a novel approach to bird vocalization detection by framing it as an object detection task using YOLO models on spectrograms. The methodology is well-structured, detailing the transformation of audio data into spectrograms, the training of YOLO models, and the development of an innovative annotation tool (BirdWatch) that enhances the labeling process. The introduction of the Intersection over Minimum (IoMin) metric is a significant contribution, addressing the inherent ambiguities in acoustic signal annotations, which is a common challenge in bioacoustic research.
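To show why IoMin can be more forgiving than IoU when annotation boundaries are ambiguous, the sketch below compares the two on a pair of time-frequency boxes. It assumes IoMin is the intersection area divided by the area of the smaller box, which is the natural reading of the name; the paper's exact definition should be checked before relying on this. Boxes are given as `(t_start, f_low, t_end, f_high)` in arbitrary units.

```python
def box_area(box):
    t0, f0, t1, f1 = box
    return max(0.0, t1 - t0) * max(0.0, f1 - f0)

def intersection_area(a, b):
    # Overlapping rectangle; box_area clamps to zero when the boxes are disjoint
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)

def iou(a, b):
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union else 0.0

def iomin(a, b):
    """Intersection over the smaller box's area (assumed definition)."""
    smaller = min(box_area(a), box_area(b))
    return intersection_area(a, b) / smaller if smaller else 0.0

# A loose annotation and a tight prediction around the same call (made-up values)
annotation = (0.0, 0.0, 4.0, 4.0)
prediction = (1.0, 1.0, 3.0, 3.0)
iou_score = iou(annotation, prediction)
iomin_score = iomin(annotation, prediction)
```

The tight prediction lies entirely inside the loose annotation, so IoU is only 0.25 while IoMin is 1.0. Under an IoMin@50 criterion this detection would count as a match, whereas an IoU threshold of 0.5 would reject it.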
The experiments are robust, utilizing two distinct datasets (Singapore and Hawaii) to evaluate model performance both in-distribution and out-of-distribution. The results demonstrate a clear improvement over baseline methods, with the best YOLO model achieving an IoMin@50 F1-score of 81.8% on in-distribution data, indicating strong performance. The comparative analysis against the energy-based TFE detector further substantiates the effectiveness of the proposed approach.
The paper provides sufficient details regarding the implementation of the YOLO models, including training parameters and data preprocessing steps. The open-source nature of the BirdWatch annotation tool and the availability of the code on GitHub enhance the reproducibility of the research, allowing other researchers to replicate the study or build upon it.
The authors acknowledge limitations related to the small size and geographic specificity of the training dataset, which may affect the generalization of the model to other regions and species. Additionally, the reliance on YOLO's input representation may impose constraints on performance, suggesting that alternative representations could be explored in future work.
This research has significant implications for wildlife monitoring and conservation efforts, enabling more precise localization of bird calls in complex soundscapes. The methodology can be extended to other species and environments, potentially aiding in biodiversity assessments and ecological studies. The open-source tools developed can facilitate further research in bioacoustics and machine learning applications in ecology.
Speech recognition often fails on rare, domain-specific terms and context-related named entities. Existing contextualization techniques typically bias decoding with keywords or phrase lists, which does not scale well or exploit deeper knowledge. We propose a training method that teaches a speech-LLM to use broad descriptions (e.g. from videos) as weak semantic priors to perform contextual reasoning grounded in the audio. We build 400 hours of reasoning-augmented speech data by pairing erroneous hypotheses with video metadata and LLM-generated reasoning explanations that justify context-driven corrections. We finetune the speech-LLM to perform chain-of-thought reasoning: generate an initial transcript, then reason over the context, and finally return a corrected transcript. On held-out YouTube-derived test sets, our approach reduces errors, with specific improvements on rare words and named entities, and lays groundwork for deeper contextual reasoning in speech recognition.
Primary: KU Leuven
All Institutions: KU Leuven, Department Electrical Engineering ESAT-PSI
The paper presents a significant advancement in contextual reasoning for ASR by integrating broad descriptions from video metadata, demonstrating improved performance in recognizing rare words and named entities. This work contributes to the ongoing evolution of ASR technologies, paving the way for more robust and context-aware systems in the future.
The paper introduces a novel training methodology that leverages broad contextual descriptions from video metadata to enhance the performance of speech-LLMs in automatic speech recognition (ASR). The two-stage reasoning process is well-structured, allowing the model to generate an initial transcript, engage in contextual reasoning, and produce a corrected output. The use of a large dataset (400 hours) and the integration of reasoning chains generated by LLMs demonstrate a thoughtful approach to addressing the limitations of existing ASR systems. However, the reliance on LLM-generated reasoning chains could introduce variability in the quality of the reasoning, which may affect overall performance.
The experiments are comprehensive, utilizing multiple datasets and comparing the proposed method against various baselines. The results indicate significant improvements in word error rates (WER), particularly for rare words and named entities, which are critical areas of concern in ASR. The evaluation methodology is robust, but the paper would benefit from additional qualitative assessments of the generated transcripts to complement the quantitative metrics.
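For readers unfamiliar with the metric, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The following is a generic sketch of that standard definition, not the paper's evaluation script; text normalization (casing, punctuation) is left out and would need to match whatever the authors used.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

A rare-word or named-entity error rate can be computed the same way by restricting the count to a chosen subset of reference words, although how the paper defines those subsets is not stated here.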
The paper provides sufficient details on the methodology and dataset creation process, including the use of specific models and parameters for finetuning. However, the reproducibility could be enhanced by providing access to the trained models and clearer instructions on the setup for experiments. The open-sourcing of the dataset is a positive aspect that aids reproducibility.
One limitation is the potential variability in the quality of the reasoning chains generated by LLMs, which may not always align with the audio context. Additionally, the dataset is derived from YouTube videos, which may introduce biases based on the nature of the content available. The model's performance on shorter segments may also be limited due to the broad nature of the contextual descriptions.
The proposed method has significant implications for improving ASR systems, particularly in domains with specialized terminology and named entities. The approach could be extended to other applications, such as enhancing accessibility for individuals with hearing impairments or improving the accuracy of voice-activated systems. Furthermore, the methodology could inspire future research in multimodal learning and reasoning in audio contexts.
We present ViP-VL, an efficient Vietnamese Self-supervised speech Pretraining model leveraging Vector-quantization Learning. To bridge the gap between high-resolution audio and efficient processing, ViP-VL incorporates Acoustic Stacking and Receptive Field Alignment to enable a synchronized 8x subsampling rate within the ChunkFormer architecture, while further enhancing representation robustness through a specialized Mask Selection Strategy during pretraining on the BEST-RQ framework. Pretrained on 17,000 hours of unlabeled Vietnamese speech, our model establishes new state-of-the-art results across four major downstream tasks: Automatic Speech Recognition, Speech Emotion Recognition, Dialect Classification, and Speaker Verification. To facilitate future research and the development of high-performance Vietnamese speech technologies, we publicly release our pretrained weights and implementation at github.com/khanld/chunkformer.
Primary: VinUniversity
All Institutions: VinUniversity, UNEY
The main contribution of this paper is the introduction of ViP-VL, a self-supervised speech pretraining model tailored for Vietnamese, which achieves state-of-the-art performance across multiple tasks while addressing key challenges in computational efficiency and representation robustness. This work not only advances the state of the art in speech processing for low-resource languages but also sets a foundation for future research in self-supervised learning methodologies.
The methodology presented in ViP-VL is innovative, particularly in its integration of Vector-Quantization Learning with the BEST-RQ framework and ChunkFormer architecture. The authors effectively address the challenges of self-supervised learning in low-resource languages by introducing a novel masking strategy and a precise synchronization between the masking manifold and the encoder's subsampling rate. The detailed exploration of Acoustic Stacking and Receptive Field Alignment demonstrates a strong understanding of the underlying mechanics of speech processing, enhancing the robustness of the learned representations.
The experimental evaluation is comprehensive, with the model pretrained on a substantial dataset of 17,000 hours of Vietnamese speech, which is a significant contribution to the field of low-resource language processing. The results across four downstream tasks (Automatic Speech Recognition, Speech Emotion Recognition, Dialect Classification, and Speaker Verification) show that ViP-VL achieves state-of-the-art performance. The use of rigorous evaluation metrics and a clear comparison with existing models strengthens the credibility of the findings.
The paper provides sufficient implementation details, including the architecture specifications, training protocols, and hyperparameters, which are critical for reproducibility. The authors also commit to releasing their pretrained weights and implementation, which is a positive step towards enabling other researchers to replicate and build upon their work.
While the paper presents significant advancements, it does not address potential limitations in terms of the model's performance across diverse acoustic environments or its adaptability to other low-resource languages. Additionally, the computational requirements for pretraining, although optimized, may still pose challenges for researchers with limited resources.
The development of ViP-VL has the potential to significantly impact the field of speech technology for Vietnamese and other low-resource languages. By providing a high-performance, publicly accessible model, the authors contribute to bridging the gap in speech technology for underserved linguistic communities, fostering further research and development in this area.
Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.
Primary: Brno University of Technology
All Institutions: Brno University of Technology
The paper presents a novel approach to explainability in deepfake speech detection, offering valuable insights into the decision-making processes of various detectors. The comprehensive methodology and rigorous experimental evaluation contribute significantly to the field, paving the way for future advancements in audio forensics and detector design.
The paper introduces an innovative audio-native explainability pipeline using Integrated Gradients (IG) to analyze deepfake speech detectors. The methodology is robust, leveraging self-supervised learning (SSL) representations and providing a structured annotation protocol for semantic analysis. The use of causal masking to validate findings adds rigor to the approach, allowing for a clear understanding of the decision-making processes of different detectors.
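As a reminder of what Integrated Gradients computes, the sketch below approximates the path integral with a midpoint Riemann sum on a toy model with a hand-written gradient. It does not use Captum or any of the detectors from the paper; `toy_model`, `toy_gradient`, and the weights are hypothetical stand-ins chosen so the result can be checked by hand.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Sequence[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate IG_i = (x_i - b_i) * integral_0^1 dF/dx_i(b + a * (x - b)) da."""
    summed = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        for i, g in enumerate(grads):
            summed[i] += g
    # Average gradient along the path, scaled by the input-baseline difference.
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, summed)]


# Toy differentiable "model": F(x) = sum_i w_i * x_i^2 (illustrative only).
WEIGHTS = [1.0, -2.0, 0.5]


def toy_model(x: Sequence[float]) -> float:
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_gradient(x: Sequence[float]) -> List[float]:
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


attributions = integrated_gradients(toy_gradient, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

The attributions sum to `toy_model(x) - toy_model(baseline)`, which is the completeness property of Integrated Gradients. In the setting the review describes, the inputs would be time-aligned SSL frames, so aggregating attributions over the feature dimension gives one relevance value per frame.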
The experiments are well-structured, utilizing the ASVspoof 5 dataset and a representative subset of recordings to evaluate the detectors. The results reveal distinct cue reliance among the detectors, supported by qualitative annotations and quantitative performance metrics. The findings highlight the complementary nature of the detectors, suggesting potential for ensemble methods.
The paper provides sufficient implementation details, including the use of the Captum library for IG calculations and a clear description of the annotation protocol. The availability of a GitHub repository enhances reproducibility, allowing other researchers to replicate the study and build upon the findings.
One limitation is the focus on SSL-based models, which may restrict the generalizability of the findings to other architectures. Additionally, the reliance on human annotations, while valuable, introduces subjectivity that could affect the consistency of the results. The shared vulnerability to audio compression across detectors also suggests a potential weakness in their robustness.
The findings have significant implications for audio forensics and the development of deepfake detection systems. By providing insights into the decision-making processes of detectors, the work can inform the design of more effective systems and enhance the interpretability of AI in critical applications. The potential for ensemble methods could lead to advancements in the field, improving detection accuracy and reliability.
Automatic speech recognition systems have been shown to underperform when transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms before becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.
Primary: Priberam Labs
All Institutions: Priberam Labs, Instituto de Telecomunicações, Instituto Superior Técnico
This paper presents a novel approach to open-vocabulary keyword spotting that significantly reduces memory usage and latency while maintaining performance. The technical contributions and methodology are well-articulated, addressing a critical challenge in automatic speech recognition systems, particularly in specialized domains.
The methodology proposed in this paper is innovative, focusing on embedding compression for keyword spotting in automatic speech recognition systems. The authors introduce a multi-faceted approach to reduce the memory footprint and latency of the system while maintaining performance, which is a significant advancement over existing methods. The use of an automated layer selection process and a lightweight compression mechanism demonstrates a thoughtful integration of techniques that address the limitations of previous systems.
The experiments are well-structured, utilizing a diverse set of datasets across multiple languages and contexts, which enhances the generalizability of the findings. The results indicate a clear improvement in efficiency (latency and memory usage) without sacrificing performance, showcasing the effectiveness of the proposed system. However, the evaluation could benefit from more extensive comparisons with additional baselines to further validate the claims.
The paper provides a link to the source code, which is a positive aspect for reproducibility. However, the details regarding hyperparameter settings and training procedures could be more explicit to ensure that other researchers can replicate the results accurately.
One limitation noted is the potential for hallucinations in the ASR output when using contextual biasing, particularly with uncurated glossaries. Additionally, the performance on the internal dataset was less favorable, indicating that further refinement is needed for real-world applications.
The proposed system has significant implications for industries relying on accurate speech recognition, such as healthcare and customer service, where specialized terminology is frequently used. By enabling efficient processing of large glossaries, this work could enhance the applicability of ASR systems in various domains.
Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
Primary: Stellenbosch University
All Institutions: Stellenbosch University
The main contribution of this paper is the demonstration that graph-based clustering methods significantly outperform traditional centre-based approaches in recovering the Zipfian distribution in unsupervised term discovery. This work provides valuable insights into the clustering methodologies applicable in speech processing and highlights the importance of choosing appropriate clustering techniques to improve lexicon induction.
The paper presents a well-structured methodology for unsupervised term discovery, emphasizing the limitations of traditional centre-based clustering methods like K-means in recovering the Zipfian distribution of lexicons. The authors propose a novel graph-based clustering approach using the Leiden algorithm, which allows for more flexible and interpretable control over clustering outcomes through hyperparameters. The methodology is sound, with clear definitions of segmentation and clustering processes, and it effectively contrasts the performance of various clustering techniques across different languages and segmentation conditions.
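One simple way to see whether an induced lexicon is Zipf-like is to fit a line to log cluster size against log rank: a slope near -1 suggests a Zipfian rank-frequency curve, while a slope near 0 indicates uniformly sized clusters. The sketch below is only an illustrative diagnostic under that assumption; it is not one of the metrics used in the paper, and `zipf_slope` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def zipf_slope(cluster_sizes: Sequence[float]) -> float:
    """Least-squares slope of log(size) versus log(rank) over non-empty clusters."""
    sizes: List[float] = sorted((s for s in cluster_sizes if s > 0), reverse=True)
    if len(sizes) < 2:
        raise ValueError("need at least two non-empty clusters")

    xs = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    ys = [math.log(size) for size in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Idealized examples: a perfect power law versus equally sized clusters.
zipfian_sizes = [1000.0 / rank for rank in range(1, 51)]
uniform_sizes = [20.0] * 50
```

Applied to the cluster sizes produced by each method, a check like this would make the contrast between centre-based and graph-based lexicons easy to inspect, though real distributions will be noisier than the idealized inputs above.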
The experimental setup is robust, utilizing three languages (English, Afrikaans, and French) and multiple segmentation conditions (ground-truth words, ground-truth syllables, and unsupervised syllabic segments). The results demonstrate a consistent advantage of bottom-up methods over centre-based approaches, with clear metrics (NED, NES, iNES) used to evaluate performance. The findings are well-supported by quantitative data and visualizations that illustrate the differences in induced lexicon structures.
The paper provides sufficient implementation details, including the use of specific libraries (e.g., igraph, FAISS, scikit-learn) and the configuration of parameters for each clustering method. However, the lack of a formalized reproducibility section and the absence of a demo URL may hinder complete reproducibility for external researchers.
The study primarily focuses on the clustering aspect of term discovery, potentially overlooking other factors that could influence lexicon quality, such as the segmentation process itself. Additionally, while the authors claim computational efficiency for graph clustering, the overall scalability of the proposed methods remains a concern, especially for larger datasets.
The findings have significant implications for the field of unsupervised learning and speech processing, particularly in low-resource settings where labeled data is scarce. By challenging the dominance of traditional clustering methods, this work encourages further exploration of graph-based approaches, which could lead to advancements in language acquisition models and speech technologies.
Speech Emotion Recognition (SER) aims to identify a speaker's emotional state from audio signals. While recent advances in deep learning have significantly improved SER performance in Indo-European languages, Arabic SER remains underexplored and challenging due to dialectal diversity, limited annotated datasets, and the difficulty of modeling both local spectral cues and long-range temporal dependencies. To address these limitations, this study investigates whether hybrid architectures that jointly model spatial and contextual information can improve emotion recognition in Arabic speech. We propose and evaluate a comparative framework involving three architectures: a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. The first two models leverage MFCC and spectrogram-based representations, while wav2vec 2.0 operates directly on raw audio through self-supervised representations. Experiments conducted on the EYASE and BAVED datasets demonstrate that the proposed CNN-Transformer architecture significantly outperforms the other models, achieving an accuracy of 98.1 percent. This result highlights the effectiveness of combining convolutional feature extraction with Transformer-based global context modeling. The main contribution of this work lies in providing a systematic comparison of hybrid and self-supervised approaches for Arabic SER, and in demonstrating that CNN-Transformer architectures offer a robust solution for capturing both spectral and long-range dependencies in low-resource and dialectally diverse settings.
Primary: Université des Sciences et de la Technologie d'Oran Mohamed Boudiaf (USTO-MB)
All Institutions: Université des Sciences et de la Technologie d'Oran Mohamed Boudiaf (USTO-MB)
This work systematically investigates hybrid architectures for Arabic Speech Emotion Recognition, demonstrating that the CNN-Transformer model effectively captures both local and long-range emotional dependencies. The study's contributions are significant, addressing a gap in the literature and providing a robust framework for future research in this domain.
The paper presents a systematic comparison of three distinct deep learning architectures for Arabic Speech Emotion Recognition (SER): CNN-LSTM, CNN-Transformer, and wav2vec 2.0. The methodology is well-structured, focusing on the integration of local spectral feature extraction through CNNs with the global context modeling capabilities of Transformers. The use of a unified experimental framework ensures consistency in preprocessing and evaluation, which is a strength. However, while the hybrid approach is innovative, the individual contributions of each model could be explored in greater depth, particularly regarding their specific advantages in different contexts.
The experiments are comprehensive, utilizing two datasets (EYASE and BAVED) that reflect both spontaneous and controlled speech. The results demonstrate that the CNN-Transformer model significantly outperforms the other architectures, achieving an accuracy of 98.1%. The evaluation metrics are appropriate, including accuracy, precision, recall, and F1-score, which provide a well-rounded assessment of model performance. However, the paper could benefit from a more detailed analysis of misclassifications and error patterns to further elucidate the strengths and weaknesses of the proposed models.
The paper provides sufficient details regarding the experimental setup, including data preprocessing, training protocols, and evaluation metrics, which enhances reproducibility. The use of standard libraries (PyTorch and Hugging Face Transformers) is a positive aspect, as it facilitates implementation. However, the absence of a public code repository or demo URL limits accessibility for other researchers wishing to replicate the study.
The study acknowledges limitations, such as the reliance on a limited number of datasets, which may affect the generalizability of the findings. Additionally, while the CNN-Transformer model shows promise, its performance in highly noisy or conversational environments remains untested. The paper also does not explore cross-corpus evaluations, which are crucial for assessing the robustness of SER systems across different dialects and contexts.
The findings of this research have significant implications for applications in human-computer interaction, particularly in enhancing the emotional intelligence of virtual assistants and improving mental health monitoring systems. The focus on Arabic SER contributes to a critical area of research that has been underexplored, potentially leading to advancements in technology that can better understand and respond to diverse emotional expressions in speech.
Speech-to-text (S2T) systems for recognition (ASR) and translation (S2TT) typically generate discrete text tokens. In contrast, continuous-target language modelling performs generation in a continuous space, yet its potential for S2T remains unexplored. To bridge this gap, we propose ELF-S2T, an audio-conditioned continuous-target generative model for S2T. Built upon the pre-trained Embedded Language Flows (ELF) backbone, ELF-S2T processes speech via a frozen Whisper encoder and a single linear projector, prepending the resulting audio condition to the noisy text latent for in-context, flow-matching denoising. To prevent the model from over-relying on its pre-trained text context, we introduce audio forcing during training, and further amplify the audio condition via classifier-free guidance at inference. Experiments on LibriSpeech and CoVoST2 show that ELF-S2T achieves competitive ASR and S2TT performance. Crucially, our error analysis reveals that, although ASR and S2TT errors look very different on the surface, both stem from the same underlying cause: close-distance confusion in the continuous latent space. This finding naturally aligns with the continuous representation generation paradigm, indicating a common semantic mapping process beneath recognition and translation. Our code and pretrained models are publicly available at https://github.com/Sslnon/ELF-S2T.
Primary: unknown
All Institutions: unknown
The paper presents ELF-S2T, an innovative approach to speech-to-text generation that utilizes continuous-target language modeling to enhance performance and understanding of error patterns in ASR and S2TT tasks. The methodology is well-structured, and the results indicate significant advancements in the field, though there are limitations regarding language diversity and computational efficiency that warrant further exploration.
The methodology presented in this paper is innovative in its approach to speech-to-text generation by leveraging continuous-target language modeling. The authors propose ELF-S2T, which integrates a frozen Whisper encoder with a flow-matching generative model, allowing for audio-conditioned text generation in a continuous space. The introduction of audio forcing and classifier-free guidance during training and inference is a significant methodological advancement that addresses the challenge of over-reliance on text context. The architecture is well-defined, and the use of a single linear projector to connect the audio condition to the text latent space is a clever design choice that enhances the model's performance.
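The role of classifier-free guidance at inference can be sketched as follows. This assumes the common formulation in which the guided prediction extrapolates from the unconditional output toward the audio-conditioned one, followed by a plain Euler step of the flow; the exact guidance form, scale, and sampler used by ELF-S2T are not specified here, and both function names are hypothetical.

```python
from typing import List, Sequence


def cfg_velocity(
    v_cond: Sequence[float],
    v_uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Guided velocity: v_uncond + scale * (v_cond - v_uncond).

    scale = 0 ignores the audio condition, scale = 1 recovers the purely
    conditional prediction, and scale > 1 amplifies the audio condition.
    """
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def euler_step(x: Sequence[float], velocity: Sequence[float], dt: float) -> List[float]:
    """One explicit Euler update of the latent along the predicted velocity."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]


# Toy two-dimensional latent and velocities (illustrative values only).
latent = [0.0, 0.0]
guided = cfg_velocity(v_cond=[1.0, 2.0], v_uncond=[0.0, 1.0], scale=2.0)
next_latent = euler_step(latent, guided, dt=0.5)
```

Because each denoising step needs both a conditional and an unconditional forward pass, guidance of this kind roughly doubles the per-step cost, which is consistent with the computational expense of iterative inference noted among the limitations.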
The experiments conducted on the LibriSpeech and CoVoST2 datasets demonstrate the effectiveness of the proposed ELF-S2T model. The results show competitive performance in both ASR and S2TT tasks, with a notable improvement in BLEU and chrF scores compared to existing models. The error analysis provides valuable insights into the nature of errors in ASR and S2TT, revealing a common underlying cause related to the continuous latent space. The experiments are comprehensive and well-structured, showcasing the model's capabilities across different tasks.
The paper includes sufficient details regarding the training setup, model architecture, and evaluation metrics, which aids in reproducibility. The authors have made their code and pretrained models publicly available, further enhancing the potential for other researchers to replicate their work. However, the lack of detailed information about the specific configurations used for hyperparameters and training schedules could pose challenges for full reproducibility.
The paper acknowledges several limitations, including the performance gap compared to the autoregressive Whisper model and the computational expense associated with the iterative inference process. Additionally, the evaluation is limited to English recognition and German-English translation, leaving out other languages and noisy speech scenarios. These limitations suggest areas for future research and improvement.
The ELF-S2T model has the potential to significantly advance the field of speech recognition and translation by providing a more natural and efficient method for generating text from speech. Its continuous-target approach could lead to improvements in various applications, including real-time translation services, accessibility tools for the hearing impaired, and enhanced voice assistants. The findings regarding the common semantic mapping process between ASR and S2TT could also inspire further research into unified models for multimodal tasks.
The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
Primary: Tsinghua University
All Institutions: Tsinghua University, Beijing Haitian Ruisheng Science Technology Ltd, Nexdata
The paper presents a significant advancement in multilingual automatic speech recognition by integrating a Mixture of Experts architecture and a dynamic downsampling mechanism, addressing critical challenges in modality alignment and cross-lingual adaptability. The methodology is well-structured, and the experimental results indicate a strong potential for real-world applications, although further validation across diverse datasets is needed.
The paper introduces a novel projector-based LLM-ASR framework that integrates a Mixture of Experts (MoE) architecture and a Continuous Integrate-and-Fire (CIF) mechanism. The MoE architecture allows for dynamic routing of inputs to specialized experts, enhancing multilingual adaptability, while the CIF mechanism provides a dynamic downsampling approach that improves modality alignment. This combination addresses critical challenges in multilingual ASR, showcasing a thoughtful approach to leveraging the strengths of LLMs in the context of speech recognition.
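The dynamic downsampling performed by CIF can be illustrated with a minimal sketch. It assumes the standard integrate-and-fire formulation, in which per-frame weights accumulate until they reach a threshold of 1.0 and an integrated vector is then emitted. The function name `cif`, its arguments, and the toy inputs are illustrative assumptions and are not taken from the paper, which may predict and scale the weights differently.

```python
def cif(frames, alphas, threshold=1.0):
    """Continuous Integrate-and-Fire over a sequence of frame vectors.

    frames: list of equal-length lists of floats (e.g. encoder outputs)
    alphas: list of non-negative per-frame weights
    Returns the list of fired (integrated) vectors.
    """
    dim = len(frames[0])
    fired = []
    acc_weight = 0.0          # weight accumulated since the last firing
    acc_vec = [0.0] * dim     # weighted sum of frames since the last firing
    for frame, alpha in zip(frames, alphas):
        remaining = alpha
        # A frame may cross the threshold: split its weight at the boundary,
        # emit the integrated vector, and carry the rest over.
        while acc_weight + remaining >= threshold:
            used = threshold - acc_weight
            fired.append([a + used * f for a, f in zip(acc_vec, frame)])
            remaining -= used
            acc_weight = 0.0
            acc_vec = [0.0] * dim
        acc_weight += remaining
        acc_vec = [a + remaining * f for a, f in zip(acc_vec, frame)]
    return fired


# Four 1-dimensional frames whose weights sum to 2.0 produce two outputs.
outputs = cif([[1.0], [3.0], [2.0], [4.0]], [0.5, 0.5, 0.25, 0.75])
print(outputs)  # [[2.0], [3.5]]
```

The output length follows the summed weights rather than a fixed stride, which is what lets the speech sequence shrink toward the length of the text it is aligned with.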
The experimental setup is robust, utilizing a substantial multilingual dataset and comparing the proposed model against strong baseline models. The results demonstrate significant performance improvements in terms of word error rates (WER) across various datasets, indicating the effectiveness of the proposed enhancements. The paper provides clear metrics and comparisons, although further exploration of the model's performance across different languages could enrich the findings.
The paper includes sufficient implementation details, such as the architecture of the models used and the training setup, which aids in reproducibility. The use of an open-sourced codebase for the baseline model is a positive aspect, although the lack of access to the specific datasets used for training and evaluation may hinder full reproducibility.
One limitation is the reliance on a single dataset for training and evaluation, which may not fully capture the diversity of multilingual ASR challenges. Additionally, while the CIF mechanism shows promise, its performance in real-world applications with varying speech rates and accents remains to be thoroughly validated.
The proposed framework has the potential to significantly advance the field of multilingual ASR, making it more accessible and effective across diverse linguistic contexts. This could have implications for global communication technologies, accessibility tools, and language learning applications.
While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.
Primary: KAIST
All Institutions: KAIST
The main contribution of this paper is the introduction of MeCo, a one-step generative corrector that significantly improves multi-channel speech separation by integrating a novel MeanFlow-based approach with Data-Space Optimization, achieving state-of-the-art performance in both signal fidelity and human listening quality. The comprehensive analysis of the methodology, experiments, and implications highlights its significance in advancing the field of audio processing.
The proposed methodology introduces a novel MeanFlow-based one-step generative corrector (MeCo) that effectively addresses the limitations of existing discriminative models in multi-channel speech separation. By leveraging a conditional average velocity field and integrating Data-Space Optimization (DSO) with two complementary loss functions, the approach enhances both signal fidelity and human listening quality. The methodology is well-structured, with a clear explanation of how MeCo operates in the STFT domain and how it improves upon previous models like Fast-GeCo. The use of a one-step generation process is particularly innovative, as it reduces computational overhead while maintaining high performance.
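The signal-fidelity half of DSO is described as an SI-SDR loss on the endpoint. As a rough illustration of the quantity being optimized, the sketch below computes scale-invariant SDR for a single-channel waveform using the usual definition (project the estimate onto the reference, then compare target energy with residual energy). The function names, the mean removal, and the `eps` stabilizer are assumptions of this sketch; how MeCo applies the loss to its one-step output is not specified here.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for two equal-length signals."""
    # Remove the mean so a constant offset does not count as signal.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [e - est_mean for e in estimate]
    ref = [r - ref_mean for r in reference]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Everything in the estimate that is not the scaled target is distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_loss(estimate, reference):
    """Negative SI-SDR, so that minimizing the loss maximizes fidelity."""
    return -si_sdr(estimate, reference)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
print(round(si_sdr(estimate, reference), 3))
```

Because the estimate is projected onto the reference first, rescaling the estimate leaves the score unchanged, which is why the metric is called scale-invariant.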
The experimental evaluation is robust, with comprehensive testing on both in-domain and out-of-domain datasets, including WSJ0 + WHAM! and LibriSpeech + DEMAND. The results demonstrate that MeCo achieves state-of-the-art performance across various metrics, including PESQ, ESTOI, SI-SDR, DNSMOS, and UTMOS. The inclusion of an ablation study further strengthens the findings, providing insights into the contributions of each component of the DSO. The paper successfully shows that MeCo outperforms existing models, particularly in human listening quality assessments, which is crucial for practical applications.
The implementation details are well-documented, with specifics on the training process, model configurations, and evaluation metrics. The authors provide a link to their GitHub repository, which enhances reproducibility. However, the paper could benefit from more detailed instructions on how to replicate the experiments, including data preprocessing steps and specific hyperparameter settings.
One identified limitation is that MeCo's independent speaker refinement relies on channel-wise concatenation, which may not be optimal in complex acoustic environments with overlapping speakers. Additionally, while the results are promising, the generalization to completely unfamiliar linguistic environments could be further explored with more diverse datasets.
The implications of this research extend to various applications in speech processing, particularly in enhancing communication systems, hearing aids, and voice recognition technologies. By improving the quality of multi-channel speech separation, the work has the potential to significantly enhance user experiences in real-world scenarios where clarity and naturalness of speech are paramount.
Spoken language identification (LID) for Indian languages is a challenging problem due to the large number of languages, significant phonetic overlap among related varieties, and the scarcity of labeled data for many low-resource languages. In this work, we present a systematic comparative study of two pre-trained speech encoders -- Whisper and FastConformer -- combined with a linear classifier for large-scale Indic LID spanning 42 languages across four linguistic families. We evaluate both encoders in frozen (linear probing) and fine-tuned settings, and compare three training objectives: cross-entropy (CE), supervised contrastive loss with cross-entropy (CE+SupCon), and hierarchical softmax (HSM). Models are trained on the Vaani dataset and evaluated in a cross-corpus setting on Vaani-Test (held-out), FLEURS, and Kathbath, providing insights into domain generalization. The frozen FastConformer encoder achieves over 90% macro accuracy on FLEURS and Kathbath without any task-specific adaptation, substantially outperforming Whisper on out-of-domain benchmarks, while fine-tuned Whisper yields stronger in-domain performance. HSM consistently outperforms CE and CE+SupCon for both encoders across all benchmarks, with the largest gains on out-of-domain test sets. CE+SupCon degrades FastConformer's cross-corpus generalization, suggesting that the contrastive objective over-specializes representations to in-domain conditions. Per-family analysis shows that Central Indo-Aryan varieties are the hardest to discriminate, with Hindi--Urdu and the Sadri--Chhattisgarhi--Surgujia cluster being the dominant confusion pairs.
Primary: AI & Robotics Technology Park (ARTPARK), IISc, Bangalore, India
All Institutions: AI & Robotics Technology Park (ARTPARK), IISc, Bangalore, India, Department of Electrical Engineering, Indian Institute of Science, Bangalore, India
The paper provides a systematic comparison of pre-trained speech encoders for large-scale Indic spoken language identification, revealing significant insights into the effectiveness of different training objectives and encoder architectures. The comprehensive evaluation and innovative methodology contribute meaningfully to the field of machine learning and audio processing, particularly in the context of low-resource languages.
The paper presents a systematic comparative study of two pre-trained speech encoders (Whisper and FastConformer) for spoken language identification (LID) in Indic languages. The methodology is robust, employing both frozen and fine-tuned settings for the encoders and comparing three training objectives (cross-entropy, supervised contrastive loss, and hierarchical softmax). The use of a large-scale dataset (Vaani) and cross-corpus evaluation enhances the validity of the findings. The hierarchical softmax approach is particularly innovative as it structures the language classification task in a way that reflects linguistic relationships, which is critical for LID tasks with high phonetic overlap.
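To make the hierarchical softmax idea concrete, here is a minimal sketch that assumes a two-level tree in which a language's probability factorizes as P(family) times P(language given family). The tree, the logit dictionaries, and the function names are illustrative assumptions; the paper's actual hierarchy and parameterization may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_probs(family_logits, language_logits, tree):
    """Return P(language) = P(family) * P(language | family) for every leaf.

    tree: dict mapping a family name to the list of its languages
    family_logits / language_logits: dicts mapping names to raw scores
    """
    families = list(tree)
    family_probs = softmax([family_logits[f] for f in families])
    probs = {}
    for family, p_family in zip(families, family_probs):
        languages = tree[family]
        # The second softmax only competes among siblings in the same family.
        within = softmax([language_logits[l] for l in languages])
        for language, p_within in zip(languages, within):
            probs[language] = p_family * p_within
    return probs


def hsm_loss(family_logits, language_logits, tree, target):
    """Negative log-likelihood of the target language under the tree."""
    return -math.log(hierarchical_probs(family_logits, language_logits, tree)[target])


# Hypothetical toy tree with uniform scores.
tree = {"indo_aryan": ["hindi", "urdu"], "dravidian": ["tamil"]}
family_logits = {"indo_aryan": 0.0, "dravidian": 0.0}
language_logits = {"hindi": 0.0, "urdu": 0.0, "tamil": 0.0}
print(hierarchical_probs(family_logits, language_logits, tree))
```

Under this factorization, confusing two sibling languages costs less than confusing languages from different families, since only the within-family term is wrong.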
The experiments are well-designed, utilizing a comprehensive dataset spanning 42 languages across four linguistic families. The evaluation metrics are appropriate, focusing on macro-averaged accuracy, which is crucial given the imbalanced nature of language representation in the dataset. The results demonstrate significant insights into the performance of the encoders and training objectives, with clear distinctions made between in-domain and out-of-domain performance. The findings regarding the performance of the FastConformer encoder in out-of-domain settings are particularly noteworthy and contribute to the understanding of generalization in LID tasks.
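The difference between macro-averaged and overall accuracy under class imbalance can be shown in a few lines. This is a generic sketch of the metric (the mean of per-class accuracies); the function name and the toy labels are illustrative and do not come from the paper's data.

```python
from collections import defaultdict


def macro_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# A classifier that always answers "hindi" on an imbalanced toy set.
y_true = ["hindi", "hindi", "hindi", "urdu"]
y_pred = ["hindi", "hindi", "hindi", "hindi"]
print(macro_accuracy(y_true, y_pred))  # 0.5, while overall accuracy is 0.75
```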
The paper provides sufficient implementation details, including model architectures, training procedures, and hyperparameter settings. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should consider releasing the models and code to facilitate further research and validation of the findings.
One limitation is the reliance on a single dataset (Vaani) for training, which may not capture the full diversity of acoustic conditions present in real-world scenarios. Additionally, while the hierarchical softmax approach shows promise, it may require careful tuning to optimize performance across all language families. The paper also notes that Central Indo-Aryan languages are particularly challenging, indicating a need for targeted strategies to improve identification accuracy in this subgroup.
This research has significant implications for multilingual speech processing applications, particularly in low-resource settings where labeled data is scarce. The insights gained from this study can inform the development of more effective language identification systems, enhancing accessibility and usability in voice-based technologies across diverse linguistic communities. The findings also contribute to the broader field of speech processing by highlighting the importance of encoder selection and training objectives in achieving robust performance in complex LID tasks.
While speech quality is typically assessed on complete utterances, streaming and generative systems require incremental estimation from partial audio. Existing predictors assume full context, degrading on prefix-constrained inputs. Extending ARECHO, we propose ANCHOR, reformulating incremental assessment as a multi-resolution autoregressive task. It models chunk- and utterance-level quality within a single decoder using dual-resolution tokens and a resolution-aware hierarchy for coarse-to-fine refinement. Experiments show substantial robustness under partial input, including a 48% PLCMOS error reduction on 2-second prefixes. Convergence analysis reveals a 4-6 s effective perceptual context horizon. A stress test further isolates structured extrapolation biases under localized corruption. Results demonstrate that hierarchical supervision improves incremental prediction and elucidates how perceptual quality accumulates over time.
Primary: University of Southern California
All Institutions: University of Southern California, Carnegie Mellon University
The paper presents ANCHOR, an innovative framework for joint chunk-level and full-utterance speech quality modeling, which significantly enhances the ability to assess speech quality in real-time scenarios. The comprehensive evaluation of the methodology, experimental results, and implications for the field underscores its potential impact on future speech processing technologies.
The methodology presented in the paper is innovative, introducing ANCHOR as a multi-resolution autoregressive framework that effectively addresses the limitations of existing speech quality assessment models by allowing for incremental predictions from partial audio inputs. The dual-resolution token approach and the resolution-aware decoding hierarchy are significant advancements that enhance the model's ability to handle localized distortions and provide more accurate quality assessments in real-time scenarios. The structured coarse-to-fine refinement process is particularly noteworthy, as it mitigates the common pitfalls of supervision conflict between local and global quality metrics.
The experimental evaluation is robust, utilizing a comprehensive dataset that includes a variety of speech types, and the results demonstrate substantial improvements in predictive accuracy under prefix-constrained conditions. The authors effectively show the advantages of their approach through quantitative metrics such as MAE, PCC, and SRCC, highlighting the model's performance against a strong baseline (ARECHO). The stress tests designed to isolate extrapolation behavior under controlled distortions further validate the model's robustness and its ability to maintain perceptual quality under challenging conditions.
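For readers less familiar with the three reported metrics, the following sketch computes MAE, PCC (Pearson), and SRCC (Spearman, as Pearson correlation of average ranks) in plain Python. It is a generic reference implementation with illustrative names and toy numbers, not the paper's evaluation code.

```python
import math


def mae(predicted, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def pearson(x, y):
    """Pearson correlation coefficient (PCC)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


predicted = [1.0, 4.0, 9.0, 16.0]
target = [1.0, 2.0, 3.0, 4.0]
print(mae(predicted, target), pearson(predicted, target), spearman(predicted, target))
```

In this toy case the predictions are far off in absolute terms and only imperfectly linear, yet perfectly ordered, which is why the three metrics are usually reported together.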
The paper provides sufficient detail regarding the experimental setup, including dataset descriptions, prefix construction methods, and hyperparameter settings. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. The authors mention the use of a pretrained model, which is a positive aspect, but without access to the implementation, full reproducibility remains a challenge.
One identified limitation is that ANCHOR relies on a non-causal frontend, which means it does not fully support streaming applications. Additionally, while the authors acknowledge the absence of formal component-wise ablations, the reliance on indirect evidence for the contributions of the decoding hierarchy could be seen as a weakness in establishing the model's effectiveness.
The potential applications of ANCHOR are significant, particularly in real-time communication systems where speech quality is critical, such as video conferencing and voice-over-IP services. The ability to provide accurate quality assessments incrementally could lead to improved user experiences and more efficient bandwidth usage in streaming applications. The findings could also influence future research directions in non-intrusive quality assessment and speech processing.
Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple--fuse--recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression despite copying all audio-specific modules unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, achieving a score of 74.9 on MMEB while obtaining 55.61 on the 30-task MAEB audio suite.
Primary: Tencent
All Institutions: Tencent, Tsinghua University
The paper presents Conan-embedding-v3, an innovative omni-modal embedding model that effectively addresses the challenges of cross-modal retrieval through a decoupled training and fusion approach. The identification of Projector Drift and the subsequent recovery strategies represent meaningful advancements in the field of multi-modal machine learning.
The methodology presented in this paper is innovative, particularly in its Decoupled Specialist Fusion approach, which allows for independent training of modality specialists before fusing them into a single model. The identification of Projector Drift as a significant issue when merging modalities is a notable contribution, and the proposed Projector Recovery method effectively addresses this problem. The multi-modal rehearsal phase further enhances the model's robustness. However, the complexity of the methodology may pose challenges for implementation and understanding.
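Fusing task vectors can be pictured with plain task arithmetic: each specialist contributes its parameter difference from a shared base, and the weighted differences are added back to the base. The sketch below assumes exactly that rule, with parameters stored as lists of floats; the function name, the weights, and the toy parameters are hypothetical, and the merging rule actually used by Conan-embedding-v3 may be more elaborate.

```python
def fuse_task_vectors(base, specialists, weights):
    """Merge specialists into one backbone: base + sum_i w_i * (specialist_i - base).

    base: dict mapping parameter names to lists of floats
    specialists: list of dicts with the same keys and shapes as base
    weights: one scaling coefficient per specialist
    """
    fused = {}
    for name, base_param in base.items():
        merged = list(base_param)
        for specialist, weight in zip(specialists, weights):
            for i, (spec_value, base_value) in enumerate(zip(specialist[name], base_param)):
                # The task vector is the specialist's offset from the shared base.
                merged[i] += weight * (spec_value - base_value)
        fused[name] = merged
    return fused


base = {"w": [1.0, 1.0]}
image_specialist = {"w": [2.0, 1.0]}
video_specialist = {"w": [1.0, 3.0]}
print(fuse_task_vectors(base, [image_specialist, video_specialist], [1.0, 0.5]))
# {'w': [2.0, 2.0]}
```

The sketch also hints at why Projector Drift arises: a projector trained against one specialist's backbone weights sees different weights after fusion, even though the projector itself is copied unchanged.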
The experimental evaluation is thorough, utilizing robust benchmarks (MMEB and MAEB) to assess the model's performance across multiple modalities. The results indicate that Conan-embedding-v3 achieves competitive scores, particularly in audio retrieval, while maintaining strong performance in visual tasks. The ablation studies provide valuable insights into the contributions of each component of the methodology, reinforcing the effectiveness of the proposed approach.
The paper provides detailed descriptions of the training process, data sources, and evaluation metrics, which aids in reproducibility. However, the lack of publicly available code or a demo URL limits the ease with which other researchers can replicate the results.
The paper acknowledges limitations such as the incomplete closure of Projector Drift and the reliance on empirical searches for recovery configurations. Additionally, the potential interactions of multiple projector-drift effects when adding more modalities are not fully explored, which could impact the scalability of the approach.
The proposed model has significant implications for omni-modal retrieval systems, potentially enhancing applications in multimedia search engines, content recommendation systems, and cross-modal understanding tasks. The modular design allows for easier integration of new modalities, which could lead to more versatile AI systems.
Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
Primary: University of the Basque Country (UPV/EHU)
All Institutions: University of the Basque Country (UPV/EHU), Universitat Politècnica de València (UPV)
The main contribution of this paper is the development of a masked multimodal speech synthesis framework that effectively integrates sEMG and lipreading signals, demonstrating significant improvements in robustness and performance for silent speech interfaces. This work represents a meaningful advancement in the field of assistive technologies, particularly for individuals with speech impairments.
The proposed methodology integrates sEMG and lipreading through a masked multimodal framework, which is innovative in its approach to enhance robustness against modality degradation. The use of temporal adaptive masking during training is a significant methodological contribution, as it encourages the model to learn complementary representations from both modalities. The dual-stream architecture and structured cross-modal masking are well-justified and effectively implemented. The paper also provides a clear problem formulation and detailed descriptions of the encoding and fusion processes, which are critical for understanding the model's operation.
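A simplified view of modality masking during training is sketched below: with some probability, one of the two input streams is replaced by zeros for the whole utterance, and the two streams are never masked together. This is a coarse whole-modality variant written under stated assumptions; the paper's temporal adaptive masking operates at a finer granularity, and the function name, the `p_mask` parameter, and the toy features are hypothetical.

```python
import random


def mask_one_modality(emg, video, p_mask=0.5, rng=random):
    """Return (emg, video) with at most one stream replaced by zeros.

    emg, video: lists of per-frame feature lists
    p_mask: total probability of masking, split evenly between the two streams
    rng: object with a random() method, e.g. random.Random(seed)
    """
    draw = rng.random()
    if draw < p_mask / 2:
        emg = [[0.0] * len(frame) for frame in emg]
    elif draw < p_mask:
        video = [[0.0] * len(frame) for frame in video]
    # Otherwise both streams are passed through untouched.
    return emg, video


rng = random.Random(0)
emg_features = [[0.3, 0.7], [0.2, 0.9]]
video_features = [[1.0], [2.0]]
for _ in range(3):
    print(mask_one_modality(emg_features, video_features, p_mask=0.5, rng=rng))
```

Training on such examples forces the model to produce speech from either stream alone, which is the intuition behind robustness to temporary sensor failure.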
The experiments are comprehensive, utilizing a well-structured dataset that includes both laryngeal and laryngectomized speakers. The results demonstrate significant improvements in word error rates and intelligibility when using the proposed multimodal approach compared to unimodal baselines. The phone-level analyses provide valuable insights into the contributions of each modality, highlighting the strengths and weaknesses of the system across different phonetic classes. The robustness tests under low-bitrate conditions further validate the practical applicability of the approach.
The paper includes sufficient implementation details, including architecture specifications, training procedures, and preprocessing steps, which enhance reproducibility. However, the absence of a publicly available code repository limits the ease of replication for other researchers. The dataset is mentioned to be publicly accessible, which is a positive aspect for reproducibility.
One limitation noted is the challenge of adapting the system to laryngectomized individuals, which is attributed to the variability in speech production and the lack of corresponding natural voice recordings. Additionally, the performance degradation observed when applying masking in unimodal settings suggests that the approach may not be universally beneficial across all contexts. The paper also acknowledges that while the proposed method shows promise, further exploration is needed to address the adaptation challenges.
The research has significant implications for assistive technologies, particularly for individuals with speech impairments. By improving silent speech synthesis through non-invasive methods, the work can enhance communication for those who have lost their ability to speak due to medical conditions. The findings could lead to more effective and accessible speech restoration technologies, potentially improving the quality of life for many individuals.
Evaluating text-to-music (TTM) systems remains expensive because music impression (MI) and text alignment (TA) scores rely on human mean opinion scores (MOS). Most automatic MOS estimators are trained with point-wise regression or distributional classification. These objectives do not directly optimize rank-based metrics and provide weak geometric constraints for cross-modal coherence. To address these gaps, we propose DeRA-MOS, a decoupled optimization framework for TTM evaluation. For MI, we introduce a batch-aware listwise ranking loss that models relative order within each mini-batch and better aligns with evaluation based on Spearman's rank correlation coefficient (SRCC). For TA, we introduce a score-anchored modality alignment loss that maps human scores to target audio-text similarity and regularizes the latent space before fusion. By effectively mitigating the point-wise training mismatch and modality drift, experiments on MusicEval demonstrate that our decoupled framework yields substantial improvements in both MI and TA ranking metrics, establishing a robust paradigm for large-scale TTM evaluation.
Primary: Institute of Information Science, Academia Sinica
All Institutions: Institute of Information Science, Academia Sinica, Department of Computer Science and Information Engineering, National Taiwan Normal University, E.SUN Financial Holding Co., United Link Co.
The main contribution of this paper is the introduction of DeRA-MOS, a decoupled optimization framework that significantly enhances the evaluation of text-to-music systems by addressing key challenges in aligning automatic metrics with human judgment. This work represents a meaningful advancement in the field, combining innovative methodology with rigorous experimental validation to improve the assessment of generative audio systems.
The methodology proposed in DeRA-MOS is innovative as it introduces a decoupled optimization framework specifically tailored for text-to-music evaluation. The introduction of the Batch-Aware Listwise Ranking (BALR) loss and the Score-Anchored Modality Alignment (SAMA) loss addresses significant gaps in traditional evaluation methods by focusing on relative ranking and geometric consistency. This dual-loss approach enhances the model's ability to align with human judgment, which is crucial for subjective evaluations like music impression and text alignment. The paper effectively justifies the need for these methods and provides a clear theoretical foundation for their implementation.
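The motivation for a listwise objective can be seen with a generic example. The sketch below is a ListNet-style top-one loss computed over a mini-batch, used here as a stand-in: it is not the paper's BALR loss, whose exact form is not given in this review, and the function names and scores are illustrative.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(values)
    log_total = math.log(sum(math.exp(v - peak) for v in values))
    return [(v - peak) - log_total for v in values]


def listwise_loss(predicted_scores, human_scores):
    """Cross-entropy between the batch distributions induced by human and predicted scores."""
    target = [math.exp(lp) for lp in log_softmax(human_scores)]
    log_pred = log_softmax(predicted_scores)
    return -sum(t * lp for t, lp in zip(target, log_pred))


human = [1.0, 2.0, 3.0]
print(listwise_loss([1.0, 2.0, 3.0], human))  # predictions in the right order
print(listwise_loss([3.0, 2.0, 1.0], human))  # predictions in reversed order
```

Unlike a point-wise regression loss, this objective depends on how the items in a batch score relative to one another, which is closer to what a rank correlation such as SRCC measures.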
The experimental evaluation is robust, utilizing the MusicEval dataset, which is a relevant benchmark for assessing TTM systems. The authors provide comprehensive results demonstrating significant improvements in both music impression (MI) and text alignment (TA) metrics compared to baseline models. The use of multiple evaluation metrics (SRCC, KTAU, MSE) adds depth to the analysis, and the statistical significance testing strengthens the claims of improvement. The experiments are well-structured, and the results convincingly showcase the advantages of the proposed framework.
The paper includes sufficient implementation details, such as the architecture used, training configurations, and hyperparameter settings, which facilitate reproducibility. The authors adhere to standardized data splits for evaluation, enhancing the reliability of their results. The availability of the code on GitHub further supports reproducibility efforts, allowing other researchers to validate and build upon their work.
One limitation noted in the paper is the reliance on the MusicEval dataset, which may restrict the generalizability of the findings. The authors acknowledge that their evaluation is currently limited to this dataset and express the need for broader validation across diverse datasets in future work. Additionally, while the decoupled approach shows promise, the potential trade-offs between global ranking and local accuracy could be further explored.
The proposed framework has significant implications for the field of audio and music evaluation, particularly in enhancing the scalability and efficiency of TTM systems. By improving automatic evaluation methods, the research can facilitate the development of more sophisticated generative models and applications in music synthesis, potentially impacting areas such as interactive music generation, personalized music recommendations, and audio content creation. The methods introduced could also be adapted for other multimodal evaluation tasks, broadening their applicability.
Recent state-of-the-art (SOTA) text-to-speech (TTS) systems typically adopt a cascaded pipeline consisting of a speech tokenizer, an autoregressive large language model (LLM), and a diffusion-based flow-matching (FM) model, with these components trained independently. In this paper, we propose a fully end-to-end (E2E) optimization framework that unifies the training of the speech tokenizer, LLM, FM model, and an additional reward model (RM). Specifically, we first jointly optimize the tokenizer using multi-task objectives derived from reconstruction for FM, next-token prediction for LLM, and multi recognition task for RM. This joint training encourages the discrete speech token space to capture acoustically and semantically salient information that is better tailored to TTS. We then further optimize the LLM using downstream reconstruction and recognition by FM and RM, which reduces inference-time mismatch and steers the LLM toward more preferred generations. Experimental results show that our E2E framework consistently outperforms cascaded baselines. On the Seed-TTS-Eval benchmark, our system achieves word error rates (WER) of 0.78% and 1.56%, a new SOTA result with a 0.6B-parameter LLM and 0.5B-parameter FM model. These results validate that holistic E2E optimization is critical for improving discrete-token-based TTS systems with a much simpler training pipeline.
Primary: Tencent Yuanbao
All Institutions: Tencent Yuanbao
The main contribution of this paper is the introduction of a novel end-to-end training framework for discrete token-based TTS systems, which significantly enhances performance by jointly optimizing the speech tokenizer, LLM, FM model, and reward model. This comprehensive approach addresses critical limitations of traditional cascaded TTS systems, paving the way for more efficient and effective speech synthesis technologies.
The proposed end-to-end (E2E) training framework for TTS systems is innovative, as it integrates multiple components (speech tokenizer, LLM, FM model, and RM) into a unified optimization process. The methodology emphasizes joint training to address the limitations of traditional cascaded approaches, which often lead to performance degradation due to independent training. The use of multi-task objectives and the Gumbel-Softmax for gradient propagation through discrete sampling is a notable technical contribution that enhances the training dynamics and model performance.
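To make the Gumbel-Softmax remark concrete, the following is a minimal, framework-free sketch of the general trick, not the authors' implementation. The function name, the temperature `tau`, and the straight-through pairing of a hard one-hot with a soft relaxation are assumptions for illustration; the paper may use a different variant.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return (soft, hard): a relaxed categorical sample and its one-hot argmax."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise: g = -log(-log(u)) with u ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    # In an autodiff framework, the forward pass would emit `hard` (a discrete
    # token) while gradients flow through `soft` (straight-through estimator).
    winner = soft.index(max(soft))
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return soft, hard
```

Lower values of `tau` push `soft` toward the one-hot vector, at the cost of higher-variance gradients; this trade-off is a property of the relaxation in general rather than something reported in the review.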
The experimental results are robust, demonstrating significant improvements over existing SOTA models on the Seed-TTS-Eval benchmark. The paper provides comprehensive evaluations, including word error rates (WER) and speaker similarity metrics, which are critical for assessing TTS quality. The systematic analysis of different training stages and their impact on performance further strengthens the findings.
The paper outlines a clear training pipeline and methodology, detailing the stages of training and the specific loss functions used. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. Providing access to the trained models and datasets would enhance the ability for others to validate the findings.
One limitation is the reliance on a single dataset for training and evaluation, which may affect the generalizability of the results. Additionally, while the paper discusses the importance of joint optimization, it does not deeply explore the potential computational costs and complexities associated with this approach, which could be a barrier for practical implementation.
The proposed E2E TTS framework has the potential to significantly advance the field of speech synthesis by improving the quality and expressiveness of generated speech. This could have wide-ranging applications in virtual assistants, audiobooks, and other areas where natural-sounding speech is essential. The integration of multimodal capabilities through LLMs also opens avenues for more interactive and context-aware audio systems.
Recent progress in speech dialogue systems requires Text-to-Speech (TTS) models to be faster and more responsive. Modern speech dialogue systems impose two primary requirements on TTS models: low latency and support for streaming inputs and outputs. However, most existing single-codebook LLM-based TTS methods rely on multi-stage pipelines that lack native streaming capabilities. These systems typically suffer from high end-to-end latency due to slow autoregressive prediction and multi-step flow matching. To address these limitations, we propose FlashTTS, an open-source and low-latency streaming TTS framework. FlashTTS introduces a lagged multi-track architecture that natively processes streaming text and speech inputs, thereby eliminating the need for sentence-level buffering. To accelerate acoustic generation, we integrate parallel Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder. This configuration achieves high-fidelity token-to-mel generation in exactly two function evaluations (2-NFE). By jointly optimizing input processing and decoding efficiency, FlashTTS offers a practical foundation for real-time speech dialogue systems. Experiments show that FlashTTS substantially reduces First-Packet Latency to 325ms compared to robust streaming baselines, all while preserving strong zero-shot voice cloning and cross-lingual intelligibility. Speech samples are available. The model code and checkpoints will be released as open source.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Huawei Technologies Co., Ltd
FlashTTS represents a substantial advancement in low-latency TTS systems, effectively addressing the challenges of real-time speech synthesis through innovative architectural design and optimization strategies. The comprehensive evaluation and commitment to open-source release further solidify its potential impact on the field of speech processing.
The methodology presented in FlashTTS is innovative, focusing on a lagged multi-track architecture that supports real-time streaming TTS. The integration of Multi-Token Prediction (MTP) with an X-pred mean flow matching decoder is a significant advancement, allowing for high-fidelity audio generation with minimal latency. The paper effectively addresses the limitations of existing TTS systems by eliminating sentence-level buffering and optimizing both input processing and decoding efficiency. The design choices are well-justified, and the architecture is clearly articulated, making it a robust contribution to the field.
The experimental evaluation is thorough, utilizing a substantial dataset of 300,000 hours of speech data and multiple benchmarks for performance assessment. The results demonstrate a significant reduction in First-Packet Latency (FPL) and First-Token Latency (FTL) compared to existing models, while maintaining strong performance in terms of intelligibility and quality across multiple languages. The use of both objective metrics (WER, SIM) and subjective evaluations (CMOS) provides a comprehensive view of the model's performance.
The paper provides sufficient details regarding the model architecture, training setup, and evaluation metrics, which enhances reproducibility. The authors also commit to releasing the model code and checkpoints as open source, further supporting the community's ability to replicate and build upon their work.
While the paper presents a strong case for the effectiveness of FlashTTS, it does not thoroughly address potential limitations in terms of scalability or the impact of varying input complexities on performance. Additionally, the reliance on a specific architecture (Qwen2-0.5B) may limit generalizability to other frameworks or applications.
The implications of FlashTTS are significant, particularly for real-time speech dialogue systems, which are increasingly important in various applications, including virtual assistants, customer service bots, and interactive gaming. The ability to generate high-fidelity speech with low latency could enhance user experiences across these domains, making technology more accessible and effective.
Flow-matching transformers achieve strong audio separation, yet their attention dynamics are opaque. We adapt established causal-intervention principles into a deterministic, inference-time probing protocol for SAM Audio. Orthogonal probing uncovers a dual-pathway text-conditioning mechanism: additive injections control semantic identity, while cross-attention refines acoustic structure. We observe an asynchronous layerwise convergence: stable layers build temporal scaffolds early, whereas fast layers continue resolving artifacts during sampling. The model also attenuates temporal segmentation cues to maintain continuous-flow stability. Using these insights, we propose Layer-Selective Attention Caching (LSAC), a training-free acceleration method that caches attention in stable layers. Across acoustic complexities, LSAC cuts self-attention computation by about 25% with negligible quality loss and yields up to 6.7x higher quality retention than naive step reduction.
Primary: Hunan University
All Institutions: Hunan University, Jilin University, University of Electronic Science and Technology of China
The main contribution of this paper is the introduction of a causal-intervention probing framework that elucidates the attention dynamics in audio separation models, leading to the development of an efficient attention caching method that significantly improves computational performance while maintaining audio quality. This work represents a meaningful step forward in understanding and optimizing audio foundation models, with potential applications across various domains in machine learning and audio processing.
The paper introduces a novel deterministic probing protocol for understanding attention dynamics in audio separation models. The adaptation of causal-intervention principles is a significant methodological advancement, allowing for a clearer interpretation of how different components of the model interact. The use of orthogonal probing to isolate the dual pathways of text conditioning is innovative and provides a strong foundation for further exploration in audio processing. The proposed Layer-Selective Attention Caching (LSAC) method is practical and offers a clear engineering solution to improve computational efficiency without sacrificing quality.
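The caching idea behind LSAC can be illustrated with a toy sampling loop. This is only a sketch under stated assumptions: `run_sampler`, `stable_layers`, and `attention_fn` are hypothetical names, the cache is filled at the first step and never refreshed, and the real method's choice of layers and refresh policy are not described in the review.

```python
def run_sampler(num_steps, num_layers, stable_layers, attention_fn):
    """Run a toy sampling loop, reusing cached attention in `stable_layers`."""
    cache = {}      # layer index -> attention output saved at an earlier step
    computed = 0    # number of attention evaluations actually performed
    outputs = []
    for step in range(num_steps):
        step_out = []
        for layer in range(num_layers):
            if layer in stable_layers and layer in cache:
                attn = cache[layer]               # reuse: skip the attention call
            else:
                attn = attention_fn(step, layer)  # recompute for fast layers
                computed += 1
                if layer in stable_layers:
                    cache[layer] = attn
            step_out.append(attn)
        outputs.append(step_out)
    return outputs, computed

# Toy usage: 4 steps, 8 layers, layers 0 and 1 treated as stable (an assumption).
outputs, computed = run_sampler(4, 8, {0, 1}, lambda step, layer: (step, layer))
```

In this toy setting the loop performs 26 attention evaluations instead of 32; the actual savings depend on how many layers are stable and how many sampling steps are run.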
The experimental setup is robust, employing a comprehensive evaluation across three complexity tiers with over 10,000 independent runs. The metrics used for evaluation, including Scale-Invariant Signal-to-Noise Ratio and Short-Time Objective Intelligibility, are appropriate for the audio domain. The results convincingly demonstrate the effectiveness of LSAC, with statistically significant findings that support the claims made regarding the dual-pathway conditioning mechanism and the asynchronous convergence of layers.
While the paper provides a detailed methodology, including equations and experimental conditions, there is no mention of code availability or specific datasets used, which could hinder reproducibility. The lack of a project URL or demo further complicates the ability for others to replicate the findings.
The study is confined to the SAM Audio family, which limits the generalizability of the findings. Additionally, the implications of the identified suppression of temporal segmentation capabilities are not fully explored, leaving questions about the broader applicability of the model's behavior in different contexts.
The findings have significant implications for the development of audio separation models and could influence future research in multimodal audio processing. The insights into attention dynamics and the proposed caching method could lead to more efficient audio processing systems, with potential applications in real-time audio applications, such as virtual assistants and automated transcription services.
Maliciously created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to stay ahead of the curve. Yet most detection models are trained to make inferences on frame-level audio features alone, without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically chosen Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which is annotated with EDLFs. The dataset has a balanced number of four spoofed audio attack types and a proportionate number of genuine speech samples. We also include metadata on speaker gender and the generator/source for each spoofed audio sample, offering more granularity for model training. We found that models trained on data augmented with EDLFs performed significantly better than the ASVspoof 2021 deep learning baselines and SSL models like HuBERT and XLSR. LinguAS's augmented linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference of faked speech. Data and code are publicly available.
Primary: University of Maryland, Baltimore County
All Institutions: University of Maryland, Baltimore County
The main contribution of this paper is the introduction of the LinguAS dataset, which integrates expert-defined linguistic features to enhance the detection of audio deepfakes. This innovative approach not only improves model performance but also emphasizes the importance of linguistic insights in machine learning applications, marking a significant advancement in the field of audio processing and deepfake detection.
The methodology is robust, focusing on the creation of the LinguAS dataset, which includes expert-defined linguistic features (EDLFs) that augment traditional audio features. The authors successfully integrate linguistic insights into the dataset creation process, which is a significant methodological advancement. The use of expert linguists to annotate audio samples adds credibility and depth to the dataset, allowing for a nuanced understanding of the features that differentiate genuine and spoofed audio. The systematic approach to feature selection and the inclusion of diverse attack types further enhance the dataset's applicability.
The experimental evaluation is thorough, demonstrating the effectiveness of the EDLFs in improving model performance across various machine learning methods. The authors provide clear comparisons to baseline models, showing significant performance improvements. The use of multiple evaluation metrics, including AUC and EER, adds rigor to the analysis. The experiments are well-structured, and the results are presented in a clear and interpretable manner, showcasing the advantages of the LinguAS dataset.
The paper mentions that the data and code are publicly available, which is crucial for reproducibility. However, specific URLs for the dataset and code repositories are not provided in the text, which limits the ability to directly access the resources. The detailed description of the methodology and experiments allows for replication, but the lack of direct links is a drawback.
The primary limitation noted in the paper is the focus on English language samples, which may restrict the generalizability of the findings to other languages and linguistic contexts. Additionally, the reliance on expert annotation may introduce biases based on the annotators' perceptions. The authors acknowledge the need for future work to expand the dataset and explore additional linguistic features.
The LinguAS dataset has the potential to significantly impact the field of audio deepfake detection by providing a more nuanced understanding of linguistic features that can be leveraged for model training. By emphasizing the importance of linguistic cues, the research encourages a shift towards more interpretable and explainable models in deepfake detection. The ethical considerations discussed, particularly regarding representation and bias, highlight the importance of inclusivity in AI research.
Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
Primary: The University of New South Wales
All Institutions: Northwestern Polytechnical University, The University of New South Wales, WeNet Open Source Community
The main contribution of this paper is the introduction of MeanVC 2, a robust low-latency streaming zero-shot voice conversion system that significantly improves upon its predecessor by addressing key limitations through innovative methodologies. The technical contributions, particularly in the areas of training efficiency and robustness to low-quality audio, position this work as a valuable advancement in the field of audio processing and machine learning.
The paper introduces two significant innovations: Future-Receptive Chunking (FRC) and the Universal Timbre Token Encoder (UTTE). FRC enhances the training efficiency and stability of voice conversion by optimizing the attention mechanism across chunks, allowing for effective processing with smaller chunk sizes. UTTE decouples timbre extraction from the reliance on high-quality reference audio, improving robustness against low-quality inputs. This dual approach addresses critical limitations of the previous MeanVC model, making the methodology both innovative and practical for real-time applications.
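A chunk-level attention mask with bounded past and future context, scheduled per layer, might look like the sketch below. The helper `chunk_mask`, the frame and chunk sizes, and the `layer_schedule` values are all hypothetical; the review does not state the actual schedule used by MeanVC 2.

```python
def chunk_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """mask[i][j] is True when query frame i may attend to key frame j."""
    mask = []
    for i in range(num_frames):
        query_chunk = i // chunk_size
        row = []
        for j in range(num_frames):
            key_chunk = j // chunk_size
            # Allow a bounded window of chunks before and after the query chunk.
            visible = query_chunk - past_chunks <= key_chunk <= query_chunk + future_chunks
            row.append(visible)
        mask.append(row)
    return mask

# Hypothetical per-layer schedule: (past_chunks, future_chunks) for each layer.
layer_schedule = [(2, 1), (2, 0), (2, 0)]
masks = [chunk_mask(8, 2, past, future) for past, future in layer_schedule]
```

Because lookahead accumulates across stacked attention layers, the total future context of the stack is the sum of the per-layer `future_chunks` values (one chunk in this example), which is what keeps algorithmic latency bounded.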
The authors conducted extensive experiments comparing MeanVC 2 with its predecessor and other baseline systems. The results demonstrate significant improvements in audio quality, speaker similarity, and reduced latency, with MeanVC 2 achieving a first-packet latency of 110 ms compared to 211 ms for MeanVC. The use of subjective metrics (NMOS, SMOS) and objective metrics (CER, DNSMOS) provides a comprehensive evaluation of performance, reinforcing the effectiveness of the proposed methods.
The paper mentions that the source code will be publicly released, which is a positive indicator for reproducibility. However, detailed implementation specifics, such as hyperparameters and training configurations, are not exhaustively covered, which could pose challenges for independent replication of the results.
While the proposed methods show substantial improvements, the paper does not address potential limitations in terms of the generalizability of the model across diverse languages or accents. Additionally, the reliance on a specific dataset (Emilia corpus) may limit the applicability of the findings to other datasets or real-world scenarios.
The advancements in low-latency voice conversion have significant implications for various applications, including real-time communication, dubbing, and assistive technologies for individuals with speech impairments. The ability to perform zero-shot voice conversion with improved robustness could enhance user experiences in virtual environments and multimedia content creation.
Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.
Primary: National Technical University of Athens
All Institutions: National Technical University of Athens
The paper presents a novel multi-branch deep learning framework for Parkinson's disease detection from speech, integrating multiple modalities through a context-guided attention mechanism. This approach significantly enhances the robustness and accuracy of speech analysis for clinical applications, addressing critical limitations in existing methodologies.
The proposed methodology leverages a multi-branch deep learning architecture that integrates three distinct modalities—Log-Mel spectrograms, MFCCs, and HuBERT embeddings—through a context-guided cross-modal attention mechanism. This innovative approach addresses the limitations of single-representation models by dynamically focusing on the most informative features across different representations, which is a significant advancement in the field of speech analysis for Parkinson's disease detection. The use of pre-trained models and the attention mechanism enhances the model's ability to capture complex interactions in the data, making it a robust framework for clinical applications.
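As a rough illustration of context-guided weighting, the sketch below scores each temporal embedding against a single global context vector and pools the frames with the resulting softmax weights. It assumes the context and the frame embeddings share one dimensionality and uses a plain scaled dot product; the paper's mechanism presumably includes learned projections that are not shown here, and the function name is hypothetical.

```python
import math

def context_guided_pool(frames, context):
    """Pool per-frame embeddings with weights derived from a global context vector."""
    dim = len(context)
    # Scaled dot-product score between each frame and the context (acting as query).
    scores = [sum(f[k] * context[k] for k in range(dim)) / math.sqrt(dim) for f in frames]
    # Numerically stable softmax over time.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the frame embeddings.
    pooled = [sum(w * f[k] for w, f in zip(weights, frames)) for k in range(dim)]
    return weights, pooled

# Toy usage: three 2-D frame embeddings and a context vector favoring dimension 0.
weights, pooled = context_guided_pool([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, 0.0])
```

Here `context` stands in for the global acoustic summary from the spectrogram and MFCC branches, and `frames` for the temporal HuBERT embeddings.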
The experiments are rigorously designed, employing a speaker-independent 5-fold cross-validation strategy on the Spanish PC-GITA corpus, which is a relevant dataset for the task. The reported metrics, including accuracy, F1-score, and AUC, demonstrate strong performance, with the proposed method achieving an accuracy of 91.51%. The ablation studies further validate the contributions of the context-guided attention mechanism and the integration of multiple modalities, reinforcing the robustness of the experimental setup.
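Speaker independence matters here because each recording is cut into 5-second chunks, and chunks from one speaker must never appear in both training and test data. A minimal sketch of such a split follows; the function name and the round-robin assignment are assumptions, and unlike a real protocol it does not stratify folds by diagnosis label.

```python
def speaker_folds(recordings, num_folds=5):
    """Assign each speaker to exactly one fold; return a list of test-index lists.

    `recordings` is a list of (speaker_id, chunk_id) pairs.
    """
    speakers = sorted({speaker for speaker, _ in recordings})
    # Round-robin assignment of speakers (not chunks) to folds.
    fold_of = {speaker: i % num_folds for i, speaker in enumerate(speakers)}
    folds = [[] for _ in range(num_folds)]
    for index, (speaker, _) in enumerate(recordings):
        folds[fold_of[speaker]].append(index)
    return folds

# Toy usage: 10 hypothetical speakers with 3 chunks each.
recs = [(f"spk{s:02d}", c) for s in range(10) for c in range(3)]
folds = speaker_folds(recs, 5)
```

For each fold, the listed indices form the test set and all remaining indices form the training set, so no speaker contributes to both sides.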
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as code availability or links to a repository. This absence may hinder reproducibility for other researchers looking to validate or build upon the findings.
The study is limited by its reliance on a single dataset (PC-GITA), which may affect the generalizability of the results to other populations or languages. Additionally, the model's performance may be influenced by the quality and variability of the speech data, and the paper does not address potential biases in the dataset.
The findings have significant implications for the early detection of Parkinson's disease through non-invasive speech analysis. The proposed framework could facilitate large-scale screening and monitoring of patients, potentially leading to improved clinical outcomes. Furthermore, the methodology could be adapted for other neurodegenerative disorders, expanding its applicability in the medical field.
Speech foundation models enable strong general-purpose ASR and are attractive for downstream adaptation. However, their size and the catastrophic forgetting induced by sequential fine-tuning demand parameter-efficient and regularized training methods, motivating parameter-efficient continual learning (PECL). While PECL has been widely studied in NLP and vision, it has received less attention in ASR. In this paper, we propose a simple yet effective PECL method based on recent advances in parameter-efficient fine-tuning for ASR. We partition pretrained weight matrices into head and tail subspaces according to singular values and restrict adaptation to approximate rotations within the low-energy tail subspace, preserving dominant components and reducing forgetting. For subsequent tasks, rotations are combined via weight averaging to further improve retention. Experiments on two benchmarks demonstrate reduced forgetting and superior overall performance compared to recent PECL baselines.
Primary: Department Electrical Engineering ESAT-PSI, KU Leuven
All Institutions: Department Electrical Engineering ESAT-PSI, KU Leuven
The paper presents a novel approach to parameter-efficient continual learning for ASR, demonstrating significant advancements in reducing forgetting while maintaining performance across tasks. The methodology is well-founded, and the experimental results validate its effectiveness, marking a meaningful contribution to the field.
The proposed method, Continual SSVD (CSSVD), introduces a novel approach to parameter-efficient continual learning (PECL) for automatic speech recognition (ASR) by partitioning weight matrices into head and tail subspaces based on singular values. This innovative partitioning allows the model to adapt to new tasks while minimizing catastrophic forgetting by restricting updates to the low-energy tail subspace. The methodology is well-structured and builds on existing techniques like structured SVD, demonstrating a clear understanding of the challenges in ASR adaptation. The combination of approximate rotations and weight averaging further enhances the model's ability to retain previously learned knowledge.
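One way to write down the head/tail partition is sketched below. This is illustrative notation only: the split index, the choice to rotate the right singular vectors, the first-order form of the rotation, and the uniform average over K tasks are assumptions, not details confirmed by the review.

```latex
% SVD of a pretrained weight matrix, split into head (large singular values)
% and tail (small singular values) subspaces.
W = U \Sigma V^\top = U_h \Sigma_h V_h^\top + U_t \Sigma_t V_t^\top

% Adaptation for task k touches only the tail: an approximate rotation R_k,
% parameterized to first order by a skew-symmetric matrix A_k.
W_k' = U_h \Sigma_h V_h^\top + U_t \Sigma_t (V_t R_k)^\top,
\qquad R_k \approx I + A_k, \quad A_k^\top = -A_k

% Assumed combination across K sequential tasks by simple averaging.
\bar{R} = \frac{1}{K} \sum_{k=1}^{K} R_k
```

Under this reading, the head term is identical before and after adaptation, which is the sense in which the dominant components are preserved.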
The experiments conducted on two benchmarks provide compelling evidence of the effectiveness of CSSVD. The results show significant improvements in word error rates (WER) and reductions in forgetting compared to various PECL baselines, including LoRA and SSVD. The ablation study adds depth to the evaluation, confirming the importance of the proposed adaptations and their impact on performance. The comprehensive nature of the experiments strengthens the paper's claims and demonstrates the method's robustness across different tasks.
The paper includes sufficient details regarding the implementation, including the model architecture, training procedures, and data sources. The availability of code on GitHub enhances reproducibility, allowing other researchers to replicate the experiments and validate the findings. However, the paper could benefit from more explicit descriptions of hyperparameter choices and their tuning processes.
One limitation of the study is the assumption that task identity is unavailable at inference time, which may not reflect all practical scenarios. Additionally, while the method shows promise, it treats all layers uniformly, which could be suboptimal for tasks with varying layer importance. Future work could explore more selective adaptation strategies that prioritize layers based on task relevance.
The proposed CSSVD method has significant implications for the field of ASR, particularly in applications requiring continual learning without access to previous task data. By addressing the challenges of catastrophic forgetting and parameter efficiency, this work paves the way for more adaptable and robust speech recognition systems. The findings could influence future research in related areas, including NLP and computer vision, where continual learning is also a critical concern.
Acoustic metamaterial (AMM) inverse design is particularly challenging for broadband target responses due to acoustic dispersion: a structure that matches the desired response at one frequency may deviate at others, and modifying geometry to improve one sub-band often perturbs neighboring sub-bands. Yet existing broadband inverse-design approaches are either constrained by predefined templates, or rely on image representations that fail to preserve the geometric precision and structural connectivity required by acoustic structures. We present MetaSeq, a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design. At its core, MetaSeq introduces a language that represents each AMM as a structured sequence, rather than as a pixel grid or fixed template. This representation preserves precise geometry, explicitly encodes connectivity, and casts inverse design as a sequence-to-sequence task from target response to structure sequence. MetaSeq further constructs a balanced, high-fidelity dataset with efficient calibration and complexity-based sampling. To address the one-to-many nature of inverse design, MetaSeq combines supervised pretraining with reinforcement learning fine-tuning guided by a physics-based solver and validity checker. Extensive evaluations against COMSOL and five baselines show that MetaSeq reduces response error by 45% over the best baseline.
Primary: National University of Singapore
All Institutions: National University of Singapore, UT Austin
The main contribution of this paper is the development of a physics-guided, sequence-based generative framework for acoustic metamaterial inverse design that significantly improves upon existing methods by preserving geometric precision and structural connectivity. The comprehensive analysis highlights the innovative approach, rigorous methodology, and potential impact on the field of acoustic metamaterials and machine learning applications in engineering.
The proposed methodology introduces a structured sequence representation for acoustic metamaterial (AMM) inverse design, which is a significant departure from traditional image-based or template-constrained methods. The authors effectively utilize a domain-specific language (DSL) to encode geometric and topological information, allowing for a more flexible and precise representation of AMMs. The combination of supervised pretraining and reinforcement learning fine-tuning is innovative, addressing the one-to-many nature of inverse design while ensuring structural validity. The approach is well-structured and thoughtfully designed to tackle the inherent challenges of acoustic dispersion and the need for broadband responses.
The experiments are extensive, evaluating the proposed framework (MetaSeq) against COMSOL simulations and multiple baselines, including traditional optimization methods. The reported results show a 45% reduction in response error relative to the best baseline, along with improvements in training efficiency and inference latency. However, specific metrics such as the mean squared error (MSE) and other quantitative evaluations are not fully detailed in the provided text; reporting them would make the framework's performance easier to assess.
The paper lacks explicit implementation details, such as code availability or links to datasets, which are crucial for reproducibility. While the methodology is described in detail, without access to the code or datasets, it would be challenging for other researchers to replicate the results. Providing a GitHub repository or similar resource would significantly improve this aspect.
One limitation is the reliance on a specific physical solver, which may not generalize across all types of acoustic metamaterials. Additionally, the complexity of the DSL and the training process may pose a barrier to entry for practitioners unfamiliar with such methods. The paper does not address potential scalability issues when applied to larger or more complex designs.
The framework has significant implications for the design of acoustic metamaterials, which are increasingly relevant in various applications, including noise control, sound absorption, and advanced acoustic devices. By enabling more flexible and efficient design processes, this research could lead to innovations in acoustic engineering and related fields.
This study addresses the challenges composers and sound designers face in creating and refining tools to achieve their musical goals. Using evolutionary processes to promote diversity and foster serendipitous discoveries, we automate the search through uncharted sonic spaces for sound discovery, arguing that diversity-promoting algorithms can bridge the gap between the theoretical realisation and practical accessibility of sounds. We describe a system for generative sound synthesis combining Quality Diversity (QD) algorithms with a supervised discriminative model, inspired by the Innovation Engine algorithm, and explore different configurations and the interplay between the chosen synthesis approach and the discriminative model. We examine the interaction between Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs, introducing a novel approach that uses multiple specialised CPPNs for different frequency ranges; this yields simpler networks while maintaining performance comparable to single-CPPN setups. We also investigate evolutionary stepping stones by analysing goal switches between musical and non-musical contexts, revealing how lineages traverse unlikely paths to current elites. Expanding the behaviour space of a previous study to include various sound durations, we uncover specialisation within temporal niches. Results indicate that CPPN and DSP graphs coupled with a Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) and a deep learning classifier can generate a substantial variety of synthetic sounds, diverse and innovative across temporal and contextual dimensions. We present the generated sound objects through an online explorer and as rendered sound files, and, in the context of music composition, an experimental application that showcases their creative potential across various durations and contexts.
Primary: University of Oslo
All Institutions: University of Oslo
This paper presents a novel approach to sound generation using evolutionary algorithms and deep learning, significantly contributing to the field of generative audio systems. The combination of QD algorithms with a discriminative model and the exploration of temporal niches in sound synthesis showcases the potential for innovative sound design and creative exploration.
The methodology combines Quality Diversity (QD) algorithms with a supervised discriminative model to automate sound generation, leveraging Compositional Pattern Producing Networks (CPPNs) and Digital Signal Processing (DSP) graphs. The introduction of multiple specialized CPPNs for different frequency ranges is a notable innovation, simplifying network complexity while maintaining performance. The use of MAP-Elites for diversity-promoting evolutionary search is well-justified, and the integration of a deep learning classifier (YAMNet) for guiding the evolutionary process is a thoughtful approach that enhances the exploration of sonic diversity.
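The search loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' system: it assumes an Innovation-Engine-style setup in which each classifier class is one archive niche and the classifier's confidence for that class is the fitness. `toy_classifier`, `NUM_CLASSES`, `GENOME_LEN`, and the flat list-of-floats genome are hypothetical stand-ins for YAMNet and the CPPN/DSP genomes.

```python
import random

NUM_CLASSES = 5   # hypothetical number of classifier classes (archive niches)
GENOME_LEN = 8    # hypothetical genome size; real genomes would be CPPN/DSP graphs

def toy_classifier(genome):
    """Stand-in for a pretrained classifier: one confidence in [0, 1] per class."""
    mean = sum(genome) / len(genome)
    return [1.0 - abs(mean - k / (NUM_CLASSES - 1)) for k in range(NUM_CLASSES)]

def mutate(genome, rng, sigma=0.1):
    """Gaussian perturbation, clamped so every gene stays in [0, 1]."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in genome]

def map_elites(iterations, seed=0):
    """Return an archive mapping class index -> (fitness, genome) of the best elite."""
    rng = random.Random(seed)
    archive = {}
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            # Most of the time, vary an existing elite picked uniformly at random.
            _, parent = rng.choice(list(archive.values()))
            child = mutate(parent, rng)
        else:
            # Otherwise inject a fresh random genome.
            child = [rng.random() for _ in range(GENOME_LEN)]
        # The child competes in every class niche and replaces any elite it beats.
        for k, score in enumerate(toy_classifier(child)):
            if k not in archive or score > archive[k][0]:
                archive[k] = (score, child)
    return archive

archive = map_elites(2000, seed=1)
for k in sorted(archive):
    print(k, round(archive[k][0], 3))
```

Because a child can become the elite of a class other than its parent's, lineages can hop between niches, which is the kind of goal switching the paper analyses.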
The experiments are comprehensive, involving multiple configurations and extensive iterations (300,000) to evaluate the effectiveness of the proposed methods. The results demonstrate a substantial variety of generated sounds, with detailed analysis of the evolutionary paths and the impact of different configurations on sound quality. The use of subjective listening sessions adds qualitative depth to the evaluation, although more structured metrics could enhance the robustness of the findings.
The paper provides access to datasets and source code, which is a strong point for reproducibility. However, the reliance on specific configurations and parameter settings may require careful attention from future researchers to replicate results accurately. The use of Git for version control in tracking evolutionary changes is innovative and beneficial for reproducibility.
While the approach is innovative, it may be limited by the classifier's biases and the complexity of the CPPN networks, which could affect the diversity of generated sounds. Additionally, the subjective nature of sound evaluation may introduce variability in perceived quality. The paper could benefit from exploring alternative classifiers or reward models to mitigate these limitations.
The research has significant implications for sound design and music composition, providing tools that can inspire new creative processes. The automated exploration of sonic spaces could democratize access to sound design, allowing composers and sound designers with limited technical expertise to discover novel sounds. The findings could also influence future research in generative audio systems and evolutionary algorithms in creative domains.
Respiratory diseases remain a leading cause of global mortality, where timely and accurate diagnosis is critical to improving patient outcomes and reducing healthcare burdens. While prior work has explored audio-based models for respiratory disease detection, such unimodal approaches often suffer from limited generalizability and diagnostic precision. In this paper, we propose RespiraMFM, a Multimodal Foundation Model that integrates respiratory sounds with patient medical history and symptoms to enhance diagnostic accuracy and disease detection capabilities. We introduce an effective contrastive alignment strategy for audio-text multimodal integration, allowing the model to learn better cross-modal representations between respiratory sounds and corresponding textual clinical information. We evaluate RespiraMFM across five major respiratory diseases using seven real-world datasets in both supervised fine-tuning and zero-shot settings, achieving a 9.15% improvement in AUROC on supervised tasks and a 20.98% gain on zero-shot tasks over existing baselines. These findings underscore the potential of our framework to advance early diagnosis and improve clinical decision-making in respiratory disease management.
Primary: The Ohio State University
All Institutions: The Ohio State University, University of Southern California, University of Chicago
The main contribution of this paper is the development of RespiraMFM, a multimodal foundation model that enhances respiratory disease identification through innovative contrastive audio-language alignment. This work significantly advances the state of the art in multimodal medical diagnostics, showcasing the potential of integrating diverse data sources to improve clinical outcomes.
The paper introduces RespiraMFM, a novel multimodal foundation model that effectively integrates respiratory audio with textual clinical information through a contrastive alignment strategy. The two-stage training architecture, which separates the alignment of audio and text embeddings from the final prediction stage, is innovative and addresses the misalignment issues present in previous models. The use of contrastive learning to align non-linguistic acoustic features with text descriptions is a significant methodological advancement, allowing for improved cross-modal representation learning.
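To make the alignment stage concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired audio and text embeddings. The review does not state RespiraMFM's exact loss, so the CLIP-style formulation, the `temperature` value, and all function names are assumptions for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th audio clip should match the i-th text, and vice versa."""
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Audio-to-text direction: each row is a classification over candidate texts.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: each column is a classification over candidate clips.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)

audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled_text = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(contrastive_loss(audio, aligned_text) < contrastive_loss(audio, shuffled_text))
```

The loss is small when each recording is closest to its own clinical description and large when the pairing is shuffled, which is the property a separate alignment stage would optimise before the prediction stage.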
The evaluation of RespiraMFM across five major respiratory diseases using seven real-world datasets is comprehensive. The paper reports substantial improvements in AUROC scores over existing baselines in both supervised and zero-shot settings, demonstrating the model's robustness and generalization capabilities. The experimental setup is well-structured, with clear delineation of tasks, datasets, and evaluation metrics, which enhances the credibility of the results.
The paper provides sufficient implementation details, including the architecture of the model, training configurations, and data preprocessing steps. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work could benefit from sharing the model and datasets to facilitate independent validation.
The model's performance is contingent on the quality and consistency of the symptom metadata, which can vary across datasets. Additionally, the availability of evaluation data is uneven, particularly for diseases like asthma and pneumonia, which may affect the generalizability of the findings. The authors acknowledge that incorporating additional modalities, such as medical imaging, could further enhance the model's performance.
The proposed model has significant implications for improving early diagnosis and clinical decision-making in respiratory disease management. By leveraging multimodal data, RespiraMFM could potentially reduce healthcare burdens and improve patient outcomes, particularly in resource-constrained settings. The model's ability to generalize to unseen diseases also suggests a promising avenue for future research in medical diagnostics.
Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.
Primary: University of Groningen
All Institutions: University of Groningen
The main contribution of this paper is the introduction of a controlled framework for investigating the role of prosodic cues in sarcasm perception using synthetic speech. This study advances our understanding of how prosody influences sarcasm interpretation and highlights the differences in cue weighting between human listeners and machine models, offering valuable insights for both speech technology and psycholinguistics.
The methodology is robust, utilizing a controlled framework for generating synthetic speech through neural TTS with prompt-based prosodic conditioning. The factorial design allows for the manipulation of prosodic dimensions while maintaining lexical content, which is critical for isolating the effects of individual prosodic cues on sarcasm perception. The use of a multimodal foundation model (Qwen3-Omni) for comparative analysis of human and machine perception is innovative, providing insights into cue weighting differences between biological and artificial systems.
The experiments are well-structured, involving a significant number of participants (66) for human perception ratings and utilizing a foundation model for machine evaluations. The statistical analyses are thorough, employing mixed-effects models to account for variability among participants and items. The results are compelling, highlighting the dominance of loudness in human sarcasm perception while revealing the model's reliance on speech rate. The use of orthogonal stimulus sets enhances the validity of the findings.
The paper provides sufficient detail regarding the experimental design, stimulus generation, and analysis methods, which supports reproducibility. The use of a publicly available TTS model and clear descriptions of the statistical methods used further enhance the reproducibility of the study.
One limitation is the reliance on a single synthetic speaker voice, which may affect the generalizability of the findings. Additionally, while the study effectively isolates prosodic features, the potential for subtle variations in unmeasured dimensions remains a concern. The inter-rater reliability for individual human ratings was low, which may indicate variability in perception among participants.
This research has significant implications for understanding sarcasm perception in both human and machine contexts. The findings could inform the development of more sophisticated conversational agents and improve the design of speech synthesis systems that can convey nuanced emotional cues. The methodological framework established in this study could pave the way for future research in prosody and speech perception across different languages and contexts.
In this work, we analyze the ability of NCSN++ U-Net based audio dereverberation models to capture global room characteristics in their intermediate representations. Through an empirical study of both a state-of-the-art diffusion-based model and a discriminative counterpart, we show that deeper layers encode structured room impulse response (RIR)-dependent embeddings. Moreover, the discriminative ability of this implicit room representation correlates with dereverberation performance across objective metrics. Motivated by this observation, we propose a training strategy that explicitly conditions the network on pre-trained RIR embeddings, obtained via self-supervised contrastive learning. Incorporating RIR conditioning improves representation quality, accelerates convergence, and enhances dereverberation performance, while significantly reducing the number of reverse diffusion steps required by the diffusion-based model during inference.
Primary: University of Hamburg
All Institutions: University of Hamburg
The main contribution of this paper is the introduction of a conditioning strategy that leverages pre-trained RIR embeddings to enhance the performance of U-Net based audio dereverberation models. This work significantly advances the understanding of how deep learning models can implicitly learn room characteristics and improve dereverberation performance, marking a valuable addition to the field of audio signal processing.
The paper presents a novel approach to audio dereverberation by analyzing the intermediate representations of NCSN++ U-Net models and their ability to capture room characteristics. The introduction of a conditioning strategy using pre-trained RIR embeddings via self-supervised contrastive learning is a significant methodological contribution. The use of FiLM for conditioning the U-Net architecture enhances the model's ability to learn structured representations, which is a thoughtful integration of existing techniques. The methodology is well-structured, with clear definitions of the training objectives and loss functions.
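As a rough illustration of the conditioning mechanism, FiLM (feature-wise linear modulation) scales and shifts each feature channel with values predicted from a conditioning vector. The sketch below assumes a single linear projection from the RIR embedding to the per-channel scale `gamma` and shift `beta`; the layer placement, dimensions, and all names are hypothetical and not taken from the paper.

```python
def linear(weights, bias, x):
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Apply y[c][t] = gamma[c] * x[c][t] + beta[c], with gamma and beta predicted from cond.

    features: list of channels, each a list of time-frequency values
    cond:     conditioning vector (here standing in for an RIR embedding)
    """
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [[g * v + b for v in channel]
            for channel, g, b in zip(features, gamma, beta)]

features = [[1.0, 2.0], [3.0, 4.0]]   # two channels, two frames
rir_embedding = [0.5, -1.0]           # hypothetical 2-D room embedding

# gamma = [2*0.5, 1*(-1.0)] = [1.0, -1.0]; beta = [1.0, -1.0]
modulated = film(features, rir_embedding,
                 w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
                 w_beta=[[0.0, 0.0], [0.0, 0.0]], b_beta=[1.0, -1.0])
print(modulated)  # [[2.0, 3.0], [-4.0, -5.0]]
```

With zero projection weights, `gamma` bias 1, and `beta` bias 0, the layer is the identity, so in this sketch conditioning can only add room information on top of the unconditioned path.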
The experiments are robust, utilizing the VCTK corpus and a well-defined dataset for evaluating the dereverberation performance. The paper employs objective metrics such as PESQ and DNSMOS, which are appropriate for assessing audio quality. The results demonstrate a clear correlation between the strength of learned representations and dereverberation performance, providing empirical support for the proposed conditioning strategy. However, further exploration of subjective evaluations or user studies could enhance the findings.
The paper provides sufficient details regarding the experimental setup, including dataset splits, model architectures, and training hyperparameters. The inclusion of a GitHub repository for the RIR encoder enhances reproducibility. However, more explicit instructions on how to replicate the entire experimental setup would be beneficial for readers.
While the paper effectively demonstrates improvements in dereverberation performance, it does not address potential limitations of the proposed method, such as its performance in real-world scenarios compared to controlled environments. Additionally, the reliance on specific datasets may limit the generalizability of the findings.
The findings of this research have significant implications for audio processing applications, particularly in enhancing speech intelligibility in reverberant environments. The proposed methods could be applied in various domains, including telecommunications, hearing aids, and virtual communication platforms, where clear audio quality is crucial. The integration of RIR embeddings could also inspire further research into conditioning frameworks in other audio tasks.
Removing intermediate representations and separately trained decoding stages has become an important direction in generative modeling. In text-to-speech, however, high-quality systems are still commonly built through an intermediate acoustic representation before waveform synthesis. In this work, we present BareWave, a fully waveform-native framework for direct text-to-wave generation in flow-matching TTS. We consider this setting to raise three training challenges: raw-waveform modeling lacks a strong pretrained representational scaffold, different stages of training benefit from different noise schedules, and data-space perceptual objectives do not automatically share the temporal structure of the velocity-space flow objective. As a result, direct waveform training is hard to optimize efficiently, hard to push toward a strong final operating point with a fixed recipe, and hard to integrate effective perceptual refinement. Guided by this view, we develop a direct text-to-wave training framework that combines training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), while preserving a single waveform-native inference path without pretrained components at test time. Experiments on zero-shot voice cloning show that strong intelligibility, speaker similarity, and naturalness can be achieved under a fully waveform-native inference path, supporting waveform-native flow-matching TTS as a practical direction. Project page with audio demos is available at https://barewave.github.io/.
Primary: Anhui Province Key Laboratory of Digital Security
All Institutions: Anhui Province Key Laboratory of Digital Security, Alibaba Group, Tongyi Fun Team
The main contribution of this paper is the introduction of BareWave, a fully waveform-native framework for text-to-speech synthesis that effectively addresses key challenges in direct waveform generation. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights its potential to advance the state-of-the-art in TTS systems.
The paper introduces BareWave, a novel framework for text-to-speech (TTS) that aims to eliminate intermediate acoustic representations and pretrained components during inference. The methodology is well-structured, addressing three critical training challenges: the absence of a pretrained scaffold, the need for varying noise schedules during training, and the integration of perceptual objectives. The proposed solutions, including training-time representation alignment, staged noise scheduling, and velocity-aware perceptual alignment (VAPA), are innovative and effectively contribute to the framework's performance.
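For readers unfamiliar with the velocity-space objective mentioned above, the following is a generic flow-matching sketch on raw samples. It assumes the common linear interpolation path between Gaussian noise and data; BareWave's actual path, noise schedules, and VAPA loss are not specified in this review, and the function name and toy waveform are illustrative.

```python
import random

def flow_matching_pair(data, t, rng):
    """Build one training example for linear-path flow matching.

    Returns the sampled noise, the interpolated point x_t, and the target
    velocity that a network would be trained to predict at (x_t, t).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]  # constant along the straight path
    return noise, x_t, velocity

rng = random.Random(0)
waveform = [0.1, -0.3, 0.25, 0.0]   # toy stand-in for a raw waveform frame
t = 0.25
noise, x_t, velocity = flow_matching_pair(waveform, t, rng)

# Following the target velocity for the remaining time (1 - t) lands on the data.
recovered = [x + (1.0 - t) * v for x, v in zip(x_t, velocity)]
print(all(abs(r - d) < 1e-9 for r, d in zip(recovered, waveform)))
```

The regression target lives in velocity space while perceptual losses are defined on waveforms, and the last two lines show the simple relation between the two under this assumed path: a data-space estimate is recoverable from `x_t` and the velocity.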
The experiments conducted on zero-shot voice cloning demonstrate the efficacy of BareWave, showcasing strong results in intelligibility, speaker similarity, and naturalness. The paper provides comprehensive ablation studies that highlight the importance of each component in the training design, reinforcing the contributions of VAPA and staged noise scheduling. The results are competitive with existing systems, indicating that the proposed method is a viable alternative to traditional TTS approaches.
The paper describes the implementation in detail, including model architecture, training strategies, and evaluation metrics, which enhances reproducibility. However, the lack of a public code repository limits others' ability to replicate the results fully. The authors mention that code and checkpoints will be released soon, which is a positive sign for future reproducibility.
One limitation is the reliance on a specific dataset (Emilia) for training, which may affect the generalizability of the results to other datasets or languages. Additionally, while the paper addresses the challenges of direct waveform synthesis, it does not explore the potential computational costs associated with the proposed training methods, which could be significant.
The implications of this research are substantial, as it pushes the boundaries of TTS technology by enabling direct text-to-waveform synthesis without intermediate steps. This could lead to more efficient TTS systems that are easier to deploy and maintain. The potential applications range from voice assistants to personalized speech synthesis, making it relevant for various industries.
Attributing a synthetic utterance to its originating system remains an open challenge: closed-set models fail to reject unseen synthesizers and produce overconfident predictions. To address this, we propose a dual-branch gated fusion framework that pairs XLSR-53 with CORES, a 66-dimensional descriptor that, unlike prior Linear Filter Bank (LFB)-only work, spans cepstral, oscillatory, rhythmic, energy, and spectral dimensions to capture complementary synthesis artifacts. Our analysis shows XLSR-53 remains discriminative in-domain (ID) while CORES generalizes stably under distribution shift (OOD), yet their naive concatenation fails due to SSL representational imbalance. To resolve this, an input-conditioned gate adaptively weights each branch under joint training with cross-entropy, an energy margin loss for ID/OOD separation, and a gate diversity term. On the MLAAD benchmark, our system achieves 97.6% ID accuracy, 4.9% EERc, and an 83.5% relative FPR95 reduction over the Interspeech 2025 baseline.
Primary: University of Michigan
All Institutions: University of Michigan-Flint, ProbeTruth Inc, University of Michigan
The main contribution of this paper is the development of a dual-branch gated fusion framework that enhances open-set audio deepfake source tracing by effectively combining self-supervised learning with handcrafted acoustic descriptors. This innovative approach addresses critical challenges in the field, demonstrating strong performance on a demanding benchmark and offering valuable insights into the interplay between different feature representations in audio analysis.
The paper introduces a dual-branch gated fusion framework that effectively combines self-supervised learning (SSL) representations with handcrafted acoustic descriptors. The use of a gating mechanism to adaptively weight contributions from each branch based on input characteristics is a significant methodological advancement. This approach addresses the challenge of overconfidence in predictions made by closed-set models and enhances the system's ability to generalize under out-of-distribution (OOD) conditions. The methodology is well-structured, with a clear rationale for the design choices, such as the selection of features for the CORES descriptor and the training objectives that balance ID classification and OOD detection.
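The two ingredients named above, an input-conditioned gate and an energy-based ID/OOD score, can be sketched as follows. This is an illustrative simplification under stated assumptions: both branches are taken to be already projected to the same dimension, the gate is a single scalar from a logistic unit, and the energy is the standard negative log-sum-exp of the classifier logits. The paper's actual gate shape, margin values, and diversity term are not given in this review, and all names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_fusion(ssl_feat, cores_feat, gate_w, gate_b):
    """Blend two same-sized branch embeddings with a gate computed from both inputs."""
    joint = ssl_feat + cores_feat                     # concatenation feeds the gate
    g = sigmoid(sum(w * x for w, x in zip(gate_w, joint)) + gate_b)
    fused = [g * s + (1.0 - g) * c for s, c in zip(ssl_feat, cores_feat)]
    return fused, g

def energy_score(logits, temperature=1.0):
    """E(x) = -T * log(sum(exp(logit / T))); lower energy suggests an in-domain input."""
    m = max(logits)
    lse = m + temperature * math.log(
        sum(math.exp((l - m) / temperature) for l in logits))
    return -lse

ssl_feat = [1.0, 0.0]      # stand-in for a projected XLSR-53 embedding
cores_feat = [0.0, 1.0]    # stand-in for a projected CORES descriptor
fused, g = gated_fusion(ssl_feat, cores_feat, gate_w=[0.0] * 4, gate_b=0.0)
print(g, fused)            # 0.5 [0.5, 0.5]

confident = energy_score([10.0, 0.0, 0.0])
uncertain = energy_score([0.0, 0.0, 0.0])
print(confident < uncertain)
```

A margin loss on this energy would push known-source samples below one threshold and unseen-source samples above another, so an utterance can be rejected by thresholding `energy_score` instead of trusting an overconfident softmax.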
The experiments are conducted on the MLAAD benchmark, which is a demanding open-set evaluation set featuring numerous unseen synthesizers. The results demonstrate strong performance metrics, including a 97.6% accuracy for in-domain classification and significant improvements in OOD detection metrics compared to existing baselines. The ablation studies effectively illustrate the necessity of the gating mechanism and the complementary nature of the feature sets used. However, the paper could benefit from a more detailed comparison against a wider range of existing methods to contextualize its contributions further.
The implementation details are provided with sufficient clarity, including the training procedure, hyperparameters, and loss functions used. However, the absence of a publicly available code repository limits the reproducibility of the results. The authors should consider making their code and models available to facilitate further research and validation of their findings.
One limitation noted is the potential for gate collapse when faced with high source diversity, which can undermine the adaptive routing mechanism. Additionally, while the paper discusses future work, it does not fully explore the implications of the proposed method in real-world scenarios, such as its robustness against adversarial attacks or its applicability to other audio domains.
The proposed framework has significant implications for digital forensics and accountability, particularly in combating the rise of audio deepfakes. By improving the ability to trace the source of synthetic utterances, this work contributes to the development of trustworthy AI systems. The methodology could be adapted for other applications in audio analysis, such as speaker verification and emotion recognition, thus broadening its impact across various fields.
Codec-based autoregressive (AR) speech language models have achieved strong text-to-speech (TTS) quality by modeling speech as sequences of discrete audio tokens with large pretrained backbones. However, this token-level formulation creates a structural efficiency bottleneck: speech-token sequences are much longer than text sequences, requiring the AR backbone to perform causal computation at every token position and maintain a KV cache that grows with the sequence length. We introduce TLDR, a patch-based autoregressive framework that accelerates codec-based AR-TTS by shifting the causal modeling from token-level speech sequences to patch-level sequences. TLDR groups consecutive codec tokens into compact latent patches using a lightweight compressor, models the resulting shorter patch sequence with a frozen pretrained AR-TTS backbone adapted by LoRA, and reconstructs fine-grained speech tokens within each patch using a speaker-conditioned extractor. With a patch size of 4, TLDR achieves a 1.8x inference speedup over the baseline AR-TTS model and reduces global KV-cache memory by up to 75%. Experimental results indicate that patch-level global causal modeling can be a practical way to reduce the inference cost of pretrained codec-based AR-TTS systems without replacing the existing modules.
Primary: Sungkyunkwan University
All Institutions: Sungkyunkwan University, University of Seoul
The main contribution of this paper is the introduction of TLDR, a patch-based autoregressive framework that enhances the efficiency of codec-based TTS systems while maintaining high speech quality. This innovative approach addresses significant bottlenecks in traditional autoregressive models, offering a promising direction for future research and applications in the field of speech synthesis.
The methodology presented in TLDR is innovative in its approach to autoregressive text-to-speech (TTS) systems by introducing a patch-based framework that reduces the inference cost associated with traditional token-level autoregressive models. The authors effectively leverage a lightweight compressor to group discrete audio tokens into patches, which allows for a significant reduction in the number of autoregressive decoding steps and the size of the KV cache. The use of a pretrained AR-TTS backbone adapted through LoRA and the introduction of a speaker-conditioned extractor are notable contributions that enhance the model's ability to maintain speech quality while improving efficiency. The methodology is well-structured, with clear delineation of components and their functions.
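The efficiency argument above comes down to simple counting, which the following minimal sketch makes concrete. It is not the authors' implementation: the function names and the `pad_id` padding convention are hypothetical, and the sketch counts only positions seen by the global backbone, ignoring the cost of the compressor and extractor.

```python
import math


def group_into_patches(tokens, patch_size, pad_id=0):
    """Group consecutive codec tokens into fixed-size patches.

    The last patch is right-padded with pad_id (a hypothetical convention).
    """
    patches = []
    for start in range(0, len(tokens), patch_size):
        patch = list(tokens[start:start + patch_size])
        patch += [pad_id] * (patch_size - len(patch))
        patches.append(patch)
    return patches


def patch_stats(num_tokens, patch_size):
    """Count global AR steps and the relative KV-cache saving after patching."""
    num_patches = math.ceil(num_tokens / patch_size)
    return {
        "ar_steps": num_patches,       # one causal step per patch
        "kv_entries": num_patches,     # one cached key/value position per patch
        "kv_reduction": 1.0 - num_patches / num_tokens,
    }


stats = patch_stats(num_tokens=1000, patch_size=4)
print(stats["ar_steps"], stats["kv_reduction"])
```

With a patch size of 4, the global sequence shrinks to a quarter of its length, which matches the reported upper bound of a 75% reduction in global KV-cache memory. The measured speedup (1.8x) is smaller than 4x, plausibly because the per-patch modules add their own cost.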
The experimental evaluation is robust, utilizing a substantial dataset (585 hours of LibriTTS) and multiple evaluation metrics, including word error rate (WER), speaker similarity, and real-time factor (RTF). The results demonstrate that TLDR achieves a balance between quality and efficiency, with significant improvements in inference speed and memory usage without substantially degrading the quality of the generated speech. The subjective evaluations further validate the model's performance, indicating that it preserves speaker similarity and naturalness comparable to the baseline model.
The paper provides detailed implementation specifics, including the architecture of the token-to-patch compressor and the patch-to-token extractor, as well as the training setup and evaluation protocols. However, the absence of a public code repository or demo URL limits reproducibility. Future work could benefit from making the code available to facilitate further exploration and validation of the proposed methods.
The study is limited to a single AR-TTS backbone (CosyVoice3) and a fixed patch size, which may not generalize across other models or datasets. The subjective evaluation is conducted with a relatively small group of English speakers, which may not capture the diversity of speaker characteristics and preferences. Additionally, the paper acknowledges the need for adaptive patching to improve the quality-latency trade-off, suggesting that the current implementation may not fully exploit the potential of the proposed approach.
TLDR has the potential to significantly reduce the computational resources required for high-quality TTS systems, making them more accessible on resource-constrained devices. This could lead to broader applications in areas such as assistive technologies, voice synthesis for content creation, and more efficient deployment of TTS systems in various industries. However, the risks associated with TTS technology, such as voice forgery, remain a concern that must be addressed through responsible deployment practices.
Recent speech-aware large language models (Speech-LLMs) rely on a pre-trained speech encoder to convert audio into semantic-rich representations consumable by an LLM. In this work, instead, we explore: can an LLM learn to read Mel spectrograms directly without a dedicated speech encoder? We propose Mel-LLM, an encoder-free Speech-LLM that feeds lightly pre-processed Mel spectrogram patches directly into the LLM through a linear projection, allowing the LLM to learn speech-text alignment purely through its own parameters. We conduct extensive experiments on both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. For ASR, we evaluate on the OpenASR leaderboard public sets and production-level scaling experiments, demonstrating that the encoder-free solution achieves competitive performance with only limited degradation compared to encoder-initialized counterparts. We find that when data is limited, initialization from a multimodal checkpoint (Phi-4-MM) is crucial for maintaining performance. We also present ablation studies revealing which LLM layers are less relevant to speech encoding. For TTS, we show preliminary results with a next-token VAE approach. While TTS performance is not yet optimal, these results establish the feasibility of a fully unified encoder-free architecture for autoregressive speech-text modeling.
Primary: Microsoft CoreAI
All Institutions: Microsoft CoreAI
The main contribution of this paper is the introduction of Mel-LLM, an encoder-free architecture that demonstrates the feasibility of directly processing Mel spectrograms with a large language model for both ASR and TTS tasks. This work significantly advances the field by simplifying the architecture of speech-language models and providing a foundation for future research in unified multimodal systems.
The paper proposes a novel architecture, Mel-LLM, which eliminates the need for a dedicated speech encoder by allowing a large language model (LLM) to directly process Mel spectrograms. This approach is innovative as it leverages the LLM's inherent capabilities to learn modality-specific processing, which has not been extensively explored in the context of speech-language models. The methodology includes a linear projection of spectrogram patches into the LLM and employs a series of experiments to validate the architecture's effectiveness across automatic speech recognition (ASR) and text-to-speech (TTS) tasks. The use of ablation studies to identify the contributions of different LLM layers is a strong methodological aspect, providing insights into the model's functioning.
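The input path described above (Mel patches followed by a single linear projection) can be illustrated with a small pure-Python sketch. It assumes that a patch is a run of consecutive frames flattened into one vector and that incomplete trailing frames are dropped. Both choices, along with the function names and the toy dimensions, are assumptions for illustration and are not taken from the paper.

```python
def patchify(mel, frames_per_patch):
    """Flatten each run of consecutive Mel frames into one patch vector.

    mel is a list of frames, each a list of Mel-bin values.
    Frames that do not fill a complete patch are dropped (an assumption).
    """
    num_patches = len(mel) // frames_per_patch
    patches = []
    for i in range(num_patches):
        chunk = mel[i * frames_per_patch:(i + 1) * frames_per_patch]
        patches.append([value for frame in chunk for value in frame])
    return patches


def linear_project(patches, weight, bias):
    """Apply y = x W + b, with weight of shape (patch_dim, d_model)."""
    d_model = len(bias)
    return [
        [sum(x[k] * weight[k][j] for k in range(len(x))) + bias[j]
         for j in range(d_model)]
        for x in patches
    ]


# Toy input: 10 frames with 3 Mel bins each, 4 frames per patch, d_model = 5.
mel = [[float(t + b) for b in range(3)] for t in range(10)]
patches = patchify(mel, frames_per_patch=4)
weight = [[0.01 * (k + j) for j in range(5)] for k in range(12)]
embeddings = linear_project(patches, weight, bias=[0.0] * 5)
print(len(embeddings), len(embeddings[0]))
```

The resulting vectors would stand in for token embeddings at the LLM input. Everything after that point is learned by the LLM itself, which is the sense in which the design is encoder-free.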
The experiments are comprehensive, utilizing various datasets for ASR and TTS tasks, including evaluations on the OpenASR leaderboard. The results demonstrate competitive performance against encoder-initialized models, particularly when sufficient data is available. The paper also includes detailed ablation studies that assess the impact of different initialization strategies and downsampling rates, which adds depth to the experimental evaluation. However, the TTS results are preliminary and indicate that further optimization is necessary.
The paper provides a detailed account of the model architecture, training configurations, and evaluation metrics, which supports reproducibility. However, the absence of a public repository or demo URL limits the practical reproducibility of the results. The training data sources are specified, but without access to the code or trained models, independent verification of the results may be challenging.
One significant limitation is the preliminary nature of the TTS results, which are not yet optimal and suggest that the architecture requires further refinement for practical applications. Additionally, the reliance on the Phi-4-MM initialization for low-resource settings indicates a potential bottleneck in scenarios where such pre-trained models are not available. The paper also acknowledges that the ASR and TTS tasks were explored separately, which may limit the understanding of their joint modeling potential.
The implications of this work are substantial, as it opens new avenues for simplifying speech-language models by integrating speech and text processing within a single architecture. This could lead to more efficient systems that reduce computational overhead and improve performance in real-world applications, such as voice assistants and automated transcription services. The encoder-free approach may also inspire future research in multimodal learning and the development of unified models for various tasks.
Transformer-based architectures have led to significant improvements in Automatic Speech Recognition (ASR), often at the cost of substantially increased model sizes. A promising approach to address this issue is layer sharing through depth recursion, commonly referred to as the Recursive-Transformer, which involves repeatedly applying the same layers within the model. Despite its potential shown in other fields, this technique remains relatively unexplored in ASR. In this paper, we present an experimental study of the Recursive-Transformer applied to ASR encoder architectures. We systematically investigate the impact of recursion depth and layer allocation within the Recursive-Transformer. Our results demonstrate that the Recursive-Transformer is a viable alternative, especially when recurrence is applied in the latent space with a restricted number of loops, obtaining comparable performance while reducing the parameter count by 66%.
Primary: INESC-ID, Portugal
All Institutions: INESC-ID, Instituto Superior Técnico
The paper presents a novel approach to ASR by introducing the Recursive-Transformer and its Latent variant, demonstrating significant parameter reduction while maintaining performance. The methodology is rigorous, and the findings have the potential to influence future ASR model designs, particularly in resource-constrained environments.
The paper proposes the Recursive-Transformer and its variant, the Latent-Recursive-Transformer, which innovatively applies depth recursion in ASR encoder architectures. The methodology is well-structured, systematically investigating recursion depth and layer allocation, and includes a thorough analysis of layer similarity in ASR encoders. The introduction of a modular structure (Prelude-Recurrent-Coda) is a notable contribution that enhances the understanding of layer interactions.
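The Prelude-Recurrent-Coda structure mentioned above can be sketched as a plain control-flow pattern: unshared layers at the start and end, with a shared block looped in between. The sketch treats layers as generic callables and uses an illustrative layer allocation (1 prelude, 2 recurrent, 1 coda, 5 loops) that is an assumption, not the paper's configuration. The parameter estimate also assumes that all layers have the same size.

```python
def recursive_forward(x, prelude, recurrent, coda, num_loops):
    """Apply prelude layers once, the shared recurrent block num_loops times, then coda layers."""
    for layer in prelude:
        x = layer(x)
    for _ in range(num_loops):
        for layer in recurrent:  # the same layer objects are reused on every loop
            x = layer(x)
    for layer in coda:
        x = layer(x)
    return x


def effective_depth(n_prelude, n_recurrent, n_coda, num_loops):
    """Number of layer applications in one forward pass."""
    return n_prelude + n_recurrent * num_loops + n_coda


def parameter_reduction(n_prelude, n_recurrent, n_coda, num_loops):
    """Fraction of layer parameters saved versus an unshared stack of equal depth."""
    unique_layers = n_prelude + n_recurrent + n_coda
    return 1.0 - unique_layers / effective_depth(n_prelude, n_recurrent, n_coda, num_loops)


# Illustrative allocation: 4 unique layers acting as a 12-layer encoder.
print(effective_depth(1, 2, 1, 5), round(parameter_reduction(1, 2, 1, 5), 3))
```

Under these assumed numbers, the shared stack stores one third of the layer parameters of a 12-layer encoder, a saving of the same order as the 66% reported in the abstract. The actual allocation used by the authors may differ.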
The experiments are robust, utilizing established datasets like LibriSpeech and AISHELL-1, and include a variety of configurations to evaluate the impact of recursion depth and layer allocation. The results show a significant reduction in parameters while maintaining competitive performance, indicating a strong experimental design. However, the paper could benefit from more detailed statistical analysis of the results to bolster claims of significance.
The paper provides sufficient implementation details, including the use of the SpeechBrain toolkit and specific configurations for training. However, it lacks a public repository for code and models, which would enhance reproducibility.
The study does not explore dynamic depth strategies in detail, which could limit the generalizability of the findings. Additionally, the focus on English and Mandarin datasets may restrict applicability to other languages and dialects.
The proposed architectures could significantly impact on-device ASR applications, making them more efficient by reducing model size without sacrificing performance. This could lead to broader adoption of ASR technologies in various consumer devices, enhancing accessibility and user experience.
Large audio-language models (LALMs) increasingly use explicit reasoning traces for complex audio understanding, yet the evaluation of reasoning quality remains underexplored. Although process-level benchmarks for process reward models (PRMs) have advanced reasoning evaluation in text and multi-modal domains, comparable evaluation for audio reasoning remains limited. In this paper, we present AudioProcessBench, a comprehensive benchmark for step-level process error identification in audio reasoning. AudioProcessBench contains diverse reasoning traces generated by 6 audio and omni language models. Each trace is segmented into discrete reasoning steps and annotated with binary step correctness and fine-grained error types. Our benchmark evaluates models under three complementary paradigms: (1) step correctness identification, (2) error-type-conditioned detection for diagnosing audio-specific verifier capacities, and (3) chain-level aggregation, where verifiers select or aggregate among multiple reasoning traces for the same question. This design enables a systematic analysis of whether current models can detect process errors, whether their weaknesses differ across audio-specific error types, and whether process verification translates into improved answer selection. AudioProcessBench provides a testbed for future research on audio reasoning verifiers, process reward models, and reliable omni-modal reasoning.
Primary: Monash University
All Institutions: Monash University, Xi'an Jiaotong-Liverpool University, Orygen, University of Melbourne
This paper presents AudioProcessBench, a benchmark for evaluating process-level verification in audio-grounded reasoning. The comprehensive methodology and detailed experimental evaluation highlight its significance in advancing the field of audio-language models, particularly in understanding and diagnosing reasoning errors.
The paper introduces AudioProcessBench, a novel benchmark specifically designed for evaluating process-level verification in audio-grounded reasoning. The methodology includes a systematic approach to segment reasoning traces into discrete steps, annotate them with correctness labels, and categorize errors into six distinct types. The use of both automated and human annotation processes enhances the reliability of the benchmark. The three complementary evaluation paradigms—step correctness, error-type-conditioned detection, and chain-level aggregation—provide a comprehensive framework for assessing model performance. This structured approach is a significant advancement over existing benchmarks that primarily focus on final-answer accuracy.
The experiments conducted on 11 audio and omni-modal language models demonstrate the effectiveness of the proposed benchmark. The results indicate that newer and reasoning-oriented models perform significantly better in process verification tasks. The evaluation metrics, including PRMScore and error-type-specific performance, are well-defined and allow for detailed insights into model capabilities. The analysis of self-critique bias and the relationship between generation and criticism further enriches the experimental findings. However, the paper could benefit from a more extensive comparison with existing benchmarks to contextualize its contributions better.
The paper provides detailed descriptions of the data construction process, annotation protocols, and evaluation metrics, which are essential for reproducibility. However, the lack of publicly available code or datasets limits the ability for independent verification of results. Future work should consider releasing the benchmark and associated tools to facilitate broader adoption and validation.
The benchmark inherits biases from the source datasets and is limited in its coverage of open-ended audio reasoning tasks. The authors acknowledge that the current scale of AudioProcessBench is modest compared to larger answer-level benchmarks, particularly for rare error types. Additionally, the reliance on semi-automated annotations may introduce some ambiguity and inconsistency in labeling.
The introduction of AudioProcessBench has the potential to significantly advance research in audio-grounded reasoning and process verification. By providing a structured framework for evaluating reasoning models, it encourages the development of more reliable audio-language models. The findings could influence future research directions, particularly in improving model robustness and understanding the intricacies of audio perception and reasoning.
Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
Primary: Centro di Sonologia Computazionale, University of Padova
All Institutions: Centro di Sonologia Computazionale, University of Padova, Music Technology Group, Universitat Pompeu Fabra
The paper presents LilyBench, the first benchmark for evaluating symbolic music generation and understanding using LLMs, significantly advancing the field by providing a comprehensive evaluation framework and revealing critical insights into model performance.
The paper introduces a novel benchmark, LilyBench, which evaluates both symbolic music generation and understanding using large language models (LLMs) in a unified framework. The methodology is well-structured, utilizing a combination of generation prompts and understanding tasks adapted from existing benchmarks. The use of multiple evaluation metrics (compile rate, Jensen-Shannon similarity, and Fréchet Music Distance) provides a comprehensive approach to assess model performance, revealing the strengths and weaknesses of different evaluation strategies.
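For readers unfamiliar with the descriptor-based metric mentioned above, the following sketch computes a Jensen-Shannon similarity between two histograms, such as a pitch-class histogram from generated music and one from a reference corpus. It assumes that similarity is defined as one minus the base-2 Jensen-Shannon divergence. The benchmark's exact definition may differ, and the function names are hypothetical.

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = float(sum(counts))
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_similarity(counts_a, counts_b):
    """Assumed definition: 1 - JSD, so identical histograms score 1."""
    return 1.0 - js_divergence(normalize(counts_a), normalize(counts_b))


generated = [12, 0, 7, 3, 9, 5, 0, 11, 2, 6, 1, 4]   # toy pitch-class counts
reference = [10, 1, 8, 2, 9, 6, 1, 10, 2, 7, 1, 3]
print(round(js_similarity(generated, reference), 4))
```

A score like this compares aggregate descriptor statistics only, which is one reason it can disagree with an embedding-based distance such as FMD.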
The experiments are rigorous, involving four open-weight models and a well-defined reference corpus of Baroque works. The results demonstrate that while LLMs can generate executable LilyPond in zero-shot settings, they struggle with structural understanding tasks. The systematic comparison of metrics highlights the importance of using multiple evaluation approaches to capture model performance accurately.
The authors provide a link to the GitHub repository containing the benchmark, prompt bank, and evaluation code, enhancing reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setups and configurations used for the models.
The paper acknowledges that structural understanding tasks remain challenging, indicating that while generation is feasible, deeper musical reasoning is not yet fully achievable. Additionally, the reliance on specific datasets may limit the generalizability of the findings.
This work has significant implications for the field of symbolic music generation and understanding, providing a standardized evaluation framework that can facilitate future research and development in this area. It opens avenues for improving LLM capabilities in music-related tasks, potentially influencing applications in music education, composition, and automated music analysis.
Using speaker embeddings as conditioning can strengthen speech enhancement, but most methods either require clean enrollment audio or rely on embeddings extracted from noisy speech, which are fragile under noise and domain shift. We propose G-MaP-SE, a guided enhancement framework that builds a clean-speech embedding prior with a Gaussian Mixture Model (GMM) and refines a noisy conditioning embedding by matching it to this prior. The matched prior embedding is then injected into a time-frequency enhancement backbone via a lightweight gated fusion module. Experiments on VoiceBank+DEMAND and DNS Challenge 2020 datasets show that the proposed prior matching consistently outperforms noisy conditioning and substantially narrows the gap to an oracle clean-conditioning upper bound, while requiring no enrollment audio at inference time. The code, audio samples, and checkpoint are available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
The main contribution of this paper is the introduction of G-MaP-SE, a guided speech enhancement framework that effectively refines noisy conditioning embeddings using a GMM-based prior matching approach, leading to improved robustness and performance in challenging acoustic environments. This work represents a meaningful advancement in the field of speech enhancement, addressing critical limitations of existing methods while providing a practical solution for real-world applications.
The proposed G-MaP-SE framework introduces a novel approach to speech enhancement by utilizing a Gaussian Mixture Model (GMM) for prior matching of noisy speaker embeddings. This method addresses the common limitations of existing techniques that rely on clean enrollment audio or noisy embeddings, which can be unreliable. The integration of a lightweight gated fusion module allows for effective incorporation of the refined embeddings into the enhancement backbone, showcasing a well-thought-out design that balances complexity and performance. The methodology is sound, leveraging established statistical techniques while innovatively applying them to the problem of speech enhancement.
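The review does not spell out the matching rule, so the sketch below shows only one plausible form of GMM prior matching: compute the posterior responsibility of each clean-speech component for the noisy embedding, then replace the embedding with the responsibility-weighted average of the component means. The diagonal covariances, the soft-assignment rule, and all names are assumptions rather than the authors' method.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, weights, means, variances):
    """Posterior probability of each GMM component given embedding x."""
    log_post = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_post)  # subtract the max for numerical stability
    unnorm = [math.exp(lp - peak) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def match_to_prior(x, weights, means, variances):
    """Assumed matching rule: responsibility-weighted mean of component means."""
    resp = responsibilities(x, weights, means, variances)
    return [sum(r * m[d] for r, m in zip(resp, means)) for d in range(len(x))]


# Toy prior with two clean-speech clusters in a 2-D embedding space.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
noisy_embedding = [9.0, 11.0]
print(match_to_prior(noisy_embedding, weights, means, variances))
```

In this toy case the noisy embedding is pulled onto the nearby clean cluster. In the described framework, the matched embedding would then be passed to the gated fusion module instead of the raw noisy one.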
The experiments conducted on the VoiceBank+DEMAND and DNS Challenge 2020 datasets provide a robust evaluation of the proposed method. The results demonstrate significant improvements in performance metrics such as WB-PESQ and STOI, particularly in cross-domain scenarios, which is a critical aspect of real-world applicability. The use of both in-domain and cross-domain evaluations strengthens the validity of the findings. However, the paper could benefit from additional qualitative assessments, such as user studies or perceptual evaluations, to complement the objective metrics reported.
The paper provides sufficient implementation details, including the architecture used, hyperparameter settings, and the datasets employed. The availability of code and audio samples on GitHub enhances reproducibility, allowing other researchers to validate and build upon the work. However, further clarification on the training process and specific configurations for different experiments would improve reproducibility.
One notable limitation is the reliance on the quality of the GMM prior, which may be affected by the diversity of the clean speech corpus used for training. The performance gains observed in cross-domain evaluations suggest that the method may not generalize equally well across all potential noise conditions and speaker variations. Additionally, the paper acknowledges that the matching process may not fully recover the underlying clean embedding in some cases, indicating room for improvement in the matching algorithm.
The implications of this research are significant for applications in real-time communication systems, hearing aids, and voice-controlled devices, where speech clarity is paramount. By improving the robustness of speech enhancement under various noise conditions without requiring clean enrollment audio, this work can enhance user experience in everyday environments. The method's adaptability to different domains also opens avenues for further research in personalized speech enhancement systems.
AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce CoMoE, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Yonsei University, MAAP Lab
The paper presents a controlled study of cross-generator AI-generated music detection, introducing a novel approach to evaluating the impact of different audio token spaces on detection performance. This work is significant as it not only advances the understanding of AI-generated music detection but also provides a framework for future research in the field.
The paper introduces CoMoE, a novel compact classifier designed to evaluate the impact of different audio tokenizers on AI-generated music detection. The methodology is robust, leveraging a controlled experimental setup where only the input token space varies while keeping the downstream architecture constant. This design allows for a clear isolation of the effects of different tokenizers on detection performance, which is a significant advancement in the field. The use of codec-style discrete audio tokens as a primary experimental axis is innovative, as it shifts the focus from traditional continuous representations to a more forensic approach that may yield better insights into the characteristics of AI-generated music.
The experiments are well-structured, utilizing the MoM-open dataset, which is a thoughtful reconstruction of existing benchmarks. The authors provide a comprehensive analysis of the performance of various tokenizers under different evaluation conditions, revealing critical insights into the robustness of AI-generated music detectors. The results indicate that the choice of tokenizer significantly affects detection performance, particularly under generator shift scenarios. The evaluation metrics used, including AUC and held-out-fake detection rates, are appropriate for the task and provide a clear picture of the model's performance across different conditions.
The paper includes sufficient details regarding the architecture of CoMoE, the training protocols, and the datasets used, which enhances reproducibility. The authors have made their code and data publicly available, which is a positive aspect for the community and encourages further exploration and validation of their findings.
The study acknowledges limitations, such as the reliance on an open reconstruction of the MoM-CLAM dataset and potential biases introduced by the X-Codec mini tokenizer. Additionally, the authors suggest that future work should explore more generator sources and control training-pool sizes, indicating areas where the current study could be expanded or improved.
The findings of this research have significant implications for the field of AI-generated music detection, particularly in improving the robustness of detection systems against unseen generator sources. By emphasizing the importance of tokenizer choice, this work could influence future research directions and methodologies in music AI, potentially leading to more reliable detection systems that can be deployed in real-world applications.
Video-to-audio (V2A) generation must jointly satisfy audiovisual alignment, semantic consistency, temporal synchronization, and perceptual quality. While prior work has mainly focused on model architecture, multimodal conditioning, and training objectives, inference-time alignment for V2A remains underexplored. In this paper, we study inference-time alignment for flow-matching-based V2A generation and formulate it as a search problem. We propose Sequential Monte Carlo Inference-Time Alignment (SMC-ITA), which combines lookahead-based reward estimation and sequential Monte Carlo resampling to reallocate computation adaptively using multi-dimensional cross-modal rewards. SMC-ITA improves over naive single-trajectory sampling, achieving a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. Under matched NFE budgets, it also achieves the best overall trade-off among the compared search baselines, outperforming Best-of-N and Beam Search. Ablation studies further show that lookahead improves the reliability of intermediate reward estimates and that systematic resampling is a strong practical default for V2A inference-time alignment.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, University of Rochester, Independent
The main contribution of this paper is the introduction of SMC-ITA, a novel method for inference-time alignment in video-to-audio generation that significantly improves the quality and synchronization of generated audio through adaptive computation and multi-dimensional reward optimization. This work stands out for its innovative approach to a relatively underexplored area in multimodal generation, providing a strong foundation for future research and applications in the field.
The proposed Sequential Monte Carlo Inference-Time Alignment (SMC-ITA) method addresses the challenge of inference-time alignment in video-to-audio generation by framing it as a search problem. It innovatively combines lookahead-based reward estimation with sequential Monte Carlo resampling, allowing for adaptive computation reallocation based on multi-dimensional cross-modal rewards. This approach effectively mitigates the noise in early-stage reward signals, which is a significant improvement over traditional single-trajectory sampling methods. The methodology is well-structured, with clear definitions of the reward functions and a robust framework for trajectory evaluation and resampling.
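The resampling step named above can be illustrated with the standard systematic resampling algorithm. The sketch below is generic and not tied to the authors' code: converting rewards into weights through a temperature-scaled softmax is an assumption, and the lookahead reward estimation is not modeled here.

```python
import math
import random


def rewards_to_weights(rewards, temperature=1.0):
    """Assumed weighting: softmax of rewards divided by a temperature."""
    peak = max(rewards)
    unnorm = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def systematic_resample(weights, offset=None, rng=random):
    """Return the indices of the particles that survive systematic resampling.

    A single uniform offset in [0, 1/n) places n evenly spaced pointers on the
    cumulative weight distribution; each pointer selects one particle.
    """
    n = len(weights)
    total = float(sum(weights))
    if offset is None:
        offset = rng.random() / n
    indices = []
    j = 0
    cumulative = weights[0] / total
    for i in range(n):
        position = offset + i / n
        # Advance to the first particle whose cumulative weight covers the pointer.
        while position > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        indices.append(j)
    return indices


# Four partial trajectories with toy reward estimates (for example, alignment scores).
weights = rewards_to_weights([2.0, 0.1, 0.3, 0.2], temperature=0.5)
print(systematic_resample(weights, offset=0.125))
```

High-weight trajectories are duplicated and low-weight ones are dropped, which is how computation is reallocated toward promising samples. Because only one random number is drawn per step, systematic resampling has lower variance than drawing each index independently.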
The experiments are comprehensive, utilizing a well-defined dataset (VGGSound) and comparing SMC-ITA against several strong baselines, including naive sampling, Best-of-N, and Beam Search. Relative to naive single-trajectory sampling, the results show a 55.67% relative reduction in DeSync, a 20.23% improvement in IB-score, and a 15.44% improvement in Audio Quality. The ablation studies further validate the contributions of individual components, such as the lookahead strategy and the resampling method, reinforcing the robustness of the findings.
The paper provides sufficient details regarding the experimental setup, including the model architecture, evaluation metrics, and specific configurations used for each method. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider sharing the implementation to facilitate further research and validation by the community.
One limitation is the reliance on a fixed NFE budget, which may not be optimal for all scenarios. The paper also does not explore the scalability of SMC-ITA beyond the tested configurations, nor does it address potential computational overhead introduced by the lookahead strategy. Additionally, while the subjective evaluation shows promise, the sample size for human evaluation is relatively small, which may affect the generalizability of the findings.
The advancements in inference-time alignment for video-to-audio generation have significant implications for multimedia applications, including content creation, accessibility for the hearing impaired, and automated media production. By improving the alignment and quality of generated audio, this research can enhance user experiences in various domains, from entertainment to education.
We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.
Primary: OriginAI
All Institutions: Independent, OriginAI
This paper presents a novel voice conversion framework that effectively utilizes non-parallel data for zero-shot conversion, significantly advancing the state-of-the-art in the field. The combination of innovative methodology and strong experimental validation positions this work as a meaningful contribution to the audio processing community.
The methodology introduces a novel palindromic training framework for voice conversion that leverages KNN retrieval to synthesize training pairs from non-parallel data. This approach is innovative as it allows for zero-shot voice conversion without requiring aligned datasets, which is a significant limitation in traditional voice conversion methods. The incorporation of a speaker verification loss to enforce speaker identity consistency is a strong addition, enhancing the robustness of the conversion process. The use of WavLM representations and a transformer-based architecture is well-justified, and the overall design is coherent and well-structured.
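The retrieval step can be pictured with a small sketch of frame-level nearest-neighbour matching. This is an assumption-laden illustration, not the paper's pipeline: the names `query_feats` and `pool_feats`, the use of cosine similarity, the averaging of the top `k` matches, and the value of `k` are all hypothetical, and random arrays stand in for real WavLM features.

```python
import numpy as np

def knn_retrieve(query_feats, pool_feats, k=4):
    """Replace each query frame by the mean of its k nearest pool frames."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    sim = q @ p.T                              # (n_query, n_pool)
    # Indices of the k most similar pool frames for every query frame.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    return pool_feats[top_k].mean(axis=1)      # (n_query, dim)

# Stand-in features (hypothetical shapes; real WavLM frames would be used instead).
rng = np.random.default_rng(0)
synthetic = knn_retrieve(rng.normal(size=(50, 16)), rng.normal(size=(200, 16)))
```

The output keeps the timing of the query utterance while being built entirely from frames of the pool, which is what makes retrieved sequences usable as one side of a training pair without parallel recordings.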
The experiments are comprehensive, utilizing both objective and subjective evaluation metrics, including Speaker Similarity, Equal Error Rate (EER), Word Error Rate (WER), and Mean Opinion Scores (MOS). The results demonstrate that the proposed method outperforms existing state-of-the-art systems in speaker similarity while maintaining comparable intelligibility and naturalness. The multilingual evaluation further strengthens the findings, showcasing the model's generalization capabilities across languages without fine-tuning.
The paper provides sufficient details regarding the training procedure, model architecture, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
One limitation is the reliance on a pretrained speaker verification model, which may introduce biases based on the model's training data. Additionally, while the framework shows promise in multilingual settings, the lack of fine-tuning on non-English data may affect performance in real-world applications where language diversity is greater.
The proposed framework has significant implications for applications in voice synthesis, dubbing, and personalized voice assistants, particularly in scenarios where parallel data is scarce. The ability to perform zero-shot voice conversion could democratize access to advanced voice technologies, making them more widely applicable across different languages and dialects.
Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.
Primary: Laboratoire Informatique d’Avignon
All Institutions: Laboratoire Informatique d’Avignon, EURECOM
This paper presents a novel approach to mitigating speaker bias in spoofing detection systems through a teacher-student framework with controlled adversarial training. The methodology is well-founded, and the experimental results indicate a meaningful advancement in the field, particularly in terms of generalization across diverse datasets.
The proposed methodology leverages a teacher-student framework to address speaker bias in spoofing detection systems. The integration of a Gradient Reversal Layer (GRL) and a Variational Information Bottleneck (VIB) is innovative, allowing the model to suppress speaker-specific information while retaining crucial spoofing-related features. This approach is well-justified and effectively addresses the identified limitations of existing methods that rely on speaker labels, which are often unavailable or inconsistent across datasets. The use of a pre-trained speaker recognition model as a teacher is a strong methodological choice, enhancing the robustness of the student model.
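To make the role of the bottleneck concrete, here is a minimal sketch of the standard Variational Information Bottleneck ingredients: a reparameterised Gaussian sample and the closed-form KL divergence to a standard normal prior. The function names, the diagonal-Gaussian parameterisation, and the weight `beta` are assumptions for illustration; the paper's actual loss, its weighting, and its coupling with the gradient reversal layer are not reproduced here.

```python
import math
import random

def vib_sample_and_kl(mu, log_var, rng=random):
    """Sample z ~ N(mu, diag(exp(log_var))) and return (z, KL to N(0, I))."""
    # Reparameterisation: z = mu + sigma * eps with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mu, log_var)]
    # Closed-form KL between a diagonal Gaussian and the standard normal.
    kl = 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                   for m, lv in zip(mu, log_var))
    return z, kl

def total_loss(task_loss, kl, beta=1e-3):
    """Task loss plus a beta-weighted compression penalty (beta is hypothetical)."""
    return task_loss + beta * kl
```

A larger `beta` compresses the latent code more aggressively, which in this setting would suppress more identity information at the risk of also discarding spoofing cues; a smaller `beta` does the opposite.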
The experimental evaluation is thorough, utilizing nine different datasets to assess the model's performance across various conditions. The reported results show a 25.7% relative reduction in Equal Error Rate (EER) compared to the MHFA baseline, highlighting the effectiveness of the proposed approach. The paper provides detailed comparisons with existing methods, showcasing the improvements in out-of-domain generalization, which is a critical aspect of spoofing detection.
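For readers less familiar with the metric, the sketch below shows one simple way to approximate an Equal Error Rate from detection scores, and how a relative reduction is computed. It assumes higher scores mean "more likely bona fide" and uses a plain threshold sweep; official evaluation toolkits may interpolate differently, and all numbers in the example are made up.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER; higher scores are assumed to mean 'more bona fide'."""
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# Made-up scores and error rates, for illustration only.
eer = equal_error_rate([0.8, 0.6, 0.3], [0.7, 0.2, 0.1])
drop = relative_reduction(4.0, 3.0)
```

A relative reduction is measured against the baseline's own error, so the same percentage can correspond to very different absolute EER changes depending on how strong the baseline is.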
The paper includes sufficient implementation details, such as the architecture of the models, training protocols, and data augmentation techniques. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work could benefit from sharing the model and training code to facilitate further research in this area.
One notable limitation is the reliance on a specific dataset (ASVspoof 5) for training, which may not fully represent the diversity of real-world scenarios. Additionally, while the VIB helps manage the trade-off between speaker invariance and discriminability, the optimal balance between these two objectives may vary across different applications, which could affect generalization.
The findings of this research have significant implications for the security of voice biometric systems, particularly as generative speech technologies become more prevalent. By improving the robustness of spoofing detection systems, this work contributes to enhancing the reliability of voice authentication methods, which are increasingly used in various applications, including banking, personal devices, and security systems.
Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.
Primary: University of Science, Vietnam National University
All Institutions: University of Science, Vietnam National University
The main contribution of this paper is the introduction of TinyGiantALM, a compact audio-language model that effectively balances efficiency and reasoning depth, making it suitable for deployment in resource-constrained environments. This work represents a significant advancement in the field of audio reasoning, providing a viable alternative to larger models while maintaining competitive performance.
The proposed TinyGiantALM model introduces a novel architecture that emphasizes efficiency in audio-language processing through an Instruction-Aware Feature Refinement framework. The use of a Query-guided Projector and Semantic Gating is innovative, as it allows the model to filter acoustic signals based on user intent rather than relying solely on parameter scaling. This approach is particularly relevant for resource-constrained environments, making the methodology both practical and forward-thinking. The integration of multiple streams for feature extraction and the dynamic modulation of features based on user queries demonstrates a sophisticated understanding of the challenges in audio reasoning.
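The idea of filtering acoustic features by user intent can be pictured with a generic query-conditioned gate. The sketch below is purely illustrative and is not TinyGiantALM's architecture: the function name, the concatenate-then-project design, the sigmoid gate, and the weight matrix `w_gate` are all hypothetical stand-ins.

```python
import numpy as np

def query_gated_features(audio_feats, query_vec, w_gate):
    """Scale each audio frame by a sigmoid gate conditioned on the query."""
    n_frames = audio_feats.shape[0]
    # Attach the same query embedding to every frame.
    tiled = np.broadcast_to(query_vec, (n_frames, query_vec.shape[0]))
    joint = np.concatenate([audio_feats, tiled], axis=1)
    # One gate value in (0, 1) per frame and per audio feature dimension.
    gate = 1.0 / (1.0 + np.exp(-(joint @ w_gate)))
    return gate * audio_feats

# Hypothetical sizes: 10 frames, 8 audio dims, 4 query dims.
rng = np.random.default_rng(0)
gated = query_gated_features(rng.normal(size=(10, 8)),
                             rng.normal(size=4),
                             rng.normal(size=(12, 8)))
```

In a gate of this kind, the same audio clip yields different downstream features for different questions, which is the general intuition behind conditioning perception on the instruction rather than scaling the model.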
The experiments conducted on the MMAR benchmark are robust, showcasing the performance of TinyGiantALM against larger models. Achieving a zero-shot accuracy of 46.4% is commendable, especially considering the model's compact size of 1.5B parameters. The paper provides a thorough comparison with various baselines, including models with significantly more parameters, which highlights the effectiveness of the proposed architecture. The inclusion of an ablation study further strengthens the evaluation by isolating the contributions of different components of the model.
The implementation details are well-documented, including specifics on the training strategy, dataset usage, and hardware requirements. The use of publicly available datasets and the clear description of the training process enhance the reproducibility of the results. However, the absence of a publicly accessible code repository limits the ability of other researchers to fully replicate the findings.
While the model shows promise, the authors acknowledge a reasoning gap in logical narrative tasks compared to larger models. Additionally, there are trade-offs in performance when dealing with overly dense or spatial scenes. These limitations suggest that while TinyGiantALM is a step forward, there are still challenges to overcome in achieving parity with larger architectures in all scenarios.
The development of TinyGiantALM has significant implications for deploying audio reasoning models in resource-constrained environments such as mobile devices and edge computing. Its efficiency could enable broader access to advanced audio processing capabilities, facilitating applications in various fields, including assistive technologies, smart home devices, and interactive media.
Numerous machine learning-based sound field interpolation methods have been proposed. In particular, physics-informed neural networks (PINNs) can accurately interpolate sound fields from a small number of microphones. However, their high computational cost and long training time pose practical challenges for applications requiring real-time processing or online learning. To address this, we propose a hybrid framework that combines PINN-based pre-training with a physics-informed extreme learning machine (PIELM) tailored for acoustic fields. By replacing iterative PINN fine-tuning for each target sound field with closed-form output-layer adaptation using hidden-layer weights pre-trained by PINN, the proposed method efficiently interpolates unknown sound fields from limited observations. Simulation results under simplified one-dimensional free-field conditions demonstrate that, given a pre-trained model, the proposed method achieves interpolation accuracy comparable to that of PINN-based fine-tuning while reducing the adaptation time by more than three orders of magnitude.
Primary: Tokyo Denki University
All Institutions: Tokyo Denki University, Research Institute for Science and Technology of Tokyo Denki University
The main contribution of this paper is the development of a physics-informed extreme learning machine (PIELM) that utilizes PINN-based pre-training to achieve efficient sound field interpolation. This work represents a significant advancement in reducing computational costs while maintaining accuracy, which is essential for real-time applications in audio processing.
The proposed methodology effectively integrates physics-informed neural networks (PINNs) with extreme learning machines (ELMs) to create a hybrid model that addresses the computational inefficiencies of traditional PINNs in sound field interpolation. The use of PINN-based pre-training to stabilize hidden-layer weights is innovative, allowing for rapid adaptation to new sound fields. The closed-form output-layer adaptation significantly reduces training time while maintaining accuracy, which is a notable advancement in the field.
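The closed-form adaptation can be sketched as a single linear least-squares solve over the output weights, with the hidden layer held fixed. The example below is a simplified stand-in under several assumptions: random hidden weights replace the PINN-pre-trained ones, the field is real-valued and obeys the 1D Helmholtz equation u'' + k^2 u = 0, and the names `pielm_fit`, `lam`, and the chosen sizes are hypothetical.

```python
import numpy as np

def hidden(x, w, b):
    """tanh hidden features and their second derivative with respect to x."""
    t = np.tanh(np.outer(x, w) + b)              # (n_points, n_hidden)
    d2 = -2.0 * t * (1.0 - t**2) * w**2          # d^2/dx^2 of tanh(w x + b)
    return t, d2

def pielm_fit(x_obs, p_obs, x_col, w, b, k, lam=1.0):
    """Output weights from data rows plus Helmholtz-residual rows."""
    h_obs, _ = hidden(x_obs, w, b)
    h_col, d2_col = hidden(x_col, w, b)
    physics = lam * (d2_col + k**2 * h_col)      # residual of u'' + k^2 u = 0
    a = np.vstack([h_obs, physics])
    y = np.concatenate([p_obs, np.zeros(len(x_col))])
    beta, *_ = np.linalg.lstsq(a, y, rcond=None)  # one linear solve, no iterations
    return beta

def nmse(pred, true):
    """Normalised mean squared error."""
    return np.sum((pred - true) ** 2) / np.sum(true ** 2)

# Toy setup: random hidden weights stand in for PINN-pre-trained ones.
rng = np.random.default_rng(0)
k = 2.0 * np.pi
w, b = rng.normal(0.0, 4.0, 64), rng.uniform(-4.0, 4.0, 64)
x_obs = np.linspace(0.0, 1.0, 8)                 # a few "microphone" positions
p_obs = np.cos(k * x_obs)
beta = pielm_fit(x_obs, p_obs, np.linspace(0.0, 1.0, 50), w, b, k)
x_test = np.linspace(0.0, 1.0, 200)
err = nmse(hidden(x_test, w, b)[0] @ beta, np.cos(k * x_test))
```

Since only `beta` changes between target sound fields, adapting to a new field costs one least-squares solve instead of a gradient-descent run, which is the source of the speed-up the review describes; how accurate the result is depends on how well the fixed hidden features suit the field.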
The experiments are well-structured, comparing the proposed method against baseline models, including traditional ELMs and PINNs. The results demonstrate a substantial improvement in adaptation time and accuracy, with clear metrics (NMSE) provided for evaluation. However, the experiments are limited to one-dimensional scenarios, which may not fully capture the complexities of real-world applications.
The paper provides detailed descriptions of the experimental setup, including hyperparameters and training conditions. However, the absence of a publicly available code repository limits reproducibility, as others cannot easily replicate the results or build upon the work.
The primary limitation is the reliance on one-dimensional simulations, which may not generalize to more complex, multi-dimensional sound fields. Additionally, the method's susceptibility to noise at lower SNR levels indicates a need for further robustness improvements. The lack of a demo or project URL also hinders practical application and exploration by the community.
This research has significant implications for real-time sound field applications, such as active noise control and augmented reality, where rapid and accurate sound field interpolation is crucial. The hybrid approach could pave the way for more efficient machine learning models in acoustics and related fields, potentially influencing future research directions.