Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio clips paired with chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a notable advance in explainable deepfake speech detection.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of CoLMbo-DF, a novel audio language model that integrates structured acoustic features for improved deepfake speech detection and interpretability. This work represents a significant advancement in the field, addressing critical gaps in existing methodologies and providing a foundation for future research in explainable AI and audio analysis.
The proposed methodology introduces CoLMbo-DF, which innovatively integrates structured acoustic features into a language model framework for deepfake detection. By employing a feature-guided approach that grounds reasoning in explicit acoustic evidence, the authors effectively address the limitations of existing models that primarily rely on latent embeddings. The incorporation of chain-of-thought reasoning adds a layer of interpretability, which is crucial for understanding model decisions in deepfake detection. The methodology is well-structured and demonstrates a clear progression from problem identification to solution development.
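The paper's prompt template is not reproduced in this review, but the idea of grounding an LM's reasoning in structured textual acoustic evidence can be sketched as follows. The feature names, extraction choices, and prompt wording below are illustrative assumptions, with simple RMS energy and zero-crossing rate standing in for the richer prosodic, spectral, and physiological attributes the authors describe:

```python
import numpy as np

def extract_acoustic_summary(wave: np.ndarray, sr: int = 16000) -> dict:
    """Toy stand-ins for structured acoustic features; a real system
    would use dedicated pitch trackers and spectral analyzers."""
    frame = sr // 100  # 10 ms frames
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)
    rms = np.sqrt((frames ** 2).mean(axis=1))           # energy contour
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)  # crude spectral proxy
    return {
        "rms_mean": float(rms.mean()),
        "rms_std": float(rms.std()),
        "zcr_mean": float(zcr.mean()),
    }

def build_feature_guided_prompt(features: dict, question: str) -> str:
    """Inject the structured evidence as text ahead of the question,
    so the LM can cite it in its chain-of-thought."""
    evidence = "; ".join(f"{k}={v:.4f}" for k, v in sorted(features.items()))
    return f"Acoustic evidence: {evidence}\nQuestion: {question}\nReason step by step."

wave = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
prompt = build_feature_guided_prompt(
    extract_acoustic_summary(wave), "Is this speech genuine or synthetic?"
)
```

The point of the design is that the LM's chain-of-thought can quote concrete numbers from the evidence string rather than reasoning over opaque latent embeddings alone.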
The experimental section is robust, showcasing a new dataset of audio pairs with chain-of-thought annotations, which is a significant contribution in itself. The results indicate that CoLMbo-DF outperforms existing baselines, even when trained on a smaller scale model. However, the paper could benefit from a more detailed comparison with a wider range of existing methods and metrics to fully validate the claims of superiority. The evaluation metrics should ideally include both subjective and objective measures to comprehensively assess the model's performance.
The paper lacks detailed implementation specifics that would aid in reproducibility. While the methodology is sound, the absence of code or supplementary materials limits the ability of other researchers to replicate the results. Providing a GitHub repository or supplementary materials with code and data would significantly enhance reproducibility.
One limitation is the reliance on a specific dataset that may not generalize well to all types of deepfake speech. Additionally, while the model improves interpretability, the complexity of integrating structured acoustic features may pose challenges in real-world applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The implications of this research are substantial, particularly in the context of misinformation and digital security. By enhancing deepfake detection systems with interpretable reasoning, the work contributes to the development of more reliable tools for combating audio-based deception. The approach could also be extended to other domains requiring audio analysis and reasoning, such as voice recognition and sentiment analysis.
We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
Primary: Meituan LongCat Team
All Institutions: Meituan LongCat Team
LongCat-AudioDiT presents a significant advancement in text-to-speech synthesis through its innovative approach in the waveform latent space and the introduction of adaptive projection guidance. The comprehensive experimental results and the release of code and model weights contribute to its potential impact on the field, although further exploration of its limitations and broader applicability is warranted.
The methodology presented in LongCat-AudioDiT is innovative, particularly in its non-autoregressive diffusion-based approach to text-to-speech synthesis. By operating directly in the waveform latent space rather than relying on intermediate representations like mel-spectrograms, the authors have simplified the TTS pipeline significantly. The introduction of adaptive projection guidance to replace traditional classifier-free guidance is a noteworthy advancement that enhances generation quality. The paper also addresses a critical training-inference mismatch, showcasing a thoughtful approach to improving model performance. Overall, the methodology is robust and well-structured, with clear innovations that set it apart from existing models.
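The paper's adaptive projection guidance is not specified in this review, but a common projection-based refinement of classifier-free guidance (CFG) decomposes the guidance direction into components parallel and orthogonal to the conditional prediction and downweights the parallel part, which is associated with over-saturation at high guidance scales. The sketch below illustrates that general idea and should not be read as LongCat-AudioDiT's exact formulation:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # Classifier-free guidance: extrapolate along the conditional direction.
    return eps_uncond + w * (eps_cond - eps_uncond)

def projected_guidance(eps_cond, eps_uncond, w, eta=0.0):
    # Split the guidance direction d into components parallel and
    # orthogonal to the conditional prediction, then downweight the
    # parallel part with eta in [0, 1]. eta=1 recovers plain CFG.
    d = eps_cond - eps_uncond
    c = eps_cond / (np.linalg.norm(eps_cond) + 1e-8)
    parallel = (d @ c) * c
    orthogonal = d - parallel
    return eps_cond + (w - 1) * (eta * parallel + orthogonal)
```

With `eta=1` the two functions coincide; smaller `eta` keeps the guidance push mostly orthogonal to the conditional prediction.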
The experimental evaluation is thorough, with the authors providing comprehensive results that demonstrate the effectiveness of LongCat-AudioDiT. The paper reports state-of-the-art performance on the Seed benchmark for zero-shot voice cloning, with significant improvements in speaker similarity scores. The use of ablation studies to validate the proposed modules adds credibility to the findings. However, the absence of high-quality human-annotated datasets may limit the generalizability of the results, although the authors mitigate this by achieving competitive intelligibility.
The authors mention that code and model weights are released, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed implementation guidelines and hyperparameter settings to facilitate easier replication of the results by other researchers.
One limitation identified is the reliance on a single benchmark (Seed) for evaluation, which may not fully capture the model's performance across diverse TTS tasks. Additionally, the findings regarding the Wav-VAE's reconstruction fidelity not correlating with TTS performance could indicate a need for further exploration into the underlying mechanisms affecting performance.
The potential applications of LongCat-AudioDiT are significant, particularly in areas requiring high-fidelity speech synthesis, such as virtual assistants, audiobooks, and voice cloning technologies. The model's ability to operate without complex multi-stage training pipelines could democratize access to high-quality TTS systems, fostering innovation in various industries.
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA) together with an accompanying evaluation suite, AHA-Eval, comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose a 120K QA post-alignment dataset, AHA-Guard, which successfully reduces attack success rates by up to 49%.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park
The paper introduces Audio Hallucination Attacks (AHA), a framework for evaluating audio hallucinations in LALMs through innovative query-based and audio-based attack methodologies. This work is significant as it not only identifies critical vulnerabilities in state-of-the-art models but also proposes effective mitigation strategies, paving the way for more reliable audio-language models in real-world applications.
The methodology is robust, introducing a novel attack suite (AHA-Eval) that effectively evaluates the reliability of Large Audio Language Models (LALMs) through a systematic approach. The dual focus on query-based and audio-based attacks is particularly insightful, allowing for a comprehensive assessment of model vulnerabilities. The data curation and filtering process is well-structured, ensuring high-quality inputs for the evaluation. The use of LLMs for generating hallucinated sounds and the distinction between explicit and implicit queries are innovative contributions that enhance the depth of the analysis.
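The suite's exact templates are not given here, but a query-based attack of the kind described — asking about a sound absent from the clip, either explicitly or via a presupposing (implicit) question — could be generated along these lines. The phrasing, vocabulary, and sampling scheme are hypothetical:

```python
import random

def make_query_attacks(present: set, vocabulary: set, rng=None) -> dict:
    """Sample a sound event absent from the clip and phrase both attack
    styles. Templates are illustrative; the paper's phrasing may differ."""
    rng = rng or random.Random(0)
    absent = rng.choice(sorted(vocabulary - present))
    return {
        "explicit": f"Did you hear a {absent} in the audio?",  # grounded answer: "no"
        "implicit": f"At what time does the {absent} occur?",  # presupposes presence
        "target": absent,
    }
```

An attack counts as successful when the model affirms or localizes the absent event instead of rejecting the premise, which is how the high attack success rates reported above are scored.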
The experimental setup is thorough, evaluating multiple state-of-the-art LALMs and providing clear metrics for attack success rates. The results demonstrate significant vulnerabilities in these models, with high ASR values indicating a pressing need for improved grounding mechanisms. The comparison of mitigation strategies, particularly the effectiveness of AHA-Guard, is a valuable addition that highlights practical implications for enhancing model reliability.
The paper provides sufficient detail regarding the experimental setup, including model selection and training procedures, which aids reproducibility. However, the absence of publicly accessible datasets or code limits the ease with which other researchers can replicate the study. Future work should consider releasing the datasets and methodologies used for generating AHA-Eval and AHA-Guard.
One limitation is the reliance on specific LALMs for generating hallucinated sounds, which may not generalize across all audio-language models. Additionally, while the evaluation metrics are well-defined, the subjective nature of audio perception may introduce variability in human assessments that are not fully addressed. The paper also does not explore the long-term implications of these vulnerabilities in real-world applications.
The findings have significant implications for the deployment of LALMs in practical applications, particularly in fields such as automated transcription, audio description, and interactive voice response systems. By highlighting the reliability gaps in these models, the research encourages the development of more robust audio grounding techniques, ultimately enhancing the safety and trustworthiness of AI systems in audio processing.
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
Primary: unknown
All Institutions: unknown
FineLAP presents a novel training paradigm that effectively combines heterogeneous supervision for fine-grained audio-language pretraining. The comprehensive methodology and robust experimental validation position it as a significant contribution to the field of audio understanding, with potential applications across diverse domains.
The methodology presented in FineLAP is innovative, addressing the challenge of heterogeneous supervision in audio-language models. The introduction of a dual-stream sigmoid loss and a decoupled audio projector allows for effective learning from both clip- and frame-level annotations. This approach is well-justified, as it leverages the strengths of existing models while introducing novel components that enhance performance across various tasks. The use of cluster-based sampling for negative phrases is particularly noteworthy, as it mitigates the scarcity of frame-level annotations and improves the model's ability to generalize.
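The precise form of FineLAP's dual-stream sigmoid loss is not reproduced in this review; the sketch below shows why a SigLIP-style pairwise sigmoid loss naturally accommodates two supervision streams — each (audio, text) pair is an independent binary decision, so clip-level and frame-level pairs can simply be mixed. The temperature `t`, bias `b`, and mixing weight `alpha` are illustrative assumptions:

```python
import numpy as np

def sigmoid_loss(sim, labels, t=10.0, b=-10.0):
    # Pairwise sigmoid loss: labels in {+1, -1}, sim are similarity scores.
    # No softmax over a batch, so pairs from heterogeneous sources coexist.
    z = labels * (t * sim + b)
    return np.mean(np.log1p(np.exp(-z)))

def dual_stream_loss(clip_sim, clip_lab, frame_sim, frame_lab, alpha=0.5):
    # Weighted sum of the clip-level and frame-level streams.
    return alpha * sigmoid_loss(clip_sim, clip_lab) + \
           (1 - alpha) * sigmoid_loss(frame_sim, frame_lab)
```

Because the two streams only meet through a weighted sum, abundant clip-level captions and scarce frame-level annotations can be consumed in the same training step.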
The experiments conducted are extensive and demonstrate the effectiveness of FineLAP across multiple audio understanding tasks, achieving state-of-the-art results. The evaluation includes a variety of benchmarks, and the ablation studies provide clear insights into the contributions of each component of the model. The results are compelling, showing significant improvements over existing methods, particularly in sound event detection and audio-text retrieval.
The paper provides sufficient implementation details, including training parameters and dataset descriptions, which are crucial for reproducibility. The authors also commit to releasing the code and dataset, which enhances the potential for other researchers to replicate and build upon their work.
Despite its strengths, FineLAP has limitations, such as its inability to handle variable-length audio inputs, which restricts its applicability in scenarios requiring long-form audio processing. Additionally, the focus on sound event detection may overlook other temporally grounded tasks, indicating areas for future exploration.
The advancements made in FineLAP have significant implications for audio understanding and multimodal learning, particularly in applications such as automated audio captioning, sound event detection, and audio editing. The model's ability to leverage heterogeneous data could lead to more robust and flexible audio-language systems, potentially benefiting various industries, including entertainment, accessibility, and security.
Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations, requiring no training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. On PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
Primary: College of Innovation and Technology, University of Michigan-Flint
All Institutions: College of Innovation and Technology, University of Michigan-Flint
The main contribution of this paper is the introduction of TRACE, a training-free framework for detecting partial audio deepfakes by analyzing the dynamics of speech foundation model embeddings. This work represents a significant advancement in audio forensics, offering a novel methodology that challenges traditional supervised detection approaches and opens new avenues for research in the field.
The proposed TRACE framework introduces a novel approach to detecting partial audio deepfakes without the need for training or labeled data. By analyzing the first-order dynamics of frozen speech foundation model representations, the methodology cleverly leverages the inherent properties of genuine speech versus manipulated audio. This is a significant departure from traditional supervised methods, showcasing a fresh perspective on audio forensics. However, the paper could benefit from a more detailed explanation of the embedding trajectory analysis and its computational efficiency.
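The embedding-trajectory analysis can be made concrete with a minimal sketch: compute the cosine distance between consecutive frozen-model frame embeddings (the first-order dynamics) and flag frames whose transition is an outlier. The z-score thresholding below is an assumption; TRACE's actual scoring rule may differ:

```python
import numpy as np

def splice_score(embeddings: np.ndarray) -> np.ndarray:
    # First-order dynamics: cosine distance between consecutive frame
    # embeddings. Genuine speech drifts smoothly; a splice boundary
    # produces an abrupt spike in this score.
    a, b = embeddings[:-1], embeddings[1:]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    )
    return 1.0 - cos

def detect_splice(embeddings: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    # Flag transitions whose score is an outlier relative to the clip.
    s = splice_score(embeddings)
    z = (s - s.mean()) / (s.std() + 1e-8)
    return np.where(z > z_thresh)[0]  # indices of suspected boundaries
```

Nothing here is trained: the embeddings come from a frozen foundation model, and the detector is a statistic over their frame-to-frame transitions, which is what makes the approach training-free.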
The experiments are well-structured, evaluating TRACE on four benchmarks across two languages and using six different speech foundation models. The results demonstrate competitive performance against fine-tuned supervised baselines, particularly in challenging scenarios like LlamaPartialSpoof. However, the paper lacks comprehensive details on the datasets used, such as their sizes and the specific characteristics of the audio samples, which would enhance the understanding of the evaluation's robustness.
The paper does not provide sufficient details regarding the implementation of TRACE, such as the specific configurations of the speech foundation models used or the exact procedures for embedding trajectory analysis. This lack of detail may hinder reproducibility, as other researchers may struggle to replicate the results without clear guidelines or code availability.
One limitation is the reliance on the performance of existing speech foundation models, which may vary in quality and robustness. Additionally, while the training-free approach is innovative, it may not generalize well to all forms of audio manipulation beyond the tested benchmarks. The paper also does not address potential adversarial attacks against the proposed detection method.
The implications of TRACE are significant for the field of audio forensics, particularly in combating misinformation and enhancing the integrity of audio content. The training-free nature of the method could facilitate its adoption in real-world applications where rapid detection is critical, such as in media verification and security. However, further exploration of its applicability across diverse audio manipulation techniques is necessary.
MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F0 RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by 1.6x. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.
Primary: Chuzhou University
All Institutions: Chuzhou University, Yeshiva University
The main contribution of this paper is the introduction of MambaVoiceCloning, a novel TTS system that utilizes a fully state-space model conditioning path to improve efficiency and stability while eliminating traditional attention mechanisms. This work significantly advances the field of TTS by addressing key limitations in existing architectures, offering a promising direction for future research and applications.
The methodology of MambaVoiceCloning (MVC) is innovative in its approach to eliminate attention and recurrent layers in a text-to-speech (TTS) system, utilizing a fully state-space model (SSM) conditioning path. The integration of gated bidirectional Mamba text encoders, Temporal Bi-Mamba, and Expressive Mamba with AdaLN modulation showcases a significant architectural redesign that emphasizes efficiency and stability. The linear-time complexity and bounded activation memory are particularly noteworthy, as they address common issues in traditional TTS systems, such as memory pressure and drift in long sequences. The paper provides a clear explanation of the architecture and its components, supported by rigorous theoretical grounding.
The experimental evaluation is comprehensive, utilizing multiple datasets including LJSpeech, LibriTTS, VCTK, and CSS10, which allows for a robust assessment of MVC's performance across various conditions. The paper reports both subjective (MOS, CMOS) and objective metrics (F0 RMSE, MCD, WER), demonstrating statistically significant improvements over baseline models. The inclusion of long-form and cross-lingual evaluations further strengthens the findings, showcasing the model's generalization capabilities. However, while the improvements are statistically reliable, they are described as modest, indicating room for further enhancement.
The authors provide detailed implementation and training protocols, ensuring that the methodology can be reproduced. The use of a unified optimization schedule across all models and the provision of code on GitHub enhances reproducibility. However, the paper could benefit from more explicit details regarding hyperparameter tuning and the specific configurations used for each model.
The paper acknowledges limitations such as the focus on conditioning efficiency over fine-grained emotion control, and the model's training solely on English datasets, which may affect its performance in multilingual contexts. Additionally, the diffusion decoder remains the primary latency bottleneck, which could hinder real-time applications.
The MVC framework has potential implications for real-time TTS applications, particularly in scenarios requiring efficient memory usage and low latency. Its architecture could serve as a drop-in replacement for existing TTS systems, enhancing their deployability in resource-constrained environments. The focus on ethical considerations, such as watermarking and speaker consent, is commendable and highlights the responsible deployment of AI technologies.
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations in interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations in current frameworks' ability to replicate the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.
Primary: Unknown
All Institutions: Unknown
This paper presents a significant advancement in multimodal dialogue generation by introducing a comprehensive dataset and evaluation framework that enhances controllability and expressiveness. The methodology and experimental results provide valuable insights into the challenges of replicating human interaction in AI-generated dialogue, paving the way for future research in this area.
The paper introduces a novel multimodal dialogue annotation pipeline that curates dialogues from movies and TV series with fine-grained annotations. This approach is significant as it addresses the limitations of existing datasets in terms of expressiveness and diversity. The methodology for generating the MM-Dia dataset and the MM-Dia-Bench testbed is well-articulated, focusing on both explicit and implicit cross-modal control. However, the paper could benefit from a more detailed explanation of the annotation process and the specific criteria used for dialogue selection.
The experiments conducted demonstrate the effectiveness of the MM-Dia dataset in enhancing controllability in multimodal dialogue generation. The evaluation metrics used, while not explicitly detailed in the abstract, are crucial for assessing the performance of the proposed models. The results indicate that current frameworks struggle to replicate the nuanced expressiveness of human interaction, highlighting an important area for future research. However, the paper could improve by providing more comprehensive quantitative results and comparisons with baseline models.
The paper does not provide sufficient details on the implementation of the models or the datasets used, which raises concerns about reproducibility. Clearer guidelines or links to supplementary materials would enhance the ability of other researchers to replicate the findings.
One significant limitation is the reliance on dialogue from movies and TV series, which may not fully capture the diversity of real-world interactions. Additionally, the paper acknowledges limitations in current frameworks to replicate human expressiveness, suggesting that further work is needed to bridge this gap.
The findings of this research have the potential to significantly impact the field of multimodal dialogue systems, particularly in applications such as virtual assistants, interactive storytelling, and entertainment. By improving controllability and expressiveness in dialogue generation, this work could lead to more engaging and human-like interactions in AI systems.
Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs with chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a notable advance in explainable deepfake speech detection.
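The feature-injection idea can be sketched as a prompt builder that serializes low-level acoustic measurements into text before they reach the language model. All function names, feature keys, and prompt wording below are hypothetical illustrations, not the paper's actual format:

```python
def build_feature_prompt(features, question="Is this audio real or synthetic? Reason step by step."):
    """Serialize low-level acoustic measurements into structured prompt text.

    `features` maps descriptor names to (value, unit) pairs, e.g. prosodic
    (mean pitch), spectral (centroid), and physiological (jitter) attributes.
    The descriptor names and formatting are illustrative assumptions."""
    lines = ["Acoustic evidence:"]
    for name, (value, unit) in features.items():
        lines.append(f"- {name}: {value:.2f} {unit}".rstrip())
    lines.append(question)
    return "\n".join(lines)
```

Grounding the prompt in explicit measurements like this is what lets the model's chain-of-thought cite concrete evidence rather than opaque embedding activations.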
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of CoLMbo-DF, a novel audio language model that integrates structured acoustic features for improved deepfake speech detection and interpretability. This work represents a significant advancement in the field, addressing critical gaps in existing methodologies and providing a foundation for future research in explainable AI and audio analysis.
The proposed methodology introduces CoLMbo-DF, which innovatively integrates structured acoustic features into a language model framework for deepfake detection. By employing a feature-guided approach that grounds reasoning in explicit acoustic evidence, the authors effectively address the limitations of existing models that primarily rely on latent embeddings. The incorporation of chain-of-thought reasoning adds a layer of interpretability, which is crucial for understanding model decisions in deepfake detection. The methodology is well-structured and demonstrates a clear progression from problem identification to solution development.
The experimental section is robust, showcasing a new dataset of audio pairs with chain-of-thought annotations, which is a significant contribution in itself. The results indicate that CoLMbo-DF outperforms existing baselines, even when trained on a smaller scale model. However, the paper could benefit from a more detailed comparison with a wider range of existing methods and metrics to fully validate the claims of superiority. The evaluation metrics should ideally include both subjective and objective measures to comprehensively assess the model's performance.
The paper lacks detailed implementation specifics that would aid in reproducibility. While the methodology is sound, the absence of code or supplementary materials limits the ability of other researchers to replicate the results. Providing a GitHub repository or supplementary materials with code and data would significantly enhance reproducibility.
One limitation is the reliance on a specific dataset that may not generalize well to all types of deepfake speech. Additionally, while the model improves interpretability, the complexity of integrating structured acoustic features may pose challenges in real-world applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The implications of this research are substantial, particularly in the context of misinformation and digital security. By enhancing deepfake detection systems with interpretable reasoning, the work contributes to the development of more reliable tools for combating audio-based deception. The approach could also be extended to other domains requiring audio analysis and reasoning, such as voice recognition and sentiment analysis.
Speech enhancement in hearing aids remains a difficult task in nonstationary acoustic environments, mainly because current signal processing algorithms rely on fixed, manually tuned parameters that cannot adapt in situ to different users or listening contexts. This paper introduces a unified modular framework that formulates signal processing, learning, and personalization as Bayesian inference with explicit uncertainty tracking. The proposed framework replaces ad hoc algorithm design with a single probabilistic generative model that continuously adapts to changing acoustic conditions and user preferences. It extends spectral subtraction with principled mechanisms for in-situ personalization and adaptation to acoustic context. The system is implemented as an interconnected probabilistic state-space model, and inference is performed via variational message passing in the RxInfer.jl probabilistic programming environment, enabling real-time Bayesian processing under hearing-aid constraints. Proof-of-concept experiments on the VoiceBank+DEMAND corpus show competitive speech quality and noise reduction with 85 effective parameters. The framework provides an interpretable, data-efficient foundation for uncertainty-aware, adaptive hearing-aid processing and points toward devices that learn continuously through probabilistic inference.
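For reference, the classical spectral-subtraction rule the framework extends can be sketched per analysis frame as below. This is the standard textbook formulation with an over-subtraction factor and spectral floor, not the paper's Bayesian state-space extension; parameter names are illustrative:

```python
import numpy as np

def spectral_subtract(frame, noise_psd, alpha=2.0, beta=0.01):
    """Power spectral subtraction on one windowed analysis frame.

    alpha: over-subtraction factor applied to the noise power estimate;
    beta: spectral floor that caps how far any bin can be attenuated,
    limiting musical-noise artifacts."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    power = np.abs(spec) ** 2
    clean = np.maximum(power - alpha * noise_psd, beta * power)
    gain = np.sqrt(clean / (power + 1e-12))
    return np.fft.irfft(gain * spec, n=len(frame))
```

The fixed alpha and beta here are exactly the kind of manually tuned parameters the paper's probabilistic treatment aims to replace with quantities inferred in situ.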
Primary: Eindhoven University of Technology
All Institutions: Eindhoven University of Technology, Lazy Dynamics B.V., GN Advanced Science
The main contribution of this paper is the introduction of a unified Bayesian framework for speech enhancement that adapts to user preferences and acoustic conditions in real-time. This work represents a meaningful advancement in the field of audio processing, particularly in the context of hearing aids, by providing a robust and interpretable model that can learn from its environment.
The paper presents a novel approach to speech enhancement by framing the problem within a Bayesian inference framework. This methodology allows for real-time adaptation to varying acoustic environments and user preferences, which is a significant improvement over traditional fixed-parameter algorithms. The use of a probabilistic generative model and variational message passing for inference is well-justified, and the modular architecture enhances the system's flexibility. However, the paper could benefit from a more detailed explanation of the underlying assumptions of the Bayesian model and how they impact the performance in diverse scenarios.
The experiments conducted on the VoiceBank+DEMAND corpus demonstrate the effectiveness of the proposed framework in terms of speech quality and noise reduction. The results are promising, showing competitive performance with a relatively small number of parameters (85). However, the paper lacks a comprehensive comparison with state-of-the-art methods and does not provide subjective evaluations (e.g., MOS scores) that would strengthen the claims of improved speech quality.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. While the authors mention the use of the RxInfer.jl environment for inference, there is no link to the code or detailed instructions for replicating the experiments, which is critical for the validation of the proposed methods.
One limitation is the reliance on a specific dataset (VoiceBank+DEMAND), which may not fully represent the diversity of real-world acoustic environments. Additionally, the paper does not address potential computational constraints in real-time applications, particularly in terms of the scalability of the model when deployed in actual hearing aids.
The proposed framework has significant implications for the development of adaptive hearing aids that can personalize user experiences in real-time. By enabling continuous learning and adaptation, this research could lead to improved accessibility for individuals with hearing impairments, enhancing their quality of life in various acoustic settings.
Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications-including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
MOSS-VoiceGenerator presents an innovative approach to generating realistic voices from natural language descriptions, addressing key limitations in existing TTS systems. The combination of a diverse training dataset and advanced model architecture positions this work as a significant contribution to the field of audio synthesis and voice generation.
The methodology presented in MOSS-VoiceGenerator is robust and innovative, leveraging a large-scale dataset derived from cinematic content to train a model that generates realistic voices from natural language descriptions without the need for reference audio. The authors employ a comprehensive data collection and annotation process, ensuring a diverse and expressive dataset. The model architecture integrates autoregressive techniques with a discrete framework, which simplifies deployment and enhances instruction-following capabilities. However, the reliance on a specific dataset type (cinematic) may limit generalizability to other domains.
The experimental evaluation is thorough, utilizing both subjective and objective metrics to assess the model's performance. The inclusion of a public benchmark (InstructTTSEval) for objective evaluation adds credibility to the results. The subjective preference studies provide valuable insights into user experience and model performance across different dimensions. The results indicate that MOSS-VoiceGenerator outperforms several existing models, showcasing its effectiveness in generating expressive and natural-sounding speech.
The paper outlines the training strategy and data processing pipeline in detail, which aids in reproducibility. However, the lack of a publicly accessible demo or project URL limits the ability for other researchers to replicate the work easily. Open-sourcing the model and data pipeline would significantly enhance reproducibility and community engagement.
The authors acknowledge several limitations, including the focus on Chinese and English, which restricts language diversity. The English dataset is smaller, potentially affecting performance in English voice generation. Additionally, the denoising process may introduce artifacts, and the model's output can occasionally lack stability. These limitations suggest areas for future improvement and expansion.
MOSS-VoiceGenerator has significant potential applications in various domains such as audiobook narration, game dubbing, and conversational agents, where realistic and expressive voice generation is crucial. The open-source nature of the project could foster further research and development in controllable TTS systems, contributing to advancements in human-computer interaction and accessibility.
We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models' performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap .
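The dual-encoder contrastive objective behind a model like this can be sketched as a CLAP-style symmetric InfoNCE loss over a batch of paired speech/caption embeddings. This is a minimal NumPy sketch under that assumption, not the authors' training code:

```python
import numpy as np

def symmetric_infonce(speech_emb, text_emb, temperature=0.07):
    """CLAP-style contrastive loss: matched speech/caption pairs are positives,
    all other pairings in the batch are negatives, applied in both directions."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature                     # (B, B) cosine similarities
    idx = np.arange(len(s))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)           # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

The additional classification loss the abstract mentions for ParaSpeechCLAP-Intrinsic would be a separate term added on top of an objective like this one.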
Primary: New York University
All Institutions: New York University, The University of Texas at Austin
ParaSpeechCLAP introduces a dual-encoder model that effectively maps speech and rich textual style descriptions into a common embedding space, significantly advancing the state of the art in style-prompted speech applications. The comprehensive evaluation of its performance across multiple tasks and the innovative use of a classification loss for intrinsic attributes highlight its potential impact on the field of audio machine learning.
The methodology presented in ParaSpeechCLAP is robust, utilizing a dual-encoder architecture that effectively aligns speech and text style captions in a shared embedding space. The introduction of specialized models (ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational) alongside a unified model demonstrates a thoughtful approach to handling diverse stylistic attributes. The use of a multitask loss for the intrinsic model enhances its performance by allowing it to predict specific style attributes, which is a significant methodological advancement. However, the paper could benefit from a more detailed explanation of the encoder architecture and the rationale behind the choice of specific models.
The experimental evaluation is thorough, with a clear focus on three applications: style caption retrieval, speech attribute classification, and inference-time guidance for TTS. The results indicate that ParaSpeechCLAP consistently outperforms existing baselines across various metrics, showcasing its effectiveness. The use of multiple datasets and evaluation metrics strengthens the findings, although the paper could enhance clarity by providing more context on the datasets used and their relevance to the tasks.
The paper provides a reasonable level of detail regarding the training setup, including hyperparameters and dataset descriptions. The authors have made their models and code publicly available, which is a positive step towards reproducibility. However, some aspects, such as the specific configurations of the encoder architectures and the training process, could be elaborated further to ensure that other researchers can replicate the results without ambiguity.
One limitation noted is the requirement for selecting the appropriate model variant at inference time, which could complicate deployment in practical applications. Additionally, while the unified model shows promise, it does not outperform specialized models on individual tasks, indicating a potential trade-off in performance that needs to be addressed. The paper also mentions the linear scaling of the best-of-N guidance strategy, which may not be efficient for larger N values.
The implications of this research are significant, particularly in enhancing the capabilities of style-prompted TTS systems and expressive speech retrieval. By supporting a broader range of intrinsic and situational attributes, ParaSpeechCLAP could facilitate advancements in various applications, including expressive spoken dialog systems and personalized speech synthesis. The work also sets a foundation for future research in rich style modeling and evaluation benchmarks, which could further enrich the field.
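The inference-time reward-model usage, with its linear scaling in N, amounts to best-of-N selection. In this minimal sketch the `generate` and `reward` callables are hypothetical placeholders standing in for the TTS sampler and a ParaSpeechCLAP-style similarity score:

```python
def best_of_n(prompt, generate, reward, n=4):
    """Sample n candidates from the base model and keep the highest-reward one.

    Requires no retraining of the base model, but cost grows linearly with n,
    which is the efficiency trade-off noted above."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)
```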
Project VAANI is an initiative to create an India-representative multi-modal dataset that comprehensively maps India's linguistic diversity, starting with 165 districts across the country in its first two phases. Speech data is collected through a carefully structured process that uses image-based prompts to encourage spontaneous responses. Images are captured through a separate process that encompasses a broad range of topics, gathered from both within and across districts. The collected data undergoes a rigorous multi-stage quality evaluation, including both automated and manual checks to ensure the highest possible standards in audio quality and transcription accuracy. Following this thorough validation, we have open-sourced around 289K images, approximately 31,270 hours of audio recordings, and around 2,067 hours of transcribed speech, encompassing 112 languages from 165 districts across 31 States and Union territories. Notably, a significant number of these languages are represented for the first time in a dataset of this scale, making the VAANI project a groundbreaking effort in preserving and promoting linguistic inclusivity. This data can be instrumental in building inclusive speech models for India, and in advancing research and development across speech, image, and multimodal applications.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science, Robotics Technology Park (ARTPARK), Quest Alliance, Google DeepMind
The VAANI project represents a groundbreaking effort to create a comprehensive dataset that captures India's linguistic diversity, with significant implications for inclusive speech technology development. The methodology is innovative, but the paper would benefit from more detailed experimental evaluations and reproducibility guidelines to maximize its impact in the field.
The methodology employed in the VAANI project is commendable, particularly in its structured approach to data collection and quality evaluation. The use of image-based prompts to elicit spontaneous speech responses is innovative and effectively captures the linguistic diversity of India. The multi-stage quality evaluation process, which includes both automated and manual checks, ensures high standards of audio quality and transcription accuracy, which is crucial for the reliability of the dataset. However, the paper could benefit from a more detailed description of the specific algorithms or techniques used in the quality evaluation process.
The paper outlines a substantial dataset comprising 31,270 hours of audio and 2,067 hours of transcribed speech across 112 languages. This extensive dataset is a significant contribution to the field, particularly for underrepresented languages. However, the paper lacks detailed experimental results demonstrating the effectiveness of the dataset in training inclusive speech models. It would be beneficial to include baseline comparisons or performance metrics to illustrate the dataset's impact on model performance.
The paper does not provide sufficient details regarding the implementation of the data collection and quality evaluation processes, which may hinder reproducibility. While the dataset is open-sourced, additional documentation or guidelines for replicating the data collection methodology would enhance reproducibility.
One limitation of the study is the potential bias in data collection due to the selection of districts and topics for image prompts. Additionally, while the dataset aims to represent linguistic diversity, the focus on only 165 districts may not capture the full spectrum of languages and dialects present in India. The paper also does not address the challenges of data privacy and ethical considerations in collecting speech data from individuals.
The VAANI project has the potential to significantly impact the development of inclusive speech technologies in India, promoting linguistic diversity and accessibility. By providing a comprehensive dataset, it can facilitate research in speech recognition, multimodal applications, and language preservation efforts. This initiative could also inspire similar projects in other linguistically diverse regions, contributing to global efforts in digital inclusivity.
Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio, facial expressions and scene-level cues from video, and semantic context from text, which are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment (PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
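The proxy-MOS aggregation can be sketched as a weighted sum of objective metrics mapped onto the 1-5 MOS scale. The normalization and mapping here are illustrative assumptions, and the paper optimizes the weights via active learning rather than fixing them by hand:

```python
import numpy as np

def proxy_mos(metric_scores, weights):
    """Combine objective metrics (each pre-normalized to [0, 1]) into a proxy
    MOS on the standard 1-5 scale via a convex combination of the weights."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # normalize to a convex combination
    m = np.clip(np.asarray(metric_scores, dtype=float), 0.0, 1.0)
    return 1.0 + 4.0 * float(w @ m)
```

A proxy target like this supplies dense supervision for the first training stage, with the much scarcer human MOS reserved for fine-tuning.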
Primary: Unknown
All Institutions: XYZ agency
The paper presents a novel hierarchical multimodal architecture for evaluating AI-dubbed content, addressing the challenges of subjective quality assessment through innovative methodologies and strong experimental validation. The contributions made in this work are significant, providing a foundation for future advancements in automated audio-visual quality assessment.
The proposed hierarchical multimodal architecture is innovative in its integration of audio, video, and text features for evaluating AI-dubbed content. The use of lightweight LoRA adapters for parameter-efficient fine-tuning is a valuable contribution, allowing for effective adaptation across modalities without extensive computational resources. The two-stage training pipeline, which combines active learning for proxy MOS generation with fine-tuning based on human ratings, demonstrates a thoughtful approach to overcoming the challenges of limited subjective labels. However, the methodology could benefit from a more detailed explanation of the active learning process and the specific metrics used to derive Proxy MOS.
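The proxy-MOS idea described above reduces to a weighted combination of normalized objective metrics mapped onto the 1-5 MOS scale. A minimal sketch; the metric names, values, and weights below are hypothetical placeholders (the paper optimizes the weights via active learning, which is not reproduced here):

```python
def proxy_mos(metrics, weights):
    """Aggregate objective metrics into a single MOS-like score.

    Both dicts map metric name -> value; each metric is assumed to be
    pre-normalized to [0, 1] with 1 = best. The weighted mean is then
    mapped onto the usual 1-5 MOS scale.
    """
    total = sum(weights.values())
    score01 = sum(weights[k] * metrics[k] for k in weights) / total
    return 1.0 + 4.0 * score01  # [0, 1] -> [1, 5]

# Hypothetical per-clip objective metrics and weights (illustrative only).
clip = {"lip_sync": 0.9, "speaker_sim": 0.8, "semantic_sim": 0.7}
w = {"lip_sync": 0.5, "speaker_sim": 0.3, "semantic_sim": 0.2}
```

A call such as `proxy_mos(clip, w)` then yields a scalar in [1, 5] that can stand in for a human rating during the first training stage.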
The experiments are well-structured, utilizing two publicly available datasets (MELD and M2H2) to validate the proposed model. The use of subjective evaluation with a diverse participant group strengthens the findings, and the reported results show a strong correlation with human ratings, indicating the model's effectiveness. The ablation studies provide insights into the contributions of each modality, although more detailed statistical analysis could enhance the robustness of the claims. The results demonstrate that the hierarchical integration of modalities significantly improves performance, which is a critical finding for the field.
The paper provides a reasonable level of detail regarding the experimental setup, including the datasets, training parameters, and evaluation metrics. However, the absence of a publicly accessible code repository or demo limits the reproducibility of the results. Including such resources would significantly enhance the paper's impact and facilitate further research in this area.
The primary limitation is the reliance on human ratings for fine-tuning, which can introduce bias and variability. Additionally, the model's performance may be influenced by the quality and diversity of the training data, which could affect its generalizability to other languages or dubbing contexts. The paper does not address potential challenges in scaling the model to larger datasets or different languages, which could limit its applicability.
The proposed architecture has significant implications for the field of AI-generated content evaluation, particularly in enhancing the quality of dubbed media. By providing a scalable solution for automatic assessment, this research could facilitate the widespread adoption of AI dubbing technologies in various industries, including entertainment and education. Furthermore, the integration of multimodal cues aligns with current trends in AI, promoting more human-centered approaches to content generation and evaluation.
We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). Because audio encodes non-semantic information, it induces severe train/test distribution shifts that can lead to spurious MIA performance. Using a multi-modal blind baseline based on textual, spectral, and prosodic features, we demonstrate that common speech datasets exhibit near-perfect train/test separability (AUC ≈ 1.0) even without model inference, and that standard MIA scores strongly correlate with these blind acoustic artifacts (correlation > 0.7). Using this blind baseline, we identify distribution-matched datasets that enable reliable MIA evaluation without distribution-shift confounds. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations.
Primary: National Taiwan University
All Institutions: National Taiwan University
The main contribution of this paper is the establishment of a principled standard for auditing LALMs through the identification of distribution-matched datasets and the systematic evaluation of MIA methods. This work is significant as it addresses a critical gap in understanding the vulnerabilities of audio models and sets the stage for future research in this area.
The paper introduces a systematic approach to Membership Inference Attacks (MIA) against Large Audio Language Models (LALMs). The authors utilize a multi-modal blind baseline that incorporates textual, spectral, and prosodic features to evaluate MIA performance. This methodology is innovative as it highlights the challenges posed by non-semantic information in audio datasets, which can lead to misleading MIA results. The identification of distribution-matched datasets for reliable MIA evaluation is a significant methodological contribution, as it provides a clearer framework for understanding model vulnerabilities.
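The blind-baseline argument can be illustrated with a toy example: when member and non-member splits differ in some acoustic artifact, ranking that artifact alone already yields a near-perfect AUC with no model inference at all. The durations below are made up for illustration:

```python
def roc_auc(pos, neg):
    """ROC AUC via the Mann-Whitney U statistic: the probability that a
    randomly chosen positive score outranks a randomly chosen negative,
    counting ties as half a win."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# A "blind" membership score built from a single acoustic artifact
# (clip duration in seconds), with no access to the model whatsoever.
train_durations = [9.8, 10.1, 10.3, 9.9]  # hypothetical training-split clips
test_durations = [3.1, 2.8, 3.5, 2.9]     # hypothetical test-split clips
```

Here `roc_auc(train_durations, test_durations)` is 1.0: any MIA score correlated with duration would look successful without measuring memorization, which is exactly the confound the paper controls for with distribution-matched datasets.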
The experiments benchmark multiple MIA methods and include modality disentanglement experiments, which are well-structured and provide insightful results. The correlation of standard MIA scores with blind acoustic artifacts is particularly noteworthy, revealing the potential pitfalls in existing evaluation metrics. The authors present a comprehensive analysis of their findings, demonstrating that LALM memorization is cross-modal, which adds depth to the experimental evaluation.
The paper mentions the use of generative AI tools for manuscript preparation and software development, which raises questions about the reproducibility of the experimental results. However, the authors state that the core ideas and analyses are their original work. The lack of a dedicated project or code repository limits the ability for others to reproduce the experiments fully, which is a significant drawback.
One limitation is the reliance on specific datasets, which may not generalize to all audio language models. Additionally, while the paper identifies distribution shifts as a critical factor in MIA evaluation, it does not explore the implications of these shifts in detail. The absence of a demo or project URL further limits the accessibility of the findings.
The findings have significant implications for the auditing of LALMs, particularly in understanding model vulnerabilities and the risks associated with spurious correlations in audio data. This research could influence future work in the field, particularly in the design of more robust models and evaluation frameworks that account for non-semantic information in audio.
Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the EvA architecture, which effectively addresses the evidence bottleneck in large audio language models through a dual-path approach that preserves acoustic evidence before reasoning. This work significantly advances the state of the art in audio understanding, demonstrating the importance of upstream perception in achieving better performance in complex acoustic scenes.
The proposed EvA architecture introduces a dual-path system that effectively combines speech and non-speech audio processing through hierarchical evidence aggregation and non-compressive, time-aligned fusion. This innovative approach addresses the identified evidence bottleneck in existing LALMs, allowing for improved acoustic evidence retention before reasoning. The methodology is well-structured, with clear explanations of the dual-path architecture and the training process, including the creation of the EvA-Perception dataset. However, while the architecture is novel, it builds on existing frameworks such as Whisper and CED-Base, which may limit the perceived originality of the approach.
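The non-compressive, time-aligned fusion described here can be sketched as resampling one feature stream onto the other's timeline and adding elementwise, leaving the sequence length unchanged. This sketch assumes both streams already share the same feature dimension (in practice a projection would handle this) and uses simple linear interpolation as a stand-in for the paper's actual alignment:

```python
def resample_linear(seq, target_len):
    """Linearly interpolate a list of frame vectors to target_len frames."""
    n = len(seq)
    out = []
    for t in range(target_len):
        pos = t * (n - 1) / (target_len - 1) if target_len > 1 else 0.0
        i = int(pos)
        frac = pos - i
        j = min(i + 1, n - 1)
        out.append([(1 - frac) * a + frac * b for a, b in zip(seq[i], seq[j])])
    return out

def fuse(whisper_feats, ced_feats):
    """Align CED features to the Whisper timeline and add the two streams,
    keeping the Whisper sequence length unchanged (non-compressive)."""
    aligned = resample_linear(ced_feats, len(whisper_feats))
    return [[w + c for w, c in zip(wf, cf)]
            for wf, cf in zip(whisper_feats, aligned)]
```

The key property is that no pooling or token-dropping occurs: every Whisper frame retains its own acoustic evidence from the event stream.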
The experiments conducted demonstrate the effectiveness of the EvA architecture across multiple benchmarks (MMAU, MMAR, MMSU, and CochlScene), with significant improvements in perception-heavy tasks. The results are compelling, showcasing the model's ability to outperform existing systems, particularly in preserving acoustic evidence. The use of a unified zero-shot protocol for evaluation adds rigor to the experimental design. However, the paper could benefit from more detailed comparisons with a broader range of existing models to contextualize the improvements more effectively.
The paper provides a comprehensive overview of the training strategy, including hyperparameters and the two-stage training process. However, the lack of specific implementation details and code availability may hinder full reproducibility. The authors mention that the EvA model is open-source, which is a positive aspect, but the absence of a direct link to the repository limits accessibility for other researchers.
The paper acknowledges several limitations, including the focus on English-only captions in the EvA-Perception dataset and the need for more systematic multilingual evaluation. Additionally, the temporal reasoning capabilities are constrained by the soft event boundaries in the training data, and the model's performance on music analysis is limited by the lack of expert-level concepts. These limitations suggest areas for future work and improvement.
The advancements made by the EvA architecture have significant implications for audio understanding applications, particularly in complex acoustic environments. By improving the retention of acoustic evidence, the model can enhance various tasks, including audio captioning, event detection, and question answering. The open-source nature of the dataset and model also encourages further research and development in the field, potentially leading to more robust audio processing systems in diverse applications.
Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and of the dominant modality among multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper we propose MAR3, a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework for high-quality Reference Audio-Visual Segmentation. Drawing on the sociological Delphi method for robust analysis, we propose a Consensus Multimodal Recognition mechanism that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt of the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% J&F) on the Ref-AVSBench dataset, outperforming the previous SOTA by an absolute 3.4%.
Primary: Inner Mongolia University
All Institutions: Inner Mongolia University
The paper presents MAR3, a novel multi-agent framework for Reference Audio-Visual Segmentation that effectively addresses key challenges in multimodal integration and reasoning. Its structured methodology and strong experimental results position it as a significant contribution to the field of audio-visual machine learning.
The proposed MAR3 framework introduces a novel approach to Reference Audio-Visual Segmentation (Ref-AVS) by decomposing the task into three distinct phases: Recognition, Reasoning, and Reflection. This multi-agent system leverages the Consensus Multimodal Recognition mechanism, which incorporates the Delphi theory to enhance the understanding of multimodal cues, and the Collaborative Object Reasoning strategy that adapts to the difficulty of reference expressions. The Reflective Learning Segmentation mechanism further improves segmentation accuracy through iterative corrections. This structured approach is innovative and addresses key limitations in existing methods, such as the reliance on high-quality instruction-tuning datasets and the lack of reflective validation.
The experiments conducted on the Ref-AVSBench dataset demonstrate that MAR3 achieves state-of-the-art performance, surpassing previous methods by a notable margin (3.4% improvement in J&F score). The paper provides a comprehensive evaluation, including ablation studies that validate the effectiveness of each component of the framework. The metrics used (Jaccard index and F-score) are appropriate for the task, and the results are presented clearly, showcasing the advantages of the proposed method.
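For reference, the reported J&F metric averages two measures: J is the region Jaccard index (IoU) of predicted versus ground-truth masks, and F is a boundary F-score. The sketch below works on flattened 0/1 masks and uses a pixel-level F1 as a simplified stand-in for the benchmark's contour-based F:

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of binary masks."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return inter / union if union else 1.0

def f_score(pred, gt):
    """Simplified pixel-level F1 (the benchmark's F is contour-based)."""
    tp = sum(p & g for p, g in zip(pred, gt))
    fp = sum(p & (1 - g) for p, g in zip(pred, gt))
    fn = sum((1 - p) & g for p, g in zip(pred, gt))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def j_and_f(pred, gt):
    """J&F: the mean of region similarity and boundary accuracy."""
    return (jaccard(pred, gt) + f_score(pred, gt)) / 2
```

In the benchmark these scores are averaged over all frames and test expressions, giving the 69.2% figure quoted above.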
While the paper outlines the methodology and experimental setup, it lacks detailed implementation specifics that would facilitate reproduction. Information on the models used for each agent and the exact configurations of the experiments is somewhat limited. Providing code or a more detailed supplementary material would enhance reproducibility.
One notable limitation is that the proposed framework has not been specifically designed to handle cases where the reference expression refers to non-existent objects in the video. This could limit its applicability in real-world scenarios where such ambiguities arise. Additionally, the reliance on multiple agents may introduce complexity in deployment and scalability.
The MAR3 framework has significant implications for applications in film production, video editing, and content creation, where accurate segmentation of audio-visual elements is crucial. By improving the reliability of segmentation in dynamic scenes, this research could enhance user experiences in multimedia applications and contribute to advancements in AI-driven content generation.
Large Language Models (LLMs) are strong decoders for Serialized Output Training (SOT) in two-talker Automatic Speech Recognition (ASR), yet their performance degrades substantially in challenging conditions such as three-talker mixtures. A key limitation is that current systems inject acoustic evidence only through a projected prefix, which can be lossy and imperfectly aligned with the LLM input space, providing insufficient fine-grained grounding during decoding. Addressing this limitation is crucial for robust multi-talker ASR, especially in three-talker mixtures. This paper improves LLM-based multi-talker ASR by explicitly injecting talker-aware acoustic evidence into the decoder. We first revisit Connectionist Temporal Classification (CTC)-derived prefix prompting and compare three variants with increasing acoustic content. The CTC information is obtained using the serialized CTC proposed in our previous works. While acoustic-enriched prompts outperform the SOT-only baseline, prefix-only conditioning remains inadequate for three-talker mixtures. We therefore propose a lightweight gated residual cross-attention adapter and design a two-stage acoustic adaptation framework based on low-rank updates (LoRA). In Stage 1, we insert gated cross-attention adapters after the self-attention sub-layer to stably inject acoustic embeddings as external memory. In Stage 2, we refine both the cross-attention adapters and the pretrained LLM's self-attention projections using parameter-efficient LoRA, improving robustness for large backbones under limited data; the learned updates are merged into the base weights for inference. Experiments on Libri2Mix/Libri3Mix under clean and noisy conditions show consistent gains, with particularly large improvements in three-talker settings.
Primary: Kyoto University
All Institutions: Kyoto University, National Institute of Information and Communications Technology
This paper presents a significant advancement in multi-talker ASR by introducing a two-stage acoustic adaptation framework that enhances the integration of acoustic evidence into LLMs. The innovative methodology and promising experimental results position it as a valuable contribution to the field of audio processing and machine learning.
The paper proposes a two-stage acoustic adaptation framework that integrates gated cross-attention adapters into a large language model (LLM) for multi-talker automatic speech recognition (ASR). The methodology is innovative in addressing the limitations of prefix-only conditioning by dynamically injecting talker-aware acoustic evidence during decoding. The use of low-rank updates (LoRA) for parameter-efficient adaptation is particularly noteworthy, as it enhances the model's robustness under limited data conditions. The systematic exploration of CTC-derived prefix prompting variants adds depth to the methodology, although the paper could benefit from a clearer description of the experimental setup and hyperparameter tuning processes.
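Two of the mechanisms described above are simple enough to sketch: a zero-initialized tanh gate lets the cross-attention adapter start as an identity map, and the Stage-2 low-rank update can be folded into the base weight so inference pays no adapter cost. The matrices below are tiny illustrative placeholders, not the model's actual shapes:

```python
import math

def gated_residual(h, update, gate):
    """Gated residual injection: with gate initialized to 0, tanh(0) = 0,
    so the adapter leaves the hidden state untouched early in training."""
    return [hi + math.tanh(gate) * ui for hi, ui in zip(h, update)]

def matmul(A, B):
    """Plain nested-list matrix multiply, for illustration only."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def merge_lora(W, A, B, alpha, r):
    """Fold the low-rank update into the base weight,
    W' = W + (alpha / r) * B @ A, so inference uses a single matmul."""
    delta = matmul(B, A)  # (d_out x r) @ (r x d_in)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

The zero-gate initialization is a common trick for stably adding new cross-attention pathways to a pretrained decoder, which matches the paper's emphasis on stable Stage-1 injection.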
The experiments are well-structured, utilizing the Libri2Mix and Libri3Mix datasets to evaluate the proposed methods under both clean and noisy conditions. The results demonstrate consistent performance gains, particularly in three-talker scenarios, validating the effectiveness of the proposed gated cross-attention adapters. However, the paper lacks detailed statistical analysis of the results and comparisons with state-of-the-art methods beyond the baseline systems, which would strengthen the claims of improvement.
The paper provides a comprehensive overview of the model architecture and training procedures, but it lacks specific implementation details that would facilitate reproducibility, such as exact hyperparameter settings and training schedules. The absence of a publicly available code repository further hinders reproducibility efforts.
A notable limitation is the reliance on the performance of the gated cross-attention mechanism, which, while effective, still falls short of the robustness offered by the serialized CTC approach in certain scenarios. Additionally, the paper does not address the potential computational overhead introduced by the proposed adaptations, which may limit practical deployment in real-time systems.
The advancements presented in this paper have significant implications for multi-talker ASR systems, particularly in applications such as conference transcription, voice assistants, and accessibility technologies. By improving the ability of LLMs to handle overlapping speech, the research could enhance communication tools for diverse user groups, including those with hearing impairments.
A promising approach for steering auditory attention in complex listening environments relies on Auditory Attention Decoding (AAD), which aims to identify the attended speech stream in a multi-speaker scenario from neural recordings. Entrainment-based AAD approaches typically assume access to clean speech sources and electroencephalography (EEG) signals in order to exploit low-frequency correlations between the neural response and the attended stimulus. In this study, we propose CA-TCN, a Causal-Anticausal Temporal Convolutional Network that directly classifies the attended speaker. The proposed architecture integrates several best practices from convolutional neural networks in sequence-processing tasks. Importantly, it explicitly aligns auditory stimuli and neural responses by applying causal convolutions to the stimuli and anticausal convolutions to the neural responses, with distinct receptive fields operating in opposite temporal directions. Experimental results, obtained through comparisons with three baseline AAD models, demonstrate that CA-TCN consistently improved decoding accuracy across datasets and decision windows, with gains ranging from 0.5% to 3.2% for subject-independent models and from 0.8% to 2.9% for subject-specific models compared with the next-best-performing model, AADNet. Moreover, these improvements were statistically significant in four of the six evaluated settings when comparing Minimum Expected Switch Duration distributions. Beyond accuracy, the model demonstrated spatial robustness across conditions, as the EEG spatial filters exhibited stable patterns across datasets. Overall, this work introduces an accurate and unified AAD model that outperforms existing methods while offering practical benefits for online processing scenarios. These findings advance the state of AAD and its applicability in real-world systems.
Primary: Universidad Pública de Navarra
All Institutions: Universidad Pública de Navarra, Basque Center on Cognition Brain and Language, Technische Universität Berlin, Ikerbasque
The main contribution of this work is the introduction of CA-TCN, a novel deep learning architecture that significantly improves auditory attention decoding accuracy by leveraging causal and anticausal convolutions. This research advances the state of AAD by providing a unified model that outperforms existing methods, demonstrating its potential for real-world applications in auditory processing and assistive technologies.
The proposed CA-TCN architecture is innovative in its use of causal and anticausal convolutions to process auditory stimuli and EEG signals separately, which is a novel approach in the context of Auditory Attention Decoding (AAD). The integration of established practices from convolutional neural networks, such as residual connections and dilated convolutions, enhances the model's ability to capture temporal dependencies effectively. The ablation study provides a solid foundation for understanding the contribution of each architectural component, demonstrating the importance of the TCN module and the incorporation of causality.
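The causal/anticausal distinction at the heart of the architecture reduces to where the padding goes in a 1-D convolution: pad on the left and each output depends only on the present and past; pad on the right and it depends only on the present and future. A minimal single-channel sketch:

```python
def causal_conv(x, kernel):
    """Left padding: output[t] depends only on x[<= t] (past and present)."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]

def anticausal_conv(x, kernel):
    """Right padding: output[t] depends only on x[>= t] (present and future)."""
    k = len(kernel)
    padded = list(x) + [0.0] * (k - 1)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]
```

Stacking dilated layers of each kind yields the two receptive fields that grow in opposite temporal directions, which is how the model compensates for the lag between stimulus and neural response.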
The experimental evaluation is robust, utilizing three distinct datasets to benchmark the CA-TCN against established AAD models. The results indicate consistent improvements in decoding accuracy across various decision windows, with significant statistical validation of the findings. The use of Minimum Expected Switch Duration (MESD) as a performance metric is particularly relevant for real-world applications, providing a comprehensive measure of the model's efficiency in practical scenarios.
The paper mentions that the code will be made publicly available upon publication, which is a positive step towards reproducibility. However, the lack of a direct link to the code repository at this stage limits immediate access for verification of results. The detailed methodology and experimental setup described in the paper facilitate reproducibility, although the reliance on specific datasets may still pose challenges for independent validation.
The study's findings are primarily based on controlled experimental conditions, which may not translate directly to real-world scenarios. The model's performance in more complex and naturalistic environments remains to be evaluated. Additionally, while the CA-TCN architecture shows promise, the generalizability of the results across diverse populations and auditory conditions is yet to be fully established.
The implications of this research are significant for the development of advanced auditory attention decoding systems, particularly for applications in assistive hearing technologies. The ability to decode attention in complex auditory environments could enhance the effectiveness of hearing aids and other auditory devices, improving the quality of life for individuals with hearing impairments. The findings also contribute to the broader field of cognitive neuroscience by providing insights into the neural mechanisms underlying auditory attention.
While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M-94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre-trained models are available at https://github.com/HarunoriKawano/HEAR
Primary: University of Technology Sydney
All Institutions: University of Technology Sydney, Shibaura Institute of Technology
The paper presents HEAR, a human-inspired audio representation framework that effectively decouples local feature extraction from global task adaptation, achieving high efficiency and competitive performance in audio classification tasks. This innovative approach addresses critical computational challenges in deploying self-supervised learning models in resource-constrained environments, marking a significant advancement in the field of audio representation learning.
The proposed HEAR framework introduces a decoupled architecture inspired by human auditory processing, effectively separating local feature extraction and global task adaptation. This innovative approach addresses the computational inefficiencies of standard Transformer architectures by utilizing an Acoustic Model and a Task Model, which enhances the model's efficiency while maintaining competitive performance across various audio classification tasks. The use of knowledge distillation for an Acoustic Tokenizer further strengthens the model's ability to learn robust audio representations.
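The Masked Audio Modeling objective mentioned here rests on a simple corruption step: hide a random subset of frames and ask the model to predict the tokenizer's discrete targets only at those positions. A minimal sketch of the masking step (the mask ratio and mask value are illustrative choices, not the paper's settings):

```python
import random

def mask_frames(frames, mask_ratio=0.5, mask_value=0.0, seed=0):
    """Replace a random subset of frames with mask_value; return the
    corrupted sequence and the masked indices the model must predict."""
    rng = random.Random(seed)
    n_mask = int(len(frames) * mask_ratio)
    idx = sorted(rng.sample(range(len(frames)), n_mask))
    corrupted = list(frames)
    for i in idx:
        corrupted[i] = mask_value
    return corrupted, idx
```

During pre-training, the loss would then be computed only at `idx`, comparing the model's predictions against the Acoustic Tokenizer's targets for those frames.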
The experiments conducted are extensive, utilizing a large-scale dataset of over 10,000 hours of audio for pre-training. The paper evaluates the model's performance on multiple benchmark datasets, demonstrating its competitive accuracy despite significantly reduced parameters and computational costs. The ablation studies provide insights into the contributions of various components of the architecture, validating the effectiveness of the proposed methods.
The paper includes detailed descriptions of the architecture, training procedures, and hyperparameter settings, which should facilitate reproducibility. The availability of code and pre-trained models on GitHub further supports this aspect, although the lack of a demo URL may limit immediate accessibility for practical applications.
While the HEAR architecture shows promise, its performance on certain tasks is still lower compared to larger models, indicating a trade-off between efficiency and accuracy. Additionally, the reliance on specific datasets for training and evaluation may limit the generalizability of the findings to other audio tasks.
The HEAR framework has significant implications for real-time audio processing on resource-constrained devices, making it suitable for applications in mobile devices, IoT, and edge computing. Its efficiency could enable broader adoption of advanced audio representation techniques in practical applications, such as speech recognition and environmental sound classification.
The rapid advancement of generative models has enabled highly realistic audio deepfakes, yet current detectors suffer from a critical bias problem, leading to poor generalization across unseen datasets. This paper proposes Artifact-Focused Self-Synthesis (AFSS), a method designed to mitigate this bias by generating pseudo-fake samples from real audio via two mechanisms: self-conversion and self-reconstruction. The core insight of AFSS lies in enforcing same-speaker constraints, ensuring that real and pseudo-fake samples share identical speaker identity and semantic content. This forces the detector to focus exclusively on generation artifacts rather than irrelevant confounding factors. Furthermore, we introduce a learnable reweighting loss to dynamically emphasize synthetic samples during training. Extensive experiments across 7 datasets demonstrate that AFSS achieves state-of-the-art performance with an average EER of 5.45%, including a significant reduction to 1.23% on WaveFake and 2.70% on In-the-Wild, all while eliminating the dependency on pre-collected fake datasets. Our code is publicly available at https://github.com/NguyenLeHaiSonGit/AFSS.
Primary: University of Science, Ho Chi Minh City
All Institutions: University of Science, Ho Chi Minh City, University College Dublin
The paper presents AFSS, a novel approach to bias mitigation in audio deepfake detection, significantly advancing the field by focusing on generation artifacts while eliminating confounding factors. The methodology's innovative design and rigorous experimental validation position it as a meaningful contribution to audio machine learning research.
The proposed Artifact-Focused Self-Synthesis (AFSS) method is innovative in its approach to mitigating bias in audio deepfake detection. By generating pseudo-fake samples through self-conversion and self-reconstruction mechanisms while enforcing same-speaker constraints, the methodology effectively addresses the critical issue of irrelevant confounding factors that typically hinder detection performance across different datasets. The introduction of a learnable reweighting loss further enhances the model's ability to focus on synthetic samples, which is a significant advancement in training strategies for audio deepfake detection.
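The "learnable reweighting loss" is described only at a high level. One hedged reading is a weighted binary cross-entropy whose per-sample weights come from learnable parameters passed through a softmax; the function and parameter names (`reweighted_bce`, `sample_log_w`) and the mean-one normalization are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def reweighted_bce(logits, labels, sample_log_w):
    """Weighted binary cross-entropy where per-sample weights are
    a softmax over learnable parameters (one possible reading of a
    'learnable reweighting loss'; the paper's exact form may differ)."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    bce = -(labels * np.log(probs + 1e-12)
            + (1 - labels) * np.log(1 - probs + 1e-12))
    # Softmax over the learnable log-weights, rescaled to mean 1
    # so the loss magnitude stays comparable to an unweighted mean.
    w = np.exp(sample_log_w - sample_log_w.max())
    w = w / w.sum() * len(w)
    return float(np.mean(w * bce))
```

With uniform `sample_log_w` this reduces exactly to the plain mean BCE; during training the weights would shift emphasis toward the synthetic samples.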
The experiments conducted across seven diverse datasets provide a robust evaluation of the proposed method. The reported results demonstrate state-of-the-art performance with an average Equal Error Rate (EER) of 5.45%, showcasing significant improvements over existing methods, particularly in challenging real-world scenarios. The comprehensive comparison with multiple baselines and disentanglement methods reinforces the effectiveness of AFSS in achieving superior cross-domain generalization.
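For context, the Equal Error Rate (EER) used above is the operating point at which the false-acceptance rate on spoofed audio equals the false-rejection rate on genuine audio. A minimal threshold-sweep sketch (the function name `compute_eer` is illustrative; production code typically derives EER from an ROC curve instead):

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal Error Rate: find the threshold where the false-acceptance
    rate (spoofs scored above threshold) and false-rejection rate
    (genuine audio scored below threshold) are closest, and return
    their average at that point."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    thresholds = np.sort(np.unique(scores))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)      # spoofs accepted
        frr = np.mean(bonafide_scores < t)    # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```

A perfectly separable detector yields an EER of 0; the 5.45% average reported for AFSS means roughly 1 in 18 trials is misclassified at the balanced operating point.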
The paper provides detailed implementation details, including the architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The availability of the code on GitHub enhances the potential for other researchers to replicate and build upon the work.
While the AFSS method shows promise, it may still be limited by the quality and diversity of the real audio samples used for generating pseudo-fake samples. Additionally, the reliance on specific transformation techniques may not generalize to all types of audio deepfakes, potentially limiting the method's applicability in broader contexts.
The implications of this research are significant, as it addresses a pressing need for reliable audio deepfake detection methods in an era where generative models are becoming increasingly sophisticated. The ability to mitigate bias effectively can enhance the robustness of detection systems, contributing to the prevention of misuse in various applications, including security, media integrity, and misinformation.
Large language model (LLM)-based text-to-speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA-TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine-tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA-TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed-TTS-Eval, matching the original CosyVoice 3 baseline performance while delivering a 2x LLM-stage speedup, a notable acceleration achieved despite the absence of a KV cache, an optimization on which the AR baseline heavily relies. Beyond acceleration, the bidirectional architecture naturally enables zero-shot speech editing, including word-level insertion, deletion, and substitution, without any additional training. Theoretically, we prove that AR-pretrained weights are near-optimal for bidirectional masked prediction under the locality property of acoustic tokens, explaining this rapid convergence. This general method modifies only the attention mask and objective, applying seamlessly to any LLM-based AR TTS system. Code and audio samples will be available at https://deft-piroshki-b652b5.netlify.app/.
Primary: Bairong
All Institutions: Bairong
The main contribution of this paper is the introduction of LLaDA-TTS, a novel masked diffusion model for TTS that decouples inference latency from sequence length while enabling zero-shot speech editing capabilities. This work represents a significant advancement in the efficiency and functionality of TTS systems, with implications for both research and practical applications in the field.
The paper introduces LLaDA-TTS, which innovatively replaces the autoregressive (AR) text-to-speech (TTS) model with a masked diffusion model. This approach allows for parallel generation of speech tokens, significantly reducing inference latency. The methodology is well-grounded in theoretical analysis, particularly the proof that AR-pretrained weights are near-optimal for bidirectional masked prediction. The architecture is adaptable to various LLM-based TTS systems, enhancing its applicability. The use of a bidirectional attention mechanism enables zero-shot speech editing, which is a notable advancement in TTS technology. However, the reliance on a specific initialization and the necessity of determining output length in advance may limit flexibility.
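The parallel-generation idea can be illustrated with a toy confidence-based unmasking loop: start fully masked, and at each of a fixed number of steps commit the model's most confident predictions. This is a generic masked-diffusion decoding sketch, not LLaDA-TTS's actual decoder; `predict_fn`, the `MASK` sentinel, and the linear unmasking schedule are all assumptions:

```python
import numpy as np

MASK = -1

def masked_diffusion_decode(predict_fn, length, num_steps):
    """Iterative parallel decoding: begin with every position masked,
    then at each step unmask the highest-confidence predictions so
    that all positions are filled after num_steps steps."""
    tokens = np.full(length, MASK)
    for step in range(num_steps):
        probs = predict_fn(tokens)            # (length, vocab)
        conf = probs.max(axis=1)              # per-position confidence
        pred = probs.argmax(axis=1)
        masked = tokens == MASK
        remaining = int(masked.sum())
        # Unmask an even share of the remaining positions per step,
        # highest-confidence masked positions first.
        k = int(np.ceil(remaining / (num_steps - step)))
        order = np.argsort(-np.where(masked, conf, -np.inf))
        for pos in order[:k]:
            tokens[pos] = pred[pos]
    return tokens
```

The key property, mirrored from the paper's claim, is that the number of model calls (`num_steps`) is fixed regardless of `length`, whereas AR decoding needs one call per token.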
The experiments conducted on the Seed-TTS-Eval benchmark demonstrate the efficacy of LLaDA-TTS, achieving competitive character and word error rates while maintaining a significant speed advantage over traditional AR models. The results are well-documented, with comparisons against established baselines, and the paper provides a thorough analysis of the performance metrics. The findings on emergent behaviors, such as AR-like unmasking and attention alignment, are particularly noteworthy, showcasing the model's ability to leverage learned structures from pretraining effectively.
The paper mentions that code and audio samples will be available, which is a positive aspect for reproducibility. However, details on the training process, hyperparameters, and specific configurations used in experiments could be more explicitly outlined to enhance reproducibility further.
The primary limitations include the requirement to specify output length in advance and the non-streaming nature of the current implementation. These factors may hinder real-time applications and flexibility in various use cases. Additionally, while the model shows promise in zero-shot editing, the effectiveness of this feature across diverse scenarios remains to be fully validated.
LLaDA-TTS has the potential to significantly impact the field of speech synthesis by providing a faster and more flexible approach to generating high-quality speech. The ability to perform zero-shot editing could revolutionize applications in voiceovers, audiobooks, and interactive voice response systems. The findings may also inspire further research into the integration of diffusion models in other generative tasks within audio and beyond.
Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at https://cass-flowmatching.github.io.
Primary: Graduate School of Artificial Intelligence
All Institutions: Graduate School of Artificial Intelligence
The main contribution of this paper is the introduction of a novel audio-visual source separation framework that utilizes visual cues to enhance the quality of audio separation in cinematic contexts. This work represents a significant advancement in the field of audio processing, particularly in integrating multimodal data to improve performance in complex audio environments.
The paper introduces a novel framework for audio-visual CASS (AV-CASS) that integrates visual cues into the audio source separation process. The approach formulates the problem as a conditional generative modeling task using conditional flow matching, which is innovative in the context of audio separation. The introduction of a dedicated visual encoder for a dual-stream setup is a significant methodological advancement, allowing the model to leverage visual context effectively. The training data synthesis pipeline is also noteworthy, as it addresses the challenge of limited cinematic datasets by pairing in-the-wild audio and video streams. Overall, the methodology is well-structured and presents a clear advancement over existing audio-only methods.
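Conditional flow matching, the generative backbone named above, trains a vector field to regress the velocity of a straight-line path from noise to data at a random interpolation time. A minimal sketch of the standard training objective (the function name `cfm_loss` and the linear interpolation path are the textbook formulation, not necessarily the paper's exact variant):

```python
import numpy as np

def cfm_loss(vector_field, x0, x1, t, cond):
    """Conditional flow matching objective: at a point x_t interpolated
    between noise x0 and data x1, regress the model's predicted velocity
    toward the constant straight-line target velocity (x1 - x0)."""
    t = t.reshape(-1, 1)                    # broadcast time over features
    xt = (1.0 - t) * x0 + t * x1            # linear interpolation path
    target = x1 - x0                        # target velocity along the path
    pred = vector_field(xt, t, cond)        # cond: e.g. visual features
    return float(np.mean((pred - target) ** 2))
```

In the AV-CASS setting, `cond` would carry the dual-stream visual embeddings, so the learned flow is steered by what is on screen; the loss is exactly zero for a vector field that outputs the true straight-line velocity.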
The experiments are comprehensive, evaluating the model's performance on synthetic, real-world, and audio-only CASS benchmarks. The results indicate strong performance across these benchmarks, demonstrating the model's ability to generalize from synthetic data to real-world scenarios. However, the paper could benefit from more detailed quantitative metrics and comparisons against state-of-the-art methods to strengthen the evaluation of its contributions.
The paper provides a demo and code repository, which is a positive aspect for reproducibility. However, the details regarding the implementation of the model and the exact configurations used in experiments could be more thoroughly documented to enhance reproducibility further.
One limitation is the reliance on synthetic data for training, which may not fully capture the complexities of real-world audio-visual interactions. Additionally, while the model shows promise, the paper does not extensively discuss its performance in various cinematic contexts, which could affect its applicability in diverse scenarios.
The proposed framework has significant implications for industries such as film production, dubbing, and remastering, where high-quality audio separation is crucial. By leveraging visual cues, the model could enhance the quality of audio in cinematic experiences, potentially leading to improved user engagement and satisfaction.