Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs accompanied by chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a notable advance in explainable deepfake speech detection.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of CoLMbo-DF, a novel audio language model that integrates structured acoustic features for improved deepfake speech detection and interpretability. This work represents a significant advancement in the field, addressing critical gaps in existing methodologies and providing a foundation for future research in explainable AI and audio analysis.
The proposed methodology introduces CoLMbo-DF, which innovatively integrates structured acoustic features into a language model framework for deepfake detection. By employing a feature-guided approach that grounds reasoning in explicit acoustic evidence, the authors effectively address the limitations of existing models that primarily rely on latent embeddings. The incorporation of chain-of-thought reasoning adds a layer of interpretability, which is crucial for understanding model decisions in deepfake detection. The methodology is well-structured and demonstrates a clear progression from problem identification to solution development.
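To make the feature-guided prompting concrete, the sketch below shows one way low-level acoustic descriptors could be serialized into text and prepended to a detection prompt; the feature set, prompt template, and helper names are illustrative assumptions rather than the authors' released pipeline.

```python
import numpy as np

def extract_acoustic_features(wave: np.ndarray, sr: int) -> dict:
    """Compute a few low-level descriptors (illustrative, not the paper's feature set)."""
    spectrum = np.abs(np.fft.rfft(wave))
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    centroid = float(np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-9))
    rms = float(np.sqrt(np.mean(wave ** 2)))
    zcr = float(np.mean(np.abs(np.diff(np.sign(wave))) > 0))
    return {"spectral_centroid_hz": centroid, "rms_energy": rms, "zero_crossing_rate": zcr}

def build_prompt(features: dict, question: str) -> str:
    """Serialize features into text so the LM can reason over explicit acoustic evidence."""
    evidence = "; ".join(f"{k} = {v:.4f}" for k, v in features.items())
    return (
        "Acoustic evidence: " + evidence + "\n"
        "Task: " + question + "\n"
        "Reason step by step over the evidence, then answer 'bonafide' or 'spoof'."
    )

if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    wave = 0.1 * np.sin(2 * np.pi * 220 * t)  # stand-in for a real utterance
    prompt = build_prompt(extract_acoustic_features(wave, sr),
                          "Is this utterance genuine human speech or synthetic?")
    print(prompt)
```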
The experimental section is robust, showcasing a new dataset of audio pairs with chain-of-thought annotations, which is a significant contribution in itself. The results indicate that CoLMbo-DF outperforms existing baselines, even when trained on a smaller scale model. However, the paper could benefit from a more detailed comparison with a wider range of existing methods and metrics to fully validate the claims of superiority. The evaluation metrics should ideally include both subjective and objective measures to comprehensively assess the model's performance.
The paper lacks detailed implementation specifics that would aid in reproducibility. While the methodology is sound, the absence of code or supplementary materials limits the ability of other researchers to replicate the results. Providing a GitHub repository or supplementary materials with code and data would significantly enhance reproducibility.
One limitation is the reliance on a specific dataset that may not generalize well to all types of deepfake speech. Additionally, while the model improves interpretability, the complexity of integrating structured acoustic features may pose challenges in real-world applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The implications of this research are substantial, particularly in the context of misinformation and digital security. By enhancing deepfake detection systems with interpretable reasoning, the work contributes to the development of more reliable tools for combating audio-based deception. The approach could also be extended to other domains requiring audio analysis and reasoning, such as voice recognition and sentiment analysis.
We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
Primary: Meituan LongCat Team
All Institutions: Meituan LongCat Team
LongCat-AudioDiT presents a significant advancement in text-to-speech synthesis through its innovative approach in the waveform latent space and the introduction of adaptive projection guidance. The comprehensive experimental results and the release of code and model weights contribute to its potential impact on the field, although further exploration of its limitations and broader applicability is warranted.
The methodology presented in LongCat-AudioDiT is innovative, particularly in its non-autoregressive diffusion-based approach to text-to-speech synthesis. By operating directly in the waveform latent space rather than relying on intermediate representations like mel-spectrograms, the authors have simplified the TTS pipeline significantly. The introduction of adaptive projection guidance to replace traditional classifier-free guidance is a noteworthy advancement that enhances generation quality. The paper also addresses a critical training-inference mismatch, showcasing a thoughtful approach to improving model performance. Overall, the methodology is robust and well-structured, with clear innovations that set it apart from existing models.
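The guidance change is the part most amenable to a small worked example. The sketch below contrasts standard classifier-free guidance with one plausible projection-based variant that rescales the component of the guidance direction parallel to the conditional prediction; this is an assumption about the general family of projection-guidance methods, not a reproduction of the paper's adaptive projection guidance.

```python
import numpy as np

def cfg_guidance(eps_cond, eps_uncond, w):
    """Standard classifier-free guidance: extrapolate along (cond - uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def projected_guidance(eps_cond, eps_uncond, w, parallel_scale=0.0):
    """One plausible projection-based variant (illustrative only): split the guidance
    direction into the component parallel to the conditional prediction and the
    orthogonal remainder, then rescale the parallel part, which is often blamed
    for over-saturation at high guidance weights."""
    diff = eps_cond - eps_uncond
    denom = np.dot(eps_cond, eps_cond) + 1e-12
    parallel = (np.dot(diff, eps_cond) / denom) * eps_cond
    orthogonal = diff - parallel
    return eps_cond + (w - 1.0) * (orthogonal + parallel_scale * parallel)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_c, eps_u = rng.normal(size=256), rng.normal(size=256)
    print("cfg norm:      ", np.linalg.norm(cfg_guidance(eps_c, eps_u, w=3.0)))
    print("projected norm:", np.linalg.norm(projected_guidance(eps_c, eps_u, w=3.0)))
```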
The experimental evaluation is thorough, with the authors providing comprehensive results that demonstrate the effectiveness of LongCat-AudioDiT. The paper reports state-of-the-art performance on the Seed benchmark for zero-shot voice cloning, with significant improvements in speaker similarity scores. The use of ablation studies to validate the proposed modules adds credibility to the findings. However, the absence of high-quality human-annotated datasets may limit the generalizability of the results, although the authors mitigate this by achieving competitive intelligibility.
The authors mention that code and model weights are released, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed implementation guidelines and hyperparameter settings to facilitate easier replication of the results by other researchers.
One limitation identified is the reliance on a single benchmark (Seed) for evaluation, which may not fully capture the model's performance across diverse TTS tasks. Additionally, the findings regarding the Wav-VAE's reconstruction fidelity not correlating with TTS performance could indicate a need for further exploration into the underlying mechanisms affecting performance.
The potential applications of LongCat-AudioDiT are significant, particularly in areas requiring high-fidelity speech synthesis, such as virtual assistants, audiobooks, and voice cloning technologies. The model's ability to operate without complex multi-stage training pipelines could democratize access to high-quality TTS systems, fostering innovation in various industries.
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), an attack suite whose evaluation set, AHA-Eval, comprises 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose a 120K QA post-alignment dataset, AHA-Guard, which successfully reduces attack success rates by up to 49%.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park
The paper introduces Audio Hallucination Attacks (AHA), a framework for evaluating audio hallucinations in LALMs through innovative query-based and audio-based attack methodologies. This work is significant as it not only identifies critical vulnerabilities in state-of-the-art models but also proposes effective mitigation strategies, paving the way for more reliable audio-language models in real-world applications.
The methodology is robust, introducing a novel attack suite (AHA-Eval) that effectively evaluates the reliability of Large Audio Language Models (LALMs) through a systematic approach. The dual focus on query-based and audio-based attacks is particularly insightful, allowing for a comprehensive assessment of model vulnerabilities. The data curation and filtering process is well-structured, ensuring high-quality inputs for the evaluation. The use of LLMs for generating hallucinated sounds and the distinction between explicit and implicit queries are innovative contributions that enhance the depth of the analysis.
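As a rough illustration of the query-based attack surface, the sketch below builds presupposition-style questions about sounds that are absent from the audio and computes an attack success rate from model responses; the templates and the string-matching judge are hypothetical stand-ins for the AHA-Eval generation and scoring pipeline.

```python
# Minimal sketch of a query-based hallucination probe and attack-success-rate
# computation; not the AHA-Eval pipeline itself.

ABSENT_SOUND_TEMPLATES = [
    "When does the {sound} start in this clip?",          # presupposes the sound exists
    "How loud is the {sound} compared to the speech?",
    "Describe the {sound} you hear in the background.",
]

def build_probes(absent_sounds):
    """Generate questions that presuppose sounds which are NOT in the audio."""
    return [(s, tpl.format(sound=s)) for s in absent_sounds for tpl in ABSENT_SOUND_TEMPLATES]

def attack_success_rate(responses, refusal_markers=("not present", "no ", "cannot hear", "does not")):
    """A model 'falls' for the probe if it describes the absent sound instead of denying it."""
    hallucinated = [r for r in responses if not any(m in r.lower() for m in refusal_markers)]
    return len(hallucinated) / max(len(responses), 1)

if __name__ == "__main__":
    probes = build_probes(["dog bark", "siren"])
    # Stand-in answers; in practice these come from the LALM under evaluation.
    fake_responses = ["The dog bark starts at 3 seconds.",
                      "There is no siren in this clip."]
    print(f"{len(probes)} probes, ASR on toy responses: {attack_success_rate(fake_responses):.2f}")
```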
The experimental setup is thorough, evaluating multiple state-of-the-art LALMs and providing clear metrics for attack success rates. The results demonstrate significant vulnerabilities in these models, with high ASR values indicating a pressing need for improved grounding mechanisms. The comparison of mitigation strategies, particularly the effectiveness of AHA-Guard, is a valuable addition that highlights practical implications for enhancing model reliability.
The paper provides sufficient detail regarding the experimental setup, including model selection and training procedures, which aids reproducibility. However, the absence of publicly accessible datasets or code limits the ease with which other researchers can replicate the study. Future work should consider releasing the datasets and methodologies used for generating AHA-Eval and AHA-Guard.
One limitation is the reliance on specific LALMs for generating hallucinated sounds, which may not generalize across all audio-language models. Additionally, while the evaluation metrics are well-defined, the subjective nature of audio perception may introduce variability in human assessments that are not fully addressed. The paper also does not explore the long-term implications of these vulnerabilities in real-world applications.
The findings have significant implications for the deployment of LALMs in practical applications, particularly in fields such as automated transcription, audio description, and interactive voice response systems. By highlighting the reliability gaps in these models, the research encourages the development of more robust audio grounding techniques, ultimately enhancing the safety and trustworthiness of AI systems in audio processing.
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations of interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations in current frameworks to replicate the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.
Primary: Unknown
All Institutions: Unknown
This paper presents a significant advancement in multimodal dialogue generation by introducing a comprehensive dataset and evaluation framework that enhances controllability and expressiveness. The methodology and experimental results provide valuable insights into the challenges of replicating human interaction in AI-generated dialogue, paving the way for future research in this area.
The paper introduces a novel multimodal dialogue annotation pipeline that curates dialogues from movies and TV series with fine-grained annotations. This approach is significant as it addresses the limitations of existing datasets in terms of expressiveness and diversity. The methodology for generating the MM-Dia dataset and the MM-Dia-Bench testbed is well-articulated, focusing on both explicit and implicit cross-modal control. However, the paper could benefit from a more detailed explanation of the annotation process and the specific criteria used for dialogue selection.
The experiments conducted demonstrate the effectiveness of the MM-Dia dataset in enhancing controllability in multimodal dialogue generation. The evaluation metrics used, while not explicitly detailed in the abstract, are crucial for assessing the performance of the proposed models. The results indicate that current frameworks struggle to replicate the nuanced expressiveness of human interaction, highlighting an important area for future research. However, the paper could improve by providing more comprehensive quantitative results and comparisons with baseline models.
The paper does not provide sufficient details on the implementation of the models or the datasets used, which raises concerns about reproducibility. Clearer guidelines or links to supplementary materials would enhance the ability of other researchers to replicate the findings.
One significant limitation is the reliance on dialogue from movies and TV series, which may not fully capture the diversity of real-world interactions. Additionally, the paper acknowledges limitations in current frameworks to replicate human expressiveness, suggesting that further work is needed to bridge this gap.
The findings of this research have the potential to significantly impact the field of multimodal dialogue systems, particularly in applications such as virtual assistants, interactive storytelling, and entertainment. By improving controllability and expressiveness in dialogue generation, this work could lead to more engaging and human-like interactions in AI systems.
Speech enhancement in hearing aids remains a difficult task in nonstationary acoustic environments, mainly because current signal processing algorithms rely on fixed, manually tuned parameters that cannot adapt in situ to different users or listening contexts. This paper introduces a unified modular framework that formulates signal processing, learning, and personalization as Bayesian inference with explicit uncertainty tracking. The proposed framework replaces ad hoc algorithm design with a single probabilistic generative model that continuously adapts to changing acoustic conditions and user preferences. It extends spectral subtraction with principled mechanisms for in-situ personalization and adaptation to acoustic context. The system is implemented as an interconnected probabilistic state-space model, and inference is performed via variational message passing in the RxInfer.jl probabilistic programming environment, enabling real-time Bayesian processing under hearing-aid constraints. Proof-of-concept experiments on the VoiceBank+DEMAND corpus show competitive speech quality and noise reduction with 85 effective parameters. The framework provides an interpretable, data-efficient foundation for uncertainty-aware, adaptive hearing-aid processing and points toward devices that learn continuously through probabilistic inference.
Primary: Eindhoven University of Technology
All Institutions: Eindhoven University of Technology, Lazy Dynamics B.V., GN Advanced Science
The main contribution of this paper is the introduction of a unified Bayesian framework for speech enhancement that adapts to user preferences and acoustic conditions in real-time. This work represents a meaningful advancement in the field of audio processing, particularly in the context of hearing aids, by providing a robust and interpretable model that can learn from its environment.
The paper presents a novel approach to speech enhancement by framing the problem within a Bayesian inference framework. This methodology allows for real-time adaptation to varying acoustic environments and user preferences, which is a significant improvement over traditional fixed-parameter algorithms. The use of a probabilistic generative model and variational message passing for inference is well-justified, and the modular architecture enhances the system's flexibility. However, the paper could benefit from a more detailed explanation of the underlying assumptions of the Bayesian model and how they impact the performance in diverse scenarios.
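For readers unfamiliar with the classical baseline the framework generalizes, the sketch below implements plain spectral subtraction with a recursively tracked noise estimate; it is only a point of reference, since the paper's system instead runs variational message passing over a probabilistic state-space model in RxInfer.jl.

```python
import numpy as np

def spectral_subtraction(noisy, sr, frame=512, hop=256, alpha=0.98, floor=0.05):
    """Classical spectral subtraction with a recursive noise-power estimate.
    Illustration of the baseline the paper extends, not the proposed system."""
    window = np.hanning(frame)
    noise_psd = None
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame, hop):
        seg = noisy[start:start + frame] * window
        spec = np.fft.rfft(seg)
        psd = np.abs(spec) ** 2
        # Recursive noise tracking: slow update so speech onsets are not absorbed.
        noise_psd = psd if noise_psd is None else alpha * noise_psd + (1 - alpha) * psd
        gain = np.sqrt(np.maximum(1.0 - noise_psd / (psd + 1e-12), floor))
        clean = np.fft.irfft(gain * spec, n=frame)
        out[start:start + frame] += clean * window
        norm[start:start + frame] += window ** 2
    return out / np.maximum(norm, 1e-8)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr * 2) / sr
    speech_like = 0.3 * np.sin(2 * np.pi * 180 * t) * (np.sin(2 * np.pi * 2 * t) > 0)
    noisy = speech_like + 0.05 * np.random.default_rng(0).normal(size=len(t))
    enhanced = spectral_subtraction(noisy, sr)
    print("input RMS: %.4f, enhanced RMS: %.4f"
          % (np.sqrt(np.mean(noisy ** 2)), np.sqrt(np.mean(enhanced ** 2))))
```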
The experiments conducted on the VoiceBank+DEMAND corpus demonstrate the effectiveness of the proposed framework in terms of speech quality and noise reduction. The results are promising, showing competitive performance with a relatively small number of parameters (85). However, the paper lacks a comprehensive comparison with state-of-the-art methods and does not provide subjective evaluations (e.g., MOS scores) that would strengthen the claims of improved speech quality.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. While the authors mention the use of the RxInfer.jl environment for inference, there is no link to the code or detailed instructions for replicating the experiments, which is critical for the validation of the proposed methods.
One limitation is the reliance on a specific dataset (VoiceBank+DEMAND), which may not fully represent the diversity of real-world acoustic environments. Additionally, the paper does not address potential computational constraints in real-time applications, particularly in terms of the scalability of the model when deployed in actual hearing aids.
The proposed framework has significant implications for the development of adaptive hearing aids that can personalize user experiences in real-time. By enabling continuous learning and adaptation, this research could lead to improved accessibility for individuals with hearing impairments, enhancing their quality of life in various acoustic settings.
Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications, including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
MOSS-VoiceGenerator presents an innovative approach to generating realistic voices from natural language descriptions, addressing key limitations in existing TTS systems. The combination of a diverse training dataset and advanced model architecture positions this work as a significant contribution to the field of audio synthesis and voice generation.
The methodology presented in MOSS-VoiceGenerator is robust and innovative, leveraging a large-scale dataset derived from cinematic content to train a model that generates realistic voices from natural language descriptions without the need for reference audio. The authors employ a comprehensive data collection and annotation process, ensuring a diverse and expressive dataset. The model architecture integrates autoregressive techniques with a discrete framework, which simplifies deployment and enhances instruction-following capabilities. However, the reliance on a specific dataset type (cinematic) may limit generalizability to other domains.
The experimental evaluation is thorough, utilizing both subjective and objective metrics to assess the model's performance. The inclusion of a public benchmark (InstructTTSEval) for objective evaluation adds credibility to the results. The subjective preference studies provide valuable insights into user experience and model performance across different dimensions. The results indicate that MOSS-VoiceGenerator outperforms several existing models, showcasing its effectiveness in generating expressive and natural-sounding speech.
The paper outlines the training strategy and data processing pipeline in detail, which aids in reproducibility. However, the lack of a publicly accessible demo or project URL limits the ability for other researchers to replicate the work easily. Open-sourcing the model and data pipeline would significantly enhance reproducibility and community engagement.
The authors acknowledge several limitations, including the focus on Chinese and English, which restricts language diversity. The English dataset is smaller, potentially affecting performance in English voice generation. Additionally, the denoising process may introduce artifacts, and the model's output can occasionally lack stability. These limitations suggest areas for future improvement and expansion.
MOSS-VoiceGenerator has significant potential applications in various domains such as audiobook narration, game dubbing, and conversational agents, where realistic and expressive voice generation is crucial. The open-source nature of the project could foster further research and development in controllable TTS systems, contributing to advancements in human-computer interaction and accessibility.
We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture, and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models' performance on style caption retrieval, speech attribute classification and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap.
Primary: New York University
All Institutions: New York University, The University of Texas at Austin
ParaSpeechCLAP introduces a dual-encoder model that effectively maps speech and rich textual style descriptions into a common embedding space, significantly advancing the state of the art in style-prompted speech applications. The comprehensive evaluation of its performance across multiple tasks and the innovative use of a classification loss for intrinsic attributes highlight its potential impact on the field of audio machine learning.
The methodology presented in ParaSpeechCLAP is robust, utilizing a dual-encoder architecture that effectively aligns speech and text style captions in a shared embedding space. The introduction of specialized models (ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational) alongside a unified model demonstrates a thoughtful approach to handling diverse stylistic attributes. The use of a multitask loss for the intrinsic model enhances its performance by allowing it to predict specific style attributes, which is a significant methodological advancement. However, the paper could benefit from a more detailed explanation of the encoder architecture and the rationale behind the choice of specific models.
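The training objective can be summarized compactly: a symmetric contrastive loss that aligns speech and caption embeddings, plus an auxiliary attribute-classification loss for the intrinsic model. The sketch below shows this general recipe; the encoder architectures, temperature, and loss weighting used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def clap_style_losses(speech_emb, text_emb, attr_logits, attr_labels, temperature=0.07):
    """Symmetric contrastive loss between speech and caption embeddings, plus an
    auxiliary attribute-classification loss. A sketch of the general recipe only."""
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = speech_emb @ text_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(speech_emb.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))
    classification = F.cross_entropy(attr_logits, attr_labels)  # e.g. a pitch-class label
    return contrastive + 0.5 * classification

if __name__ == "__main__":
    B, D, C = 8, 256, 3
    loss = clap_style_losses(torch.randn(B, D), torch.randn(B, D),
                             torch.randn(B, C), torch.randint(0, C, (B,)))
    print(float(loss))
```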
The experimental evaluation is thorough, with a clear focus on three applications: style caption retrieval, speech attribute classification, and inference-time guidance for TTS. The results indicate that ParaSpeechCLAP consistently outperforms existing baselines across various metrics, showcasing its effectiveness. The use of multiple datasets and evaluation metrics strengthens the findings, although the paper could enhance clarity by providing more context on the datasets used and their relevance to the tasks.
The paper provides a reasonable level of detail regarding the training setup, including hyperparameters and dataset descriptions. The authors have made their models and code publicly available, which is a positive step towards reproducibility. However, some aspects, such as the specific configurations of the encoder architectures and the training process, could be elaborated further to ensure that other researchers can replicate the results without ambiguity.
One limitation noted is the requirement for selecting the appropriate model variant at inference time, which could complicate deployment in practical applications. Additionally, while the unified model shows promise, it does not outperform specialized models on individual tasks, indicating a potential trade-off in performance that needs to be addressed. The paper also mentions the linear scaling of the best-of-N guidance strategy, which may not be efficient for larger N values.
The implications of this research are significant, particularly in enhancing the capabilities of style-prompted TTS systems and expressive speech retrieval. By supporting a broader range of intrinsic and situational attributes, ParaSpeechCLAP could facilitate advancements in various applications, including expressive spoken dialog systems and personalized speech synthesis. The work also sets a foundation for future research in rich style modeling and evaluation benchmarks, which could further enrich the field.
Project VAANI is an initiative to create an India-representative multi-modal dataset that comprehensively maps India's linguistic diversity, starting with 165 districts across the country in its first two phases. Speech data is collected through a carefully structured process that uses image-based prompts to encourage spontaneous responses. Images are captured through a separate process that encompasses a broad range of topics, gathered from both within and across districts. The collected data undergoes a rigorous multi-stage quality evaluation, including both automated and manual checks to ensure the highest possible standards in audio quality and transcription accuracy. Following this thorough validation, we have open-sourced around 289K images, approximately 31,270 hours of audio recordings, and around 2,067 hours of transcribed speech, encompassing 112 languages from 165 districts across 31 States and Union Territories. Notably, a significant share of these languages are represented for the first time in a dataset of this scale, making the VAANI project a groundbreaking effort in preserving and promoting linguistic inclusivity. This data can be instrumental in building inclusive speech models for India, and in advancing research and development across speech, image, and multimodal applications.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science, Robotics Technology Park (ARTPARK), Quest Alliance, Google DeepMind
The VAANI project represents a groundbreaking effort to create a comprehensive dataset that captures India's linguistic diversity, with significant implications for inclusive speech technology development. The methodology is innovative, but the paper would benefit from more detailed experimental evaluations and reproducibility guidelines to maximize its impact in the field.
The methodology employed in the VAANI project is commendable, particularly in its structured approach to data collection and quality evaluation. The use of image-based prompts to elicit spontaneous speech responses is innovative and effectively captures the linguistic diversity of India. The multi-stage quality evaluation process, which includes both automated and manual checks, ensures high standards of audio quality and transcription accuracy, which is crucial for the reliability of the dataset. However, the paper could benefit from a more detailed description of the specific algorithms or techniques used in the quality evaluation process.
The paper outlines a substantial dataset comprising 31,270 hours of audio and 2,067 hours of transcribed speech across 112 languages. This extensive dataset is a significant contribution to the field, particularly for underrepresented languages. However, the paper lacks detailed experimental results demonstrating the effectiveness of the dataset in training inclusive speech models. It would be beneficial to include baseline comparisons or performance metrics to illustrate the dataset's impact on model performance.
The paper does not provide sufficient details regarding the implementation of the data collection and quality evaluation processes, which may hinder reproducibility. While the dataset is open-sourced, additional documentation or guidelines for replicating the data collection methodology would enhance reproducibility.
One limitation of the study is the potential bias in data collection due to the selection of districts and topics for image prompts. Additionally, while the dataset aims to represent linguistic diversity, the focus on only 165 districts may not capture the full spectrum of languages and dialects present in India. The paper also does not address the challenges of data privacy and ethical considerations in collecting speech data from individuals.
The VAANI project has the potential to significantly impact the development of inclusive speech technologies in India, promoting linguistic diversity and accessibility. By providing a comprehensive dataset, it can facilitate research in speech recognition, multimodal applications, and language preservation efforts. This initiative could also inspire similar projects in other linguistically diverse regions, contributing to global efforts in digital inclusivity.
Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio, facial expressions and scene-level cues from video, and semantic context from text, which are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment (PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
Primary: unknown
All Institutions: XYZ agency
The paper presents a novel hierarchical multimodal architecture for evaluating AI-dubbed content, addressing the challenges of subjective quality assessment through innovative methodologies and strong experimental validation. The contributions made in this work are significant, providing a foundation for future advancements in automated audio-visual quality assessment.
The proposed hierarchical multimodal architecture is innovative in its integration of audio, video, and text features for evaluating AI-dubbed content. The use of lightweight LoRA adapters for parameter-efficient fine-tuning is a valuable contribution, allowing for effective adaptation across modalities without extensive computational resources. The two-stage training pipeline, which combines active learning for proxy MOS generation with fine-tuning based on human ratings, demonstrates a thoughtful approach to overcoming the challenges of limited subjective labels. However, the methodology could benefit from a more detailed explanation of the active learning process and the specific metrics used to derive Proxy MOS.
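To illustrate the proxy-MOS idea, the sketch below fits non-negative weights that map a few objective metrics to a scalar score using a small labelled pool; ordinary least squares is used here as a simplified stand-in for the paper's active-learning weight optimization, and the metric names are assumptions.

```python
import numpy as np

def fit_proxy_mos_weights(objective_scores, human_mos):
    """Fit non-negative weights that map objective metrics to a proxy MOS.
    Least squares on a small labelled pool is a simplified stand-in for the
    paper's active-learning weight optimization."""
    w, *_ = np.linalg.lstsq(objective_scores, human_mos, rcond=None)
    return np.clip(w, 0.0, None)

def proxy_mos(objective_scores, weights):
    return objective_scores @ weights

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Columns stand for illustrative metrics, e.g. a lip-sync score, a
    # speaker-similarity score, and an ASR-based intelligibility score.
    metrics = rng.uniform(0.0, 1.0, size=(50, 3))
    human = metrics @ np.array([2.0, 1.5, 1.0]) + rng.normal(0, 0.1, size=50)
    w = fit_proxy_mos_weights(metrics[:20], human[:20])       # small "labelled" pool
    preds = proxy_mos(metrics[20:], w)
    corr = np.corrcoef(preds, human[20:])[0, 1]
    print("learned weights:", np.round(w, 2), " held-out PCC: %.3f" % corr)
```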
The experiments are well-structured, utilizing two publicly available datasets (MELD and M2H2) to validate the proposed model. The use of subjective evaluation with a diverse participant group strengthens the findings, and the reported results show a strong correlation with human ratings, indicating the model's effectiveness. The ablation studies provide insights into the contributions of each modality, although more detailed statistical analysis could enhance the robustness of the claims. The results demonstrate that the hierarchical integration of modalities significantly improves performance, which is a critical finding for the field.
The paper provides a reasonable level of detail regarding the experimental setup, including the datasets, training parameters, and evaluation metrics. However, the absence of a publicly accessible code repository or demo limits the reproducibility of the results. Including such resources would significantly enhance the paper's impact and facilitate further research in this area.
The primary limitation is the reliance on human ratings for fine-tuning, which can introduce bias and variability. Additionally, the model's performance may be influenced by the quality and diversity of the training data, which could affect its generalizability to other languages or dubbing contexts. The paper does not address potential challenges in scaling the model to larger datasets or different languages, which could limit its applicability.
The proposed architecture has significant implications for the field of AI-generated content evaluation, particularly in enhancing the quality of dubbed media. By providing a scalable solution for automatic assessment, this research could facilitate the widespread adoption of AI dubbing technologies in various industries, including entertainment and education. Furthermore, the integration of multimodal cues aligns with current trends in AI, promoting more human-centered approaches to content generation and evaluation.
We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). As audio encodes non-semantic information, it induces severe train and test distribution shifts and can lead to spurious MIA performance. Using a multi-modal blind baseline based on textual, spectral, and prosodic features, we demonstrate that common speech datasets exhibit near-perfect train/test separability (AUC approximately 1.0) even without model inference, and the standard MIA scores strongly correlate with these blind acoustic artifacts (correlation greater than 0.7). Using this blind baseline, we identify that distribution-matched datasets enable reliable MIA evaluation without distribution shift confounds. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations.
Primary: National Taiwan University
All Institutions: National Taiwan University
The main contribution of this paper is the establishment of a principled standard for auditing LALMs through the identification of distribution-matched datasets and the systematic evaluation of MIA methods. This work is significant as it addresses a critical gap in understanding the vulnerabilities of audio models and sets the stage for future research in this area.
The paper introduces a systematic approach to Membership Inference Attacks (MIA) against Large Audio Language Models (LALMs). The authors utilize a multi-modal blind baseline that incorporates textual, spectral, and prosodic features to evaluate MIA performance. This methodology is innovative as it highlights the challenges posed by non-semantic information in audio datasets, which can lead to misleading MIA results. The identification of distribution-matched datasets for reliable MIA evaluation is a significant methodological contribution, as it provides a clearer framework for understanding model vulnerabilities.
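The blind-baseline check is easy to reproduce in spirit: train a classifier on features that require no model queries and measure how well it already separates members from non-members. The sketch below does this with synthetic "blind" features and scikit-learn; the feature choices are illustrative, not the paper's exact textual, spectral, and prosodic set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def blind_separability_auc(member_feats, nonmember_feats, seed=0):
    """Train a classifier on blind features alone (no model queries) and report AUC.
    A high AUC means the member/non-member split is separable from dataset artifacts,
    so any MIA score measured on it is confounded."""
    X = np.vstack([member_feats, nonmember_feats])
    y = np.concatenate([np.ones(len(member_feats)), np.zeros(len(nonmember_feats))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy blind features (e.g. duration, mean pitch, speaking rate); the shift between
    # the two pools mimics the distribution gap the paper reports in common corpora.
    members = rng.normal(loc=[3.0, 180.0, 4.5], scale=[0.5, 20.0, 0.5], size=(500, 3))
    nonmembers = rng.normal(loc=[4.0, 200.0, 5.0], scale=[0.5, 20.0, 0.5], size=(500, 3))
    print("blind-baseline AUC: %.3f" % blind_separability_auc(members, nonmembers))
```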
The experiments benchmark multiple MIA methods and include modality disentanglement experiments, which are well-structured and provide insightful results. The correlation of standard MIA scores with blind acoustic artifacts is particularly noteworthy, revealing the potential pitfalls in existing evaluation metrics. The authors present a comprehensive analysis of their findings, demonstrating that LALM memorization is cross-modal, which adds depth to the experimental evaluation.
The paper mentions the use of generative AI tools for manuscript preparation and software development, which raises questions about the reproducibility of the experimental results. However, the authors state that the core ideas and analyses are their original work. The lack of a dedicated project or code repository limits the ability for others to reproduce the experiments fully, which is a significant drawback.
One limitation is the reliance on specific datasets, which may not generalize to all audio language models. Additionally, while the paper identifies distribution shifts as a critical factor in MIA evaluation, it does not explore the implications of these shifts in detail. The absence of a demo or project URL further limits the accessibility of the findings.
The findings have significant implications for the auditing of LALMs, particularly in understanding model vulnerabilities and the risks associated with spurious correlations in audio data. This research could influence future work in the field, particularly in the design of more robust models and evaluation frameworks that account for non-semantic information in audio.
Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the EvA architecture, which effectively addresses the evidence bottleneck in large audio language models through a dual-path approach that preserves acoustic evidence before reasoning. This work significantly advances the state of the art in audio understanding, demonstrating the importance of upstream perception in achieving better performance in complex acoustic scenes.
The proposed EvA architecture introduces a dual-path system that effectively combines speech and non-speech audio processing through hierarchical evidence aggregation and non-compressive, time-aligned fusion. This innovative approach addresses the identified evidence bottleneck in existing LALMs, allowing for improved acoustic evidence retention before reasoning. The methodology is well-structured, with clear explanations of the dual-path architecture and the training process, including the creation of the EvA-Perception dataset. However, while the architecture is novel, it builds on existing frameworks such as Whisper and CED-Base, which may limit the perceived originality of the approach.
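As a rough illustration of the non-compressive, time-aligned fusion described above, the sketch below aggregates several intermediate feature maps with learned weights, resamples the result to the primary (Whisper-like) frame rate, and adds it to the primary stream without changing sequence length. Module names, shapes, and the interpolation choice are assumptions, not the released EvA code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TimeAlignedAdditiveFusion(nn.Module):
    """Sketch: aggregate auxiliary encoder layers, align them to the primary
    timeline, and add them residually (sequence length is unchanged)."""

    def __init__(self, num_aux_layers: int, aux_dim: int, primary_dim: int):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_aux_layers))  # learned layer mix
        self.proj = nn.Linear(aux_dim, primary_dim)

    def forward(self, primary: torch.Tensor, aux_layers: list) -> torch.Tensor:
        # primary: (B, T_p, D_p); each aux layer: (B, T_a, D_a)
        w = torch.softmax(self.layer_weights, dim=0)
        aux = sum(wi * h for wi, h in zip(w, aux_layers))          # multi-scale aggregation
        aux = F.interpolate(aux.transpose(1, 2), size=primary.size(1),
                            mode="linear", align_corners=False)    # align to primary timeline
        return primary + self.proj(aux.transpose(1, 2))            # additive, non-compressive
```

The key property being illustrated is that no pooling or token compression happens before fusion, so fine-grained acoustic evidence survives into the reasoning stage.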
The experiments conducted demonstrate the effectiveness of the EvA architecture across multiple benchmarks (MMAU, MMAR, MMSU, and CochlScene), with significant improvements in perception-heavy tasks. The results are compelling, showcasing the model's ability to outperform existing systems, particularly in preserving acoustic evidence. The use of a unified zero-shot protocol for evaluation adds rigor to the experimental design. However, the paper could benefit from more detailed comparisons with a broader range of existing models to contextualize the improvements more effectively.
The paper provides a comprehensive overview of the training strategy, including hyperparameters and the two-stage training process. However, the lack of specific implementation details and code availability may hinder full reproducibility. The authors mention that the EvA model is open-source, which is a positive aspect, but the absence of a direct link to the repository limits accessibility for other researchers.
The paper acknowledges several limitations, including the focus on English-only captions in the EvA-Perception dataset and the need for more systematic multilingual evaluation. Additionally, the temporal reasoning capabilities are constrained by the soft event boundaries in the training data, and the model's performance on music analysis is limited by the lack of expert-level concepts. These limitations suggest areas for future work and improvement.
The advancements made by the EvA architecture have significant implications for audio understanding applications, particularly in complex acoustic environments. By improving the retention of acoustic evidence, the model can enhance various tasks, including audio captioning, event detection, and question answering. The open-source nature of the dataset and model also encourages further research and development in the field, potentially leading to more robust audio processing systems in diverse applications.
Large Language Models (LLMs) are strong decoders for Serialized Output Training (SOT) in two-talker Automatic Speech Recognition (ASR), yet their performance degrades substantially in challenging conditions such as three-talker mixtures. A key limitation is that current systems inject acoustic evidence only through a projected prefix, which can be lossy and imperfectly aligned with the LLM input space, providing insufficient fine-grained grounding during decoding. Addressing this limitation is crucial for robust multi-talker ASR, especially in three-talker mixtures. This paper improves LLM-based multi-talker ASR by explicitly injecting talker-aware acoustic evidence into the decoder. We first revisit Connectionist Temporal Classification (CTC)-derived prefix prompting and compare three variants with increasing acoustic content. The CTC information is obtained using the serialized CTC proposed in our previous works. While acoustic-enriched prompts outperform the SOT-only baseline, prefix-only conditioning remains inadequate for three-talker mixtures. We therefore propose a lightweight gated residual cross-attention adapter and design a two-stage acoustic adaptation framework based on low-rank updates (LoRA). In Stage 1, we insert gated cross-attention adapters after the self-attention sub-layer to stably inject acoustic embeddings as external memory. In Stage 2, we refine both the cross-attention adapters and the pretrained LLM's self-attention projections using parameter-efficient LoRA, improving robustness for large backbones under limited data; the learned updates are merged into the base weights for inference. Experiments on Libri2Mix/Libri3Mix under clean and noisy conditions show consistent gains, with particularly large improvements in three-talker settings.
Primary: Kyoto University
All Institutions: Kyoto University, National Institute of Information and Communications Technology
This paper presents a significant advancement in multi-talker ASR by introducing a two-stage acoustic adaptation framework that enhances the integration of acoustic evidence into LLMs. The innovative methodology and promising experimental results position it as a valuable contribution to the field of audio processing and machine learning.
The paper proposes a two-stage acoustic adaptation framework that integrates gated cross-attention adapters into a large language model (LLM) for multi-talker automatic speech recognition (ASR). The methodology is innovative in addressing the limitations of prefix-only conditioning by dynamically injecting talker-aware acoustic evidence during decoding. The use of low-rank updates (LoRA) for parameter-efficient adaptation is particularly noteworthy, as it enhances the model's robustness under limited data conditions. The systematic exploration of CTC-derived prefix prompting variants adds depth to the methodology, although the paper could benefit from a clearer description of the experimental setup and hyperparameter tuning processes.
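A gated residual cross-attention adapter of the kind described can be sketched as follows: the decoder hidden states query an external acoustic memory after the self-attention sub-layer, and a zero-initialized gate lets the injection start as an identity mapping for stable Stage 1 training. Names and dimensions are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn as nn


class GatedCrossAttentionAdapter(nn.Module):
    """Sketch: inject acoustic embeddings as external memory into an LLM decoder layer."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: adapter starts as identity

    def forward(self, hidden: torch.Tensor, acoustic_memory: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T_text, D); acoustic_memory: (B, T_audio, D) talker-aware embeddings
        attn_out, _ = self.cross_attn(self.norm(hidden), acoustic_memory, acoustic_memory)
        return hidden + torch.tanh(self.gate) * attn_out  # gated residual injection
```

In a second stage, one would additionally wrap the backbone's attention projections with low-rank (LoRA) updates and merge them into the base weights for inference, mirroring the two-stage recipe the summary describes; the exact LoRA configuration is not reproduced here.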
The experiments are well-structured, utilizing the Libri2Mix and Libri3Mix datasets to evaluate the proposed methods under both clean and noisy conditions. The results demonstrate consistent performance gains, particularly in three-talker scenarios, validating the effectiveness of the proposed gated cross-attention adapters. However, the paper lacks detailed statistical analysis of the results and comparisons with state-of-the-art methods beyond the baseline systems, which would strengthen the claims of improvement.
The paper provides a comprehensive overview of the model architecture and training procedures, but it lacks specific implementation details that would facilitate reproducibility, such as exact hyperparameter settings and training schedules. The absence of a publicly available code repository further hinders reproducibility efforts.
A notable limitation is the reliance on the performance of the gated cross-attention mechanism, which, while effective, still falls short of the robustness offered by the serialized CTC approach in certain scenarios. Additionally, the paper does not address the potential computational overhead introduced by the proposed adaptations, which may limit practical deployment in real-time systems.
The advancements presented in this paper have significant implications for multi-talker ASR systems, particularly in applications such as conference transcription, voice assistants, and accessibility technologies. By improving the ability of LLMs to handle overlapping speech, the research could enhance communication tools for diverse user groups, including those with hearing impairments.
While self-supervised learning (SSL) has revolutionized audio representation, the excessive parameterization and quadratic computational cost of standard Transformers limit their deployment on resource-constrained devices. To address this bottleneck, we propose HEAR (Human-inspired Efficient Audio Representation), a novel decoupled architecture. Inspired by the human cognitive ability to isolate local acoustic features from global context, HEAR splits the processing pipeline into two dedicated modules: an Acoustic Model for local feature extraction and a Task Model for global semantic integration. Coupled with an Acoustic Tokenizer trained via knowledge distillation, our approach enables robust Masked Audio Modeling (MAM). Extensive experiments demonstrate that HEAR requires only 15M parameters and 9.47 GFLOPs for inference, operating at a fraction of the computational cost of conventional foundation models (which typically require 85M-94M parameters). Despite this high efficiency, HEAR achieves highly competitive performance across diverse audio classification benchmarks. The code and pre-trained models are available at https://github.com/HarunoriKawano/HEAR.
Primary: University of Technology Sydney
All Institutions: University of Technology Sydney, Shibaura Institute of Technology
The paper presents HEAR, a human-inspired audio representation framework that effectively decouples local feature extraction from global task adaptation, achieving high efficiency and competitive performance in audio classification tasks. This innovative approach addresses critical computational challenges in deploying self-supervised learning models in resource-constrained environments, marking a significant advancement in the field of audio representation learning.
The proposed HEAR framework introduces a decoupled architecture inspired by human auditory processing, effectively separating local feature extraction and global task adaptation. This innovative approach addresses the computational inefficiencies of standard Transformer architectures by utilizing an Acoustic Model and a Task Model, which enhances the model's efficiency while maintaining competitive performance across various audio classification tasks. The use of knowledge distillation for an Acoustic Tokenizer further strengthens the model's ability to learn robust audio representations.
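The decoupled design with masked audio modeling can be pictured as a local encoder producing frame features and a global task model trained to predict the (distilled) tokenizer's ids at masked positions. The skeleton below shows only that masking-and-prediction step under assumed shapes and module names; it is not the released HEAR code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def masked_audio_modeling_loss(acoustic_model: nn.Module,
                               task_model: nn.Module,
                               features: torch.Tensor,       # (B, T, F) raw frame features
                               target_tokens: torch.Tensor,  # (B, T) ids from a frozen tokenizer
                               mask_ratio: float = 0.5) -> torch.Tensor:
    """Sketch of MAM: the local Acoustic Model extracts frame features, some frames are
    masked, and the global Task Model fills them in by predicting tokenizer ids."""
    local = acoustic_model(features)                          # (B, T, D) local features
    mask = torch.rand(local.shape[:2], device=local.device) < mask_ratio
    local = local.masked_fill(mask.unsqueeze(-1), 0.0)        # hide masked frames' content
    logits = task_model(local)                                # (B, T, V) global integration
    return F.cross_entropy(logits[mask], target_tokens[mask])
```

Because only the small Task Model needs global attention, the quadratic cost is paid over a much cheaper representation, which is the efficiency argument the review highlights.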
The experiments conducted are extensive, utilizing a large-scale dataset of over 10,000 hours of audio for pre-training. The paper evaluates the model's performance on multiple benchmark datasets, demonstrating its competitive accuracy despite significantly reduced parameters and computational costs. The ablation studies provide insights into the contributions of various components of the architecture, validating the effectiveness of the proposed methods.
The paper includes detailed descriptions of the architecture, training procedures, and hyperparameter settings, which should facilitate reproducibility. The availability of code and pre-trained models on GitHub further supports this aspect, although the lack of a demo URL may limit immediate accessibility for practical applications.
While the HEAR architecture shows promise, its performance on certain tasks is still lower compared to larger models, indicating a trade-off between efficiency and accuracy. Additionally, the reliance on specific datasets for training and evaluation may limit the generalizability of the findings to other audio tasks.
The HEAR framework has significant implications for real-time audio processing on resource-constrained devices, making it suitable for applications in mobile devices, IoT, and edge computing. Its efficiency could enable broader adoption of advanced audio representation techniques in practical applications, such as speech recognition and environmental sound classification.
The rapid advancement of generative models has enabled highly realistic audio deepfakes, yet current detectors suffer from a critical bias problem, leading to poor generalization across unseen datasets. This paper proposes Artifact-Focused Self-Synthesis (AFSS), a method designed to mitigate this bias by generating pseudo-fake samples from real audio via two mechanisms: self-conversion and self-reconstruction. The core insight of AFSS lies in enforcing same-speaker constraints, ensuring that real and pseudo-fake samples share identical speaker identity and semantic content. This forces the detector to focus exclusively on generation artifacts rather than irrelevant confounding factors. Furthermore, we introduce a learnable reweighting loss to dynamically emphasize synthetic samples during training. Extensive experiments across 7 datasets demonstrate that AFSS achieves state-of-the-art performance with an average EER of 5.45%, including a significant reduction to 1.23% on WaveFake and 2.70% on In-the-Wild, all while eliminating the dependency on pre-collected fake datasets. Our code is publicly available at https://github.com/NguyenLeHaiSonGit/AFSS.
Primary: University of Science, Ho Chi Minh City
All Institutions: University of Science, Ho Chi Minh City, University College Dublin
The paper presents AFSS, a novel approach to bias mitigation in audio deepfake detection, significantly advancing the field by focusing on generation artifacts while eliminating confounding factors. The methodology's innovative design and rigorous experimental validation position it as a meaningful contribution to audio machine learning research.
The proposed Artifact-Focused Self-Synthesis (AFSS) method is innovative in its approach to mitigating bias in audio deepfake detection. By generating pseudo-fake samples through self-conversion and self-reconstruction mechanisms while enforcing same-speaker constraints, the methodology effectively addresses the critical issue of irrelevant confounding factors that typically hinder detection performance across different datasets. The introduction of a learnable reweighting loss further enhances the model's ability to focus on synthetic samples, which is a significant advancement in training strategies for audio deepfake detection.
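The learnable reweighting idea can be sketched as a per-source weight learned jointly with the detector, so that pseudo-fake samples are emphasized dynamically rather than with a fixed coefficient. The formulation below is an illustrative uncertainty-style weighting with an ad-hoc regularizer, not necessarily the paper's exact loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReweightedDetectionLoss(nn.Module):
    """Sketch: BCE over real vs. pseudo-fake with a learnable emphasis on synthetic samples."""

    def __init__(self):
        super().__init__()
        self.log_w_fake = nn.Parameter(torch.zeros(1))  # learnable log-weight for pseudo-fakes

    def forward(self, logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # logits, labels: (B,); label 1 = pseudo-fake (self-converted / self-reconstructed), 0 = real
        per_sample = F.binary_cross_entropy_with_logits(
            logits, labels.float(), reduction="none")
        w_fake = torch.exp(self.log_w_fake)
        weights = torch.where(labels.bool(),
                              w_fake.expand_as(per_sample),
                              torch.ones_like(per_sample))
        # Penalize extreme weights so the emphasis on synthetic samples stays bounded.
        return (weights * per_sample).mean() + 0.5 * self.log_w_fake.pow(2).mean()
```

Because the pseudo-fakes share speaker identity and content with the real audio by construction, whatever the detector learns to separate under this loss should be attributable to generation artifacts rather than to confounds.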
The experiments conducted across seven diverse datasets provide a robust evaluation of the proposed method. The reported results demonstrate state-of-the-art performance with an average Equal Error Rate (EER) of 5.45%, showcasing significant improvements over existing methods, particularly in challenging real-world scenarios. The comprehensive comparison with multiple baselines and disentanglement methods reinforces the effectiveness of AFSS in achieving superior cross-domain generalization.
The paper provides detailed implementation details, including the architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The availability of the code on GitHub enhances the potential for other researchers to replicate and build upon the work.
While the AFSS method shows promise, it may still be limited by the quality and diversity of the real audio samples used for generating pseudo-fake samples. Additionally, the reliance on specific transformation techniques may not generalize to all types of audio deepfakes, potentially limiting the method's applicability in broader contexts.
The implications of this research are significant, as it addresses a pressing need for reliable audio deepfake detection methods in an era where generative models are becoming increasingly sophisticated. The ability to mitigate bias effectively can enhance the robustness of detection systems, contributing to the prevention of misuse in various applications, including security, media integrity, and misinformation.
Large language model (LLM)-based text-to-speech (TTS) systems achieve remarkable naturalness via autoregressive (AR) decoding, but require N sequential steps to generate N speech tokens. We present LLaDA-TTS, which replaces the AR LLM with a masked diffusion model that completes generation in a fixed number of parallel steps, decoupling inference latency from sequence length. Remarkably, using only 50 hours of fine-tuning data, we successfully transfer a pretrained AR checkpoint to the masked diffusion paradigm via bidirectional attention. At 64 steps, LLaDA-TTS achieves 0.98% CER (zh) and 1.96% WER (en) on Seed-TTS-Eval, matching the original CosyVoice 3 baseline performance while delivering a 2x LLM-stage speedup, a notable acceleration achieved despite the absence of a KV cache, an optimization on which the AR baseline heavily relies. Beyond acceleration, the bidirectional architecture naturally enables zero-shot speech editing (word-level insertion, deletion, and substitution) without any additional training. Theoretically, we prove that AR-pretrained weights are near-optimal for bidirectional masked prediction under the locality property of acoustic tokens, explaining this rapid convergence. This general method modifies only the attention mask and objective, applying seamlessly to any LLM-based AR TTS system. Code and audio samples will be available at https://deft-piroshki-b652b5.netlify.app/.
Primary: Bairong
All Institutions: Bairong
The main contribution of this paper is the introduction of LLaDA-TTS, a novel masked diffusion model for TTS that decouples inference latency from sequence length while enabling zero-shot speech editing capabilities. This work represents a significant advancement in the efficiency and functionality of TTS systems, with implications for both research and practical applications in the field.
The paper introduces LLaDA-TTS, which innovatively replaces the autoregressive (AR) text-to-speech (TTS) model with a masked diffusion model. This approach allows for parallel generation of speech tokens, significantly reducing inference latency. The methodology is well-grounded in theoretical analysis, particularly the proof that AR-pretrained weights are near-optimal for bidirectional masked prediction. The architecture is adaptable to various LLM-based TTS systems, enhancing its applicability. The use of a bidirectional attention mechanism enables zero-shot speech editing, which is a notable advancement in TTS technology. However, the reliance on a specific initialization and the necessity of determining output length in advance may limit flexibility.
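The fixed-step parallel decoding can be sketched as iterative confidence-based unmasking: start from a fully masked token sequence of the target length, predict every position with bidirectional attention, commit the most confident predictions, and repeat for a fixed number of steps. The model interface and the linear unmasking schedule below are assumptions for illustration, not the LLaDA-TTS code.

```python
import torch


@torch.no_grad()
def masked_diffusion_decode(model, cond_ids: torch.Tensor, out_len: int,
                            mask_id: int, num_steps: int = 64) -> torch.Tensor:
    """Sketch: generate `out_len` speech tokens in `num_steps` parallel refinement steps.

    `model(cond_ids, tokens)` is assumed to return logits of shape (B, out_len, V)
    using bidirectional attention over the condition and the token slots.
    """
    B = cond_ids.size(0)
    tokens = torch.full((B, out_len), mask_id, dtype=torch.long, device=cond_ids.device)
    for step in range(num_steps):
        logits = model(cond_ids, tokens)                    # predict every position at once
        probs, preds = logits.softmax(-1).max(-1)           # per-position confidence, (B, out_len)
        still_masked = tokens.eq(mask_id)
        # Linear schedule: after step k+1, keep the (k+1)/num_steps most confident positions.
        num_keep = int(out_len * (step + 1) / num_steps)
        conf = probs.masked_fill(~still_masked, -1.0)       # rank only still-masked slots
        ranked = conf.argsort(dim=-1, descending=True)
        already = (~still_masked).sum(-1)                   # positions committed so far
        for b in range(B):
            take = max(num_keep - int(already[b]), 0)
            idx = ranked[b, :take]
            tokens[b, idx] = preds[b, idx]
    return tokens
```

This also makes the editing use case intuitive: to substitute a word, one can re-mask only the affected token span and rerun the same refinement loop with the rest of the sequence held fixed.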
The experiments conducted on the Seed-TTS-Eval benchmark demonstrate the efficacy of LLaDA-TTS, achieving competitive character and word error rates while maintaining a significant speed advantage over traditional AR models. The results are well-documented, with comparisons against established baselines, and the paper provides a thorough analysis of the performance metrics. The findings on emergent behaviors, such as AR-like unmasking and attention alignment, are particularly noteworthy, showcasing the model's ability to leverage learned structures from pretraining effectively.
The paper mentions that code and audio samples will be available, which is a positive aspect for reproducibility. However, details on the training process, hyperparameters, and specific configurations used in experiments could be more explicitly outlined to enhance reproducibility further.
The primary limitations include the requirement to specify output length in advance and the non-streaming nature of the current implementation. These factors may hinder real-time applications and flexibility in various use cases. Additionally, while the model shows promise in zero-shot editing, the effectiveness of this feature across diverse scenarios remains to be fully validated.
LLaDA-TTS has the potential to significantly impact the field of speech synthesis by providing a faster and more flexible approach to generating high-quality speech. The ability to perform zero-shot editing could revolutionize applications in voiceovers, audiobooks, and interactive voice response systems. The findings may also inspire further research into the integration of diffusion models in other generative tasks within audio and beyond.
Cinematic Audio Source Separation (CASS) aims to decompose mixed film audio into speech, music, and sound effects, enabling applications like dubbing and remastering. Existing CASS approaches are audio-only, overlooking the inherent audio-visual nature of films, where sounds often align with visual cues. We present the first framework for audio-visual CASS (AV-CASS), leveraging visual context to enhance separation quality. Our method formulates CASS as a conditional generative modeling problem using conditional flow matching, enabling multimodal audio source separation. To address the lack of cinematic datasets with isolated sound tracks, we introduce a training data synthesis pipeline that pairs in-the-wild audio and video streams (e.g., facial videos for speech, scene videos for effects) and design a dedicated visual encoder for this dual-stream setup. Trained entirely on synthetic data, our model generalizes effectively to real-world cinematic content and achieves strong performance on synthetic, real-world, and audio-only CASS benchmarks. Code and demo are available at https://cass-flowmatching.github.io.
Primary: Graduate School of Artificial Intelligence
All Institutions: Graduate School of Artificial Intelligence
The main contribution of this paper is the introduction of a novel audio-visual source separation framework that utilizes visual cues to enhance the quality of audio separation in cinematic contexts. This work represents a significant advancement in the field of audio processing, particularly in integrating multimodal data to improve performance in complex audio environments.
The paper introduces a novel framework for audio-visual CASS (AV-CASS) that integrates visual cues into the audio source separation process. The approach formulates the problem as a conditional generative modeling task using conditional flow matching, which is innovative in the context of audio separation. The introduction of a dedicated visual encoder for a dual-stream setup is a significant methodological advancement, allowing the model to leverage visual context effectively. The training data synthesis pipeline is also noteworthy, as it addresses the challenge of limited cinematic datasets by pairing in-the-wild audio and video streams. Overall, the methodology is well-structured and presents a clear advancement over existing audio-only methods.
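The conditional flow matching formulation can be illustrated with a single training step: sample a time t, linearly interpolate between noise and the target stem, and regress the constant velocity along that path, conditioning the network on the mixture and the visual embeddings. The network interface and feature shapes below are assumptions used for illustration, not the released AV-CASS code.

```python
import torch
import torch.nn.functional as F


def cfm_training_step(velocity_net, target_stem: torch.Tensor,
                      mixture: torch.Tensor, visual_emb: torch.Tensor) -> torch.Tensor:
    """Sketch of a conditional flow matching loss for one separated stem.

    target_stem: (B, T, D) features of the isolated source (e.g., speech).
    mixture:     (B, T, D) features of the full film mix (condition).
    visual_emb:  (B, T_v, D_v) outputs of the dual-stream visual encoder (condition).
    velocity_net(x_t, t, mixture, visual_emb) is assumed to predict a velocity field.
    """
    x0 = torch.randn_like(target_stem)                      # noise sample
    t = torch.rand(target_stem.size(0), 1, 1, device=target_stem.device)
    x_t = (1.0 - t) * x0 + t * target_stem                  # straight-line probability path
    v_target = target_stem - x0                             # constant velocity along the path
    v_pred = velocity_net(x_t, t.view(-1), mixture, visual_emb)
    return F.mse_loss(v_pred, v_target)
```

At inference, one would integrate the learned velocity field from noise to a separated stem for each target source, with the mixture and visual context held fixed as conditions.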
The experiments are comprehensive, evaluating the model's performance on synthetic, real-world, and audio-only CASS benchmarks. The results indicate strong performance across these benchmarks, demonstrating the model's ability to generalize from synthetic data to real-world scenarios. However, the paper could benefit from more detailed quantitative metrics and comparisons against state-of-the-art methods to strengthen the evaluation of its contributions.
The paper provides a demo and code repository, which is a positive aspect for reproducibility. However, the details regarding the implementation of the model and the exact configurations used in experiments could be more thoroughly documented to enhance reproducibility further.
One limitation is the reliance on synthetic data for training, which may not fully capture the complexities of real-world audio-visual interactions. Additionally, while the model shows promise, the paper does not extensively discuss its performance in various cinematic contexts, which could affect its applicability in diverse scenarios.
The proposed framework has significant implications for industries such as film production, dubbing, and remastering, where high-quality audio separation is crucial. By leveraging visual cues, the model could enhance the quality of audio in cinematic experiences, potentially leading to improved user engagement and satisfaction.