Deepfake speech detection systems are often limited to binary classification and struggle to produce interpretable reasoning or context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful way. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs, each accompanied by chain-of-thought annotations. Experiments show that our method, trained on a lightweight open-source language model, substantially outperforms existing audio language model baselines despite its smaller scale, marking a significant advancement in explainable deepfake speech detection.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of CoLMbo-DF, a novel audio language model that integrates structured acoustic features for improved deepfake speech detection and interpretability. This work represents a significant advancement in the field, addressing critical gaps in existing methodologies and providing a foundation for future research in explainable AI and audio analysis.
The proposed methodology introduces CoLMbo-DF, which innovatively integrates structured acoustic features into a language model framework for deepfake detection. By employing a feature-guided approach that grounds reasoning in explicit acoustic evidence, the authors effectively address the limitations of existing models that primarily rely on latent embeddings. The incorporation of chain-of-thought reasoning adds a layer of interpretability, which is crucial for understanding model decisions in deepfake detection. The methodology is well-structured and demonstrates a clear progression from problem identification to solution development.
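The feature-injection step described above amounts to prompt templating over acoustic measurements. The sketch below is a hypothetical illustration of that pattern; the feature names, values, and prompt wording are invented, not taken from the paper.

```python
# Hypothetical sketch of feature-guided prompting: low-level acoustic
# measurements are serialized into a structured evidence block that is
# prepended to the question. Feature names, values, and wording are
# illustrative, not taken from the paper.

def features_to_prompt(features: dict, question: str) -> str:
    """Render acoustic measurements as labeled evidence lines."""
    lines = [f"- {name}: {value:.2f}" for name, value in sorted(features.items())]
    return (
        "Acoustic evidence:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}\n"
        + "Reason step by step over the evidence before answering."
    )

prompt = features_to_prompt(
    {"f0_mean_hz": 182.4, "jitter_pct": 1.31, "spectral_tilt_db": -11.75},
    "Is this utterance bona fide or spoofed?",
)
```

Grounding the chain of thought in explicit evidence lines like these is what lets a reader audit which acoustic cues the model's answer appeals to.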
The experimental section is robust, showcasing a new dataset of audio pairs with chain-of-thought annotations, which is a significant contribution in itself. The results indicate that CoLMbo-DF outperforms existing baselines, even when built on a smaller-scale language model. However, the paper could benefit from a more detailed comparison with a wider range of existing methods and metrics to fully validate the claims of superiority. The evaluation metrics should ideally include both subjective and objective measures to comprehensively assess the model's performance.
The paper lacks detailed implementation specifics that would aid in reproducibility. While the methodology is sound, the absence of code or supplementary materials limits the ability of other researchers to replicate the results. Providing a GitHub repository or supplementary materials with code and data would significantly enhance reproducibility.
One limitation is the reliance on a specific dataset that may not generalize well to all types of deepfake speech. Additionally, while the model improves interpretability, the complexity of integrating structured acoustic features may pose challenges in real-world applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The implications of this research are substantial, particularly in the context of misinformation and digital security. By enhancing deepfake detection systems with interpretable reasoning, the work contributes to the development of more reliable tools for combating audio-based deception. The approach could also be extended to other domains requiring audio analysis and reasoning, such as voice recognition and sentiment analysis.
We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
Primary: Meituan LongCat Team
All Institutions: Meituan LongCat Team
LongCat-AudioDiT presents a significant advancement in text-to-speech synthesis through its innovative approach in the waveform latent space and the introduction of adaptive projection guidance. The comprehensive experimental results and the release of code and model weights contribute to its potential impact on the field, although further exploration of its limitations and broader applicability is warranted.
The methodology presented in LongCat-AudioDiT is innovative, particularly in its non-autoregressive diffusion-based approach to text-to-speech synthesis. By operating directly in the waveform latent space rather than relying on intermediate representations like mel-spectrograms, the authors have simplified the TTS pipeline significantly. The introduction of adaptive projection guidance to replace traditional classifier-free guidance is a noteworthy advancement that enhances generation quality. The paper also addresses a critical training-inference mismatch, showcasing a thoughtful approach to improving model performance. Overall, the methodology is robust and well-structured, with clear innovations that set it apart from existing models.
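For context on the guidance change, plain classifier-free guidance extrapolates along the difference between conditional and unconditional predictions, while projection-style variants split that difference into components parallel and orthogonal to the conditional prediction and rescale them separately. The sketch below shows both for flattened 1-D predictions; the paper's adaptive projection guidance may be formulated differently.

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    """Plain classifier-free guidance: extrapolate along the cond/uncond gap."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def projection_guidance(eps_cond, eps_uncond, w_parallel, w_orth):
    """Generic projection-style guidance for flattened 1-D predictions:
    rescale the parallel and orthogonal parts of the gap separately.
    Illustrative only; not the paper's exact adaptive projection guidance."""
    diff = eps_cond - eps_uncond
    unit = eps_cond / (np.linalg.norm(eps_cond) + 1e-8)
    parallel = np.dot(diff, unit) * unit   # component along eps_cond
    orthogonal = diff - parallel           # remaining component
    return eps_cond + w_parallel * parallel + w_orth * orthogonal
```

Setting both weights equal recovers ordinary CFG behavior; decoupling them is what gives the projection form its extra control over generation quality.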
The experimental evaluation is thorough, with the authors providing comprehensive results that demonstrate the effectiveness of LongCat-AudioDiT. The paper reports state-of-the-art performance on the Seed benchmark for zero-shot voice cloning, with significant improvements in speaker similarity scores. The use of ablation studies to validate the proposed modules adds credibility to the findings. However, the absence of high-quality human-annotated datasets may limit the generalizability of the results, although the authors mitigate this by achieving competitive intelligibility.
The authors mention that code and model weights are released, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed implementation guidelines and hyperparameter settings to facilitate easier replication of the results by other researchers.
One limitation identified is the reliance on a single benchmark (Seed) for evaluation, which may not fully capture the model's performance across diverse TTS tasks. Additionally, the findings regarding the Wav-VAE's reconstruction fidelity not correlating with TTS performance could indicate a need for further exploration into the underlying mechanisms affecting performance.
The potential applications of LongCat-AudioDiT are significant, particularly in areas requiring high-fidelity speech synthesis, such as virtual assistants, audiobooks, and voice cloning technologies. The model's ability to operate without complex multi-stage training pipelines could democratize access to high-quality TTS systems, fostering innovation in various industries.
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA) and a corresponding evaluation suite, AHA-Eval, comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose a 120K QA post-alignment dataset, AHA-Guard, which successfully reduces attack success rates by up to 49%.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park
The paper introduces Audio Hallucination Attacks (AHA), a framework for evaluating audio hallucinations in LALMs through innovative query-based and audio-based attack methodologies. This work is significant as it not only identifies critical vulnerabilities in state-of-the-art models but also proposes effective mitigation strategies, paving the way for more reliable audio-language models in real-world applications.
The methodology is robust, introducing a novel attack suite (AHA-Eval) that effectively evaluates the reliability of Large Audio Language Models (LALMs) through a systematic approach. The dual focus on query-based and audio-based attacks is particularly insightful, allowing for a comprehensive assessment of model vulnerabilities. The data curation and filtering process is well-structured, ensuring high-quality inputs for the evaluation. The use of LLMs for generating hallucinated sounds and the distinction between explicit and implicit queries are innovative contributions that enhance the depth of the analysis.
The experimental setup is thorough, evaluating multiple state-of-the-art LALMs and providing clear metrics for attack success rates. The results demonstrate significant vulnerabilities in these models, with high ASR values indicating a pressing need for improved grounding mechanisms. The comparison of mitigation strategies, particularly the effectiveness of AHA-Guard, is a valuable addition that highlights practical implications for enhancing model reliability.
The paper provides sufficient detail regarding the experimental setup, including model selection and training procedures, which aids reproducibility. However, the absence of publicly accessible datasets or code limits the ease with which other researchers can replicate the study. Future work should consider releasing the datasets and methodologies used for generating AHA-Eval and AHA-Guard.
One limitation is the reliance on specific LALMs for generating hallucinated sounds, which may not generalize across all audio-language models. Additionally, while the evaluation metrics are well-defined, the subjective nature of audio perception may introduce variability in human assessments that are not fully addressed. The paper also does not explore the long-term implications of these vulnerabilities in real-world applications.
The findings have significant implications for the deployment of LALMs in practical applications, particularly in fields such as automated transcription, audio description, and interactive voice response systems. By highlighting the reliability gaps in these models, the research encourages the development of more robust audio grounding techniques, ultimately enhancing the safety and trustworthiness of AI systems in audio processing.
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate these problems, we propose \textbf{FastTurn}, a unified framework for low-latency and robust turn detection. To reduce latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
Primary: QualiaLabs
All Institutions: QualiaLabs
FastTurn presents a unified framework for low-latency and robust turn detection in full-duplex dialogue systems. The technical contributions, particularly in integrating acoustic and semantic cues, represent a meaningful advancement in the field of audio processing and dialogue systems, with potential applications in various real-time communication scenarios.
The methodology presented in FastTurn is innovative, combining streaming CTC decoding with acoustic features to enhance turn detection in full-duplex dialogue systems. The architecture is well-structured, comprising three main components that progressively integrate semantic and acoustic cues. The use of a four-stage training pipeline is commendable, as it stabilizes the optimization process and aligns speech and text modalities effectively. However, the reliance on CTC for initial transcription raises concerns about potential error propagation in noisy environments.
The experiments are thorough, utilizing a diverse set of datasets and a comprehensive evaluation framework. The introduction of a new test set with realistic human dialogue scenarios is a significant contribution, allowing for better assessment of the model's performance in practical applications. The results demonstrate that FastTurn outperforms existing baselines in terms of accuracy and latency, underscoring its effectiveness. However, the paper could benefit from additional comparisons with more recent models in the field to contextualize its performance.
The paper provides sufficient details regarding the model architecture, training strategy, and evaluation metrics, which aids in reproducibility. However, the absence of publicly available code or a demo could hinder independent verification of results. Clear instructions for reproducing the experiments would enhance the paper's impact.
One limitation is the potential sensitivity of the model to CTC errors, especially in overlapping speech scenarios. Additionally, while the model shows robustness in various conditions, the performance on English datasets did not meet expectations, indicating a need for further optimization. The paper also does not address the computational resources required for training and inference, which could be a barrier for broader adoption.
The FastTurn framework has significant implications for real-time spoken dialogue systems, particularly in applications requiring low-latency interaction, such as virtual assistants and customer service bots. By improving turn detection, it can enhance user experience and facilitate more natural conversations. The release of the new dataset also opens avenues for future research in dialogue systems, potentially leading to advancements in multimodal interaction technologies.
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the waveform via a neural vocoder, along with a predictive branch that performs spectrogram-domain enhancement, providing complementary cues. Outputs from both branches are fused by a post-processing module, which also performs bandwidth extension to generate the enhanced waveform at 48 kHz, later downsampled to the original sampling rate. This generative-predictive fusion improves robustness and perceptual quality, achieving top performance in the blind-test phase and ranking 1st in the objective evaluation. Audio examples are available at https://xiaobin-rong.github.io/gap-urgenet_demo.
Primary: Nanjing University
All Institutions: Nanjing University
The main contribution of this paper is the introduction of GAP-URGENet, a novel generative-predictive fusion framework for universal speech enhancement that demonstrates state-of-the-art performance in the ICASSP 2026 URGENT Challenge. This work significantly advances the field of speech enhancement by effectively integrating generative and predictive methodologies, providing a comprehensive solution to improve speech quality across diverse conditions.
The methodology presented in GAP-URGENet is innovative, combining generative and predictive models to enhance speech quality effectively. The generative branch focuses on full-stack speech restoration using self-supervised learning, while the predictive branch enhances the spectrogram domain, allowing for complementary improvements. The fusion of outputs from both branches through a post-processing module is a significant contribution, particularly the bandwidth extension to achieve high-quality waveforms. The architecture is well-structured, leveraging existing models like DeWavLM and TF-GridNet, which indicates a thoughtful integration of prior work with novel enhancements.
The experimental setup is robust, utilizing comprehensive datasets from the URGENT Challenge, which enhances the credibility of the results. The paper reports substantial improvements over baseline models, with detailed metrics provided for various objective evaluations (DNSMOS, NISQA, UTMOS, etc.), showcasing the effectiveness of the proposed framework. The results indicate that GAP-URGENet achieves superior performance in both objective and subjective evaluations, validating the proposed approach.
The paper provides sufficient details regarding the architecture, training process, and datasets used, which facilitates reproducibility. However, the absence of a public code repository limits the ease of reproduction for other researchers. Including a link to the code or detailed implementation instructions would enhance reproducibility significantly.
While the paper demonstrates impressive results, it does not address potential limitations such as the computational cost of the model, the need for extensive training data, or the model's performance in real-world applications outside the challenge context. Additionally, the reliance on specific architectures may limit generalizability to other tasks or domains.
The implications of this research extend to various applications in speech enhancement, including telecommunications, assistive technologies for the hearing impaired, and voice recognition systems. By improving speech quality in challenging conditions, the framework can enhance user experience across multiple platforms, making it a valuable contribution to the field of audio processing.
Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet's interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.
Primary: National University of Singapore
All Institutions: National University of Singapore, Shenzhen Loop Area Institute, Nanjing University, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong
The paper presents PhiNet, a self-interpretable speaker verification network that enhances transparency in decision-making by leveraging phonetic evidence. This contribution is significant as it addresses the critical need for interpretability in automatic speaker verification systems, bridging the gap between ASV and forensic speaker comparison.
The proposed PhiNet framework introduces a novel approach to speaker verification by integrating phonetic interpretability into the decision-making process. The architecture is designed to provide both local and global interpretability, allowing users to understand the contribution of individual phonemes to the verification score. This is achieved through a phonetic trait extractor and a decision layer that weights phonetic contributions based on their distinctiveness. The methodology is well-structured, leveraging existing neural network techniques while innovatively adapting them to enhance interpretability in ASV systems.
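The weighted-aggregation idea described above can be pictured as follows: per-phoneme similarities between enrollment and test embeddings are combined with distinctiveness weights. The phoneme labels, embeddings, and weighting scheme below are invented for illustration; PhiNet's decision layer may differ.

```python
import math

# Hedged illustration of phoneme-weighted verification scoring. All names,
# embeddings, and weights are illustrative, not PhiNet's actual parameters.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def phonetic_score(enroll, test, weights):
    """enroll/test: {phoneme: embedding}; weights: {phoneme: distinctiveness}."""
    shared = enroll.keys() & test.keys() & weights.keys()
    total = sum(weights[p] for p in shared)
    return sum(weights[p] * cosine(enroll[p], test[p]) for p in shared) / total
```

Because the final score is an explicit weighted sum, each phoneme's contribution can be read off directly, which is exactly what supports the local interpretability claims.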
The experiments conducted on benchmark datasets such as VoxCeleb, SITW, and LibriSpeech demonstrate that PhiNet achieves competitive performance compared to traditional black-box ASV models. The evaluation metrics, including equal error rate (EER) and minimum detection cost function (minDCF), provide a solid basis for performance comparison. Additionally, the paper includes qualitative assessments of interpretability through visualizations and leave-$i$th-phoneme-out experiments, which substantiate the claims of enhanced interpretability.
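For reference, the EER metric cited above can be computed from raw trial scores with a simple threshold sweep, as sketched below; toolkit implementations typically interpolate the ROC curve instead.

```python
# Threshold-sweep sketch of equal error rate (EER) from raw trial scores.
# labels: 1 = target (same speaker), 0 = non-target. EER is the operating
# point where false-accept and false-reject rates coincide.

def eer(scores, labels):
    n_tar = sum(labels)
    n_non = len(labels) - n_tar
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(scores)):
        far = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t) / n_non
        frr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t) / n_tar
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```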
The authors provide a GitHub repository with the code for PhiNet, which is essential for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameter settings and data preprocessing steps, to facilitate easier reproduction of results by other researchers.
One limitation is the potential for cognitive bias in phoneme weighting, which could affect the model's interpretability and robustness. Additionally, while the framework shows promise, the reliance on phonetic traits may limit its generalizability to diverse speaker populations or languages not represented in the training data. The paper also does not address the computational complexity of the model, which may hinder real-time applications.
The integration of phonetic interpretability into ASV systems has significant implications for high-accountability applications, such as forensic analysis and security. By providing interpretable results, PhiNet can enhance user trust in automated systems and facilitate error tracing in speaker verification tasks. This work could pave the way for more transparent AI systems in sensitive applications, contributing positively to the field of machine learning and audio processing.
Recent ECG--language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical leads or temporal segments may be missing due to electrode detachment, motion artifacts, or signal corruption, causing severe degradation of cross-modal semantic alignment. In this paper, we propose \textbf{SCAR}, a robust ECG--language pretraining framework for \textbf{S}emantic \textbf{C}ompensation via \textbf{A}dversarial \textbf{R}emoval. SCAR improves robustness by explicitly training the model to remain semantically aligned under semantically critical missingness and to recover diagnostic meaning from the remaining visible evidence. Specifically, we introduce a differentiable adversarial masker to remove the most alignment-critical spatio-temporal ECG tokens during training, forcing the ECG encoder to learn representations that remain semantically aligned with clinical text even when primary diagnostic evidence is missing. Under such adversarial corruption, we equip the ECG encoder with a semantically supervised adaptive selector that learns to reweight the remaining visible tokens and compensate with secondary yet diagnostically informative morphological cues. To evaluate robustness beyond classification accuracy, we further introduce Counterfactual Missingness Resolution Score (CMRS), which quantifies how well features preserve diagnostic semantics under missingness. Experiments on $6$ datasets show that SCAR consistently improves semantic robustness under joint lead and temporal missingness, with particularly clear advantages in harder cases where primary diagnostic evidence is unavailable, while also yielding stronger linear-probing transferability.
Primary: University of Science and Technology Beijing
All Institutions: School of Intelligence Science and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing
The paper presents SCAR, a robust ECG--language pretraining framework that enhances zero-shot ECG diagnosis by explicitly addressing the challenges posed by missing data through innovative adversarial techniques. The methodology and results contribute meaningfully to the field of machine learning in healthcare, particularly in improving the robustness of diagnostic models under real-world conditions.
The proposed SCAR framework introduces a novel approach to address the challenge of missing ECG data during zero-shot diagnosis by employing adversarial masking to force the model to learn robust representations. The methodology is well-structured, utilizing a differentiable adversarial masker and a semantically supervised adaptive selector, which collectively enhance the model's ability to maintain semantic alignment even under partial observation. The introduction of the Counterfactual Missingness Resolution Score (CMRS) as a metric for evaluating robustness adds significant value to the methodology, allowing for a more nuanced assessment of performance under missingness.
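One common way to make token removal differentiable, as the adversarial masker requires, is a Gumbel relaxation of the binary keep/drop decision. The sketch below illustrates that generic pattern; it is not the paper's exact parameterization.

```python
import numpy as np

# Gumbel-relaxation sketch: each token gets a removal score; noisy scores
# are squashed into a soft keep-probability, so "dropping" the most
# alignment-critical tokens stays differentiable with respect to the scores.
# Generic pattern, not the paper's exact masker.

def soft_remove(tokens, removal_scores, tau=0.5, seed=None):
    """tokens: (T, D); removal_scores: (T,). Higher score -> more removed."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=removal_scores.shape)
    gumbel = -np.log(-np.log(u))
    keep_prob = 1.0 / (1.0 + np.exp((removal_scores + gumbel) / tau))
    return tokens * keep_prob[:, None], keep_prob
```

Lowering the temperature `tau` pushes the soft mask toward hard 0/1 decisions while keeping the forward pass differentiable, which is what allows the masker to be trained adversarially against the alignment objective.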
The experiments are comprehensive, utilizing six datasets to validate the effectiveness of SCAR against existing baselines. The results demonstrate significant improvements in both zero-shot classification performance and robustness under various missingness scenarios, particularly highlighting the advantages of the proposed methods in harder cases where primary diagnostic evidence is absent. The ablation studies effectively illustrate the contributions of each component of the framework, reinforcing the robustness of the findings.
The paper provides sufficient implementation details, including training protocols, dataset descriptions, and evaluation metrics, which support reproducibility. However, the absence of a publicly available code repository or demo limits the ease of reproduction for external researchers.
One limitation is the reliance on specific datasets for training and evaluation, which may affect the generalizability of the results to other ECG datasets or clinical settings. Additionally, while the proposed methods show improvements, the paper does not extensively discuss the computational costs associated with the adversarial masking and adaptive selection processes during training.
The implications of this work are significant for clinical practice, as it addresses a common issue in ECG analysis—missing data due to various artifacts. The ability to maintain diagnostic accuracy under such conditions can enhance the reliability of ECG-based diagnoses in real-world scenarios, potentially leading to better patient outcomes. The framework could also inspire further research in robust multimodal learning across other medical domains.
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target speech length. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese, T5Gemma-TTS achieves a statistically significant speaker-similarity gain on Japanese over XTTSv2 (0.677 vs. 0.622; non-overlapping 95% confidence intervals) and the highest numerical Korean speaker similarity (0.747) despite Korean not being included in training, although this margin over XTTSv2 (0.741) is not statistically conclusive. It also attains the lowest numerical Japanese character error rate among five baselines (0.126), though this ranking should be interpreted cautiously because of partial confidence-interval overlap with Kokoro. English results on LibriSpeech should be viewed as an upper-bound estimate because LibriHeavy is a superset of LibriSpeech. Using the same checkpoint, disabling PM-RoPE at inference causes near-complete synthesis failure: CER degrades from 0.129 to 0.982 and duration accuracy drops from 79% to 46%. Code and weights are available at https://github.com/Aratako/T5Gemma-TTS.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Graduate School of Engineering, Third Intelligence, Matsuo Institute, Department of Technology Management for Innovation
The main contribution of this work is the development of T5Gemma-TTS, a novel encoder-decoder model that enhances multilingual zero-shot text-to-speech synthesis through innovative architectural improvements and rigorous experimental validation. This research represents a meaningful advancement in the field of speech synthesis, addressing key challenges and setting a foundation for future exploration in multilingual and cross-lingual applications.
The paper introduces T5Gemma-TTS, an encoder-decoder model that effectively addresses the limitations of autoregressive decoder-only architectures by maintaining persistent text conditioning through cross-attention mechanisms. The integration of Progress-Monitoring Rotary Position Embedding (PM-RoPE) is a significant methodological advancement, allowing for improved duration control during speech synthesis. The model's architecture is well-founded on the T5Gemma pretrained backbone, which enhances its linguistic capabilities without requiring phoneme conversion. The methodology is robust and clearly articulated, demonstrating a thoughtful approach to overcoming existing challenges in zero-shot TTS.
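The duration-control idea behind PM-RoPE lends itself to a small sketch. The paper's exact formulation is not reproduced here; this toy assumes the rotation angle is driven by normalized decoding progress (step over target length), with `scale` a hypothetical constant, so equal fractions of short and long targets share an encoding:

```python
import numpy as np

def rope_rotate(x, angles):
    """Rotate consecutive (even, odd) feature pairs of a 1-D vector x."""
    x1, x2 = x[0::2], x[1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def pm_rope(x, step, total_steps, scale=100.0, base=10000.0):
    """Progress-monitoring rotary embedding (sketch): the rotation angle is a
    function of normalized progress step/total_steps rather than the raw
    decoding index, so the same fraction of a short and a long target gets
    the same positional encoding, a progress signal the decoder can track."""
    d = x.shape[0]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    progress = step / total_steps            # normalized position in [0, 1]
    return rope_rotate(x, progress * scale * inv_freq)
```

Because the angle depends on step/total_steps, the decoder receives an explicit "how far along am I" signal, which is consistent with the reported collapse when the signal is disabled at inference.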
The experimental evaluation is comprehensive, involving a substantial training dataset of 170,000 hours of multilingual speech. The results indicate statistically significant improvements in speaker similarity and character error rates compared to existing models. The paper provides detailed comparisons against multiple baselines, showcasing the model's effectiveness across different languages, including Japanese, Chinese, and Korean. The use of confidence intervals adds rigor to the statistical claims, although some results should be interpreted cautiously due to overlapping intervals.
The authors have made the model weights and code publicly available, which is a positive step towards reproducibility. However, the paper would benefit from more detailed implementation specifics and hyperparameter settings to facilitate easier replication of results by other researchers.
The paper acknowledges several limitations, including higher word error rates on unseen European languages and a real-time factor that may not meet the demands of real-time applications. Additionally, the authors note that the model's performance on certain metrics may be influenced by the codec's limitations, indicating areas for future improvement.
The potential for misuse of zero-shot voice cloning technology is a significant concern, as highlighted by the authors. They emphasize the need for ethical considerations and safeguards in deploying such technologies, which is crucial given the implications for privacy and security. The authors advocate for responsible use and further research into detection methods for synthetic speech.
Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initially validated on Italian, to investigate these dimensions using a Chinese Mandarin dataset with Electroencephalography (EEG) recordings. We systematically fuse read speech with spontaneous speech across different emotional valences (positive, neutral, negative) to investigate whether emotional arousal is a more critical factor than valence polarity in enhancing detection performance in speech. Additionally, we establish the first neurophysiological validation for a speech-based depression model by correlating its predictions with neural oscillatory patterns during emotional face processing. Our results demonstrate strong cross-linguistic generalizability of the CDMA framework, achieving state-of-the-art performance (F1-score up to 89.6%) on the Chinese dataset, comparable to the previous Italian validation. Critically, emotionally valenced speech (both positive and negative) significantly outperformed neutral speech, and the comparable performance between positive and negative tasks supports the emotional arousal hypothesis. Most importantly, EEG analysis revealed significant correlations between the model's speech-derived depression estimates and neural oscillatory patterns (theta and alpha bands), demonstrating alignment with established neural markers of emotional dysregulation in depression. This alignment, combined with the model's cross-linguistic robustness, not only supports the CDMA framework as a universally applicable, neurobiologically validated strategy but also establishes a novel paradigm for the neurophysiological validation of computational mental health models.
Primary: Zhejiang University
All Institutions: Zhejiang University, Università della Campania “Luigi Vanvitelli”, UKRI, EPSRC, National Natural Science Foundation of China, State Key Laboratory of Brain-Machine Intelligence
This study provides a novel approach to depression detection by integrating speech analysis and neurophysiological validation, demonstrating the critical role of emotional arousal over valence in enhancing detection performance. The methodology and results contribute significantly to the field of computational mental health, offering a framework that is both innovative and applicable across linguistic boundaries.
The paper employs a robust methodology by extending the Cross-Data Multilevel Attention (CDMA) framework to a new linguistic context (Chinese Mandarin) and integrating EEG data for neurophysiological validation. The fusion of read and spontaneous speech across emotional valences is a significant methodological advancement, allowing for a nuanced understanding of emotional arousal in depression detection. The attention mechanisms used are well-justified and effectively enhance the model's performance.
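The neurophysiological validation step, correlating speech-derived depression estimates with EEG band power, reduces to a standard Pearson correlation. A minimal sketch with made-up per-subject numbers (the real analysis uses the MODMA recordings):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between model scores and an EEG band-power series."""
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

# Hypothetical per-subject values: speech-derived depression estimates vs.
# frontal theta-band power during the emotional face-processing task.
scores = np.array([0.2, 0.5, 0.7, 0.9, 0.4])
theta = np.array([1.1, 1.6, 2.0, 2.4, 1.4])
r = pearson_r(scores, theta)
```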
The experiments are comprehensive, utilizing a well-defined dataset (MODMA) and employing rigorous cross-validation techniques. The reported F1-scores (up to 89.6%) demonstrate state-of-the-art performance, and the inclusion of EEG analysis adds a layer of validation that strengthens the findings. The statistical comparisons between different emotional contexts and their impact on detection performance are well-articulated.
The paper provides detailed descriptions of the data acquisition, preprocessing, and model training processes, which supports reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
The study acknowledges limitations such as the modest sample size for EEG recordings and the correlational nature of the findings, which precludes causal inferences. Additionally, the lack of information regarding participants' medication status and comorbidities could influence the results.
The findings have significant implications for clinical practices in mental health, particularly in developing objective diagnostic tools for depression that can be applied across different languages. The neurophysiological validation of speech-based models could pave the way for more interpretable and trustworthy AI systems in mental health assessment.
While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary multi-objective score fusion framework that jointly minimizes detection error and system complexity. We explore two encodings optimized by NSGA-II: binary-coded detector selection for score averaging and a real-valued scheme that optimizes detector weights for a weighted sum. Experiments on the ASVspoof 5 dataset with 36 SSL-based detectors show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines. The real-valued variant achieves 2.37% EER (0.0684 minDCF) and identifies configurations that match state-of-the-art performance while significantly reducing system complexity, requiring only half the parameters. Our method also provides a diverse set of trade-off solutions, enabling deployment choices that balance accuracy and computational cost.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Czech Science Foundation, e-INFRA CZ project, Ministry of Education, Youth and Sports of the Czech Republic
The paper presents a novel multi-objective evolutionary framework for fusing deepfake speech detectors, achieving state-of-the-art performance while significantly reducing system complexity. This work is a substantial contribution to the field of audio machine learning, providing a comprehensive approach to tackle the challenges posed by deepfake technologies.
The paper introduces an innovative multi-objective evolutionary framework for fusing deepfake speech detectors using NSGA-II, addressing the critical balance between detection accuracy and system complexity. It explores two encoding strategies—binary-coded detector selection and real-valued weight optimization—demonstrating a systematic approach to ensemble learning that is both effective and efficient. The methodology is well-structured, leveraging evolutionary algorithms to navigate the trade-offs inherent in deepfake detection.
The authors conduct extensive experiments on the ASVspoof 5 dataset, utilizing a diverse pool of 36 SSL-based detectors. The results are robust, showcasing the superiority of the proposed methods over traditional fusion techniques, including simple averaging and logistic regression. The achieved EER of 2.37% indicates a significant performance improvement while reducing system complexity, underscoring the effectiveness of the proposed approach.
The paper provides detailed implementation information, including parameter settings, computational resources, and the use of a GitHub repository for code access. The thoroughness of the experimental setup and the availability of the code enhance the reproducibility of the results, allowing other researchers to validate and build upon this work.
While the proposed method is effective, it is limited by its reliance on score-level fusion, which may overlook deeper interactions that could be exploited through joint fine-tuning of the models. Additionally, the performance is constrained by the quality of the underlying detectors, suggesting that optimizing these base models could further enhance the fusion outcomes.
This research has significant implications for the field of deepfake detection, particularly in enhancing the robustness and efficiency of voice biometric systems. The ability to balance performance and complexity in detector fusion can lead to more practical applications in security and authentication, addressing the growing concerns surrounding deepfake technology.
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an interactive mobile system for real-time soundscape mediation that selectively attenuates bothersome sounds while preserving desired audio. Sona is built on a target-conditioned neural pipeline that supports simultaneous attenuation of multiple overlapping sound sources, overcoming the single-target limitation of prior systems. It runs in real time on-device and supports user-extensible sound classes through in-situ audio examples, without retraining. Sona is informed by a formative study with 68 noise-sensitive individuals. Through technical benchmarking and an in-situ study with 10 participants, we show that Sona achieves low-latency, multi-target attenuation suitable for live listening, and enables meaningful reductions in bothersome sounds while maintaining awareness of surroundings. These results point toward a new class of personal AI systems that support comfort and social participation by mediating real-world acoustic environments.
Primary: University of Michigan
All Institutions: University of Michigan, University of California, Irvine
The main contribution of this paper is the development of Sona, an interactive mobile system that enables real-time, multi-target sound attenuation for individuals with noise sensitivity. This work represents a meaningful advancement in audio processing and accessibility technology, with the potential to significantly improve the daily experiences of users in noisy environments.
The methodology employed in Sona is innovative, utilizing a target-conditioned neural pipeline that allows for real-time attenuation of multiple overlapping sound sources. This is a significant advancement over existing systems that typically focus on single-target noise cancellation. The incorporation of user-extensible sound classes through in-situ examples without the need for retraining is a notable feature that enhances user personalization and adaptability. The formative study involving 68 noise-sensitive individuals provides a solid foundation for understanding user needs and preferences, which is crucial for the design of the system.
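The selective-mediation contract, attenuate user-selected classes and pass everything else, can be illustrated with per-frame gating. This is a deliberate simplification: Sona itself uses a target-conditioned neural separation pipeline, not threshold gating, and the numbers below are invented.

```python
import numpy as np

def mediate(frames, probs, targets, atten_db=-20.0):
    """Selective soundscape mediation (sketch): attenuate only the frames in
    which any user-selected bothersome class is detected; pass the rest.
    frames: (T, n) audio; probs: (T, C) per-frame class probabilities;
    targets: indices of the classes the user asked to suppress."""
    gain = 10.0 ** (atten_db / 20.0)                 # -20 dB -> x0.1
    active = probs[:, targets].max(axis=1) > 0.5     # any target class active?
    return frames * np.where(active, gain, 1.0)[:, None]

# Hypothetical 4-frame clip with two classes; the user suppresses class 0.
frames = np.ones((4, 2))
probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
out = mediate(frames, probs, targets=[0])
```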
The experimental evaluation is robust, featuring both technical benchmarking and an in-situ study with 10 participants. The results demonstrate low-latency performance and effective sound attenuation while preserving desired audio, which is critical for maintaining situational awareness. The use of subjective measures to assess user comfort and soundscape mediation effectiveness adds credibility to the findings. However, the small sample size in the in-situ study may limit the generalizability of the results.
The paper does not provide explicit details regarding the implementation or access to the code, which raises concerns about reproducibility. While the methodology is described, without a publicly available implementation or detailed algorithmic descriptions, it may be challenging for other researchers to replicate the results or build upon this work.
One limitation is the small participant size in the in-situ study, which may not adequately represent the broader population of noise-sensitive individuals. Additionally, while the system allows for user-defined sound classes, the effectiveness of the system in highly dynamic or complex sound environments remains to be fully evaluated. There may also be challenges in the real-world application of the technology, such as varying user preferences and environmental conditions.
The potential applications of Sona are significant, particularly for individuals with noise sensitivity, including those with neurodivergent conditions. By enabling users to manage their auditory environments, Sona could enhance comfort and social participation, leading to improved quality of life. The implications extend beyond personal use, as the technology could be adapted for various settings, including workplaces, educational environments, and public spaces.
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
Primary: unknown
All Institutions: unknown
FineLAP presents a novel training paradigm that effectively combines heterogeneous supervision for fine-grained audio-language pretraining. The comprehensive methodology and robust experimental validation position it as a significant contribution to the field of audio understanding, with potential applications across diverse domains.
The methodology presented in FineLAP is innovative, addressing the challenge of heterogeneous supervision in audio-language models. The introduction of a dual-stream sigmoid loss and a decoupled audio projector allows for effective learning from both clip- and frame-level annotations. This approach is well-justified, as it leverages the strengths of existing models while introducing novel components that enhance performance across various tasks. The use of cluster-based sampling for negative phrases is particularly noteworthy, as it mitigates the scarcity of frame-level annotations and improves the model's ability to generalize.
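A plausible reading of the sigmoid objective is the SigLIP-style pairwise loss sketched below, in which each audio-text pair is an independent binary decision; the paper's dual-stream variant would apply such a loss at both clip and frame granularity. The temperature and bias values here are illustrative, not the paper's.

```python
import numpy as np

def pairwise_sigmoid_loss(audio_emb, text_emb, temperature=10.0, bias=-5.0):
    """SigLIP-style pairwise sigmoid loss (sketch): every audio-text pair is
    scored independently; matched pairs on the diagonal are positives."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = temperature * (a @ t.T) + bias
    n = len(a)
    labels = 2.0 * np.eye(n) - 1.0                   # +1 diagonal, -1 elsewhere
    # -log sigmoid(label * logit), averaged over all n*n pairs
    return float(np.mean(np.logaddexp(0.0, -labels * logits)))
```

Unlike a softmax contrastive loss, nothing here normalizes over the batch, which is what makes it natural to mix clip-level and frame-level positives in one objective.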
The experiments conducted are extensive and demonstrate the effectiveness of FineLAP across multiple audio understanding tasks, achieving state-of-the-art results. The evaluation includes a variety of benchmarks, and the ablation studies provide clear insights into the contributions of each component of the model. The results are compelling, showing significant improvements over existing methods, particularly in sound event detection and audio-text retrieval.
The paper provides sufficient implementation details, including training parameters and dataset descriptions, which are crucial for reproducibility. The authors also commit to releasing the code and dataset, which enhances the potential for other researchers to replicate and build upon their work.
Despite its strengths, FineLAP has limitations, such as its inability to handle variable-length audio inputs, which restricts its applicability in scenarios requiring long-form audio processing. Additionally, the focus on sound event detection may overlook other temporally grounded tasks, indicating areas for future exploration.
The advancements made in FineLAP have significant implications for audio understanding and multimodal learning, particularly in applications such as automated audio captioning, sound event detection, and audio editing. The model's ability to leverage heterogeneous data could lead to more robust and flexible audio-language systems, potentially benefiting various industries, including entertainment, accessibility, and security.
Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations, without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. On PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
Primary: College of Innovation and Technology, University of Michigan-Flint
All Institutions: College of Innovation and Technology, University of Michigan-Flint
The main contribution of this paper is the introduction of TRACE, a training-free framework for detecting partial audio deepfakes by analyzing the dynamics of speech foundation model embeddings. This work represents a significant advancement in audio forensics, offering a novel methodology that challenges traditional supervised detection approaches and opens new avenues for research in the field.
The proposed TRACE framework introduces a novel approach to detecting partial audio deepfakes without the need for training or labeled data. By analyzing the first-order dynamics of frozen speech foundation model representations, the methodology cleverly leverages the inherent properties of genuine speech versus manipulated audio. This is a significant departure from traditional supervised methods, showcasing a fresh perspective on audio forensics. However, the paper could benefit from a more detailed explanation of the embedding trajectory analysis and its computational efficiency.
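One concrete way to read the embedding-trajectory analysis, offered here as an assumption rather than the paper's exact procedure, is to score an utterance by its largest adjacent-frame cosine drop:

```python
import numpy as np

def trace_score(emb):
    """Training-free splice score (sketch of the idea): genuine speech yields
    smooth frame-level embedding trajectories, so the largest adjacent-frame
    cosine drop flags a splice boundary.
    emb: (T, d) frame embeddings from a frozen speech foundation model."""
    e = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos = np.sum(e[1:] * e[:-1], axis=1)   # first-order (adjacent-frame) similarity
    return float((1.0 - cos).max())        # peak trajectory disruption

# Toy trajectory: a slowly rotating unit vector, with an abrupt 90-degree
# rotation spliced in halfway through.
t = np.linspace(0.0, 0.5, 20)
smooth = np.stack([np.cos(t), np.sin(t)], axis=1)
spliced = smooth.copy()
spliced[10:] = spliced[10:] @ np.array([[0.0, -1.0], [1.0, 0.0]])
```

A real pipeline would threshold or calibrate this score per foundation model; the point of the sketch is that no learned parameters are involved.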
The experiments are well-structured, evaluating TRACE on four benchmarks across two languages and using six different speech foundation models. The results demonstrate competitive performance against fine-tuned supervised baselines, particularly in challenging scenarios like LlamaPartialSpoof. However, the paper lacks comprehensive details on the datasets used, such as their sizes and the specific characteristics of the audio samples, which would enhance the understanding of the evaluation's robustness.
The paper does not provide sufficient details regarding the implementation of TRACE, such as the specific configurations of the speech foundation models used or the exact procedures for embedding trajectory analysis. This lack of detail may hinder reproducibility, as other researchers may struggle to replicate the results without clear guidelines or code availability.
One limitation is the reliance on the performance of existing speech foundation models, which may vary in quality and robustness. Additionally, while the training-free approach is innovative, it may not generalize well to all forms of audio manipulation beyond the tested benchmarks. The paper also does not address potential adversarial attacks against the proposed detection method.
The implications of TRACE are significant for the field of audio forensics, particularly in combating misinformation and enhancing the integrity of audio content. The training-free nature of the method could facilitate its adoption in real-world applications where rapid detection is critical, such as in media verification and security. However, further exploration of its applicability across diverse audio manipulation techniques is necessary.
Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.
Primary: Sogang University
All Institutions: Sogang University, Institute of Information and Communications Technology Planning and Evaluation (IITP), National Research Foundation of Korea (NRF)
The main contribution of this work is the introduction of SR-CorrNet, a novel asymmetric encoder-decoder framework that improves speech separation in complex acoustic environments by leveraging spatio-spectro-temporal correlations and a dynamic split module. This research significantly advances the state-of-the-art in speech separation, providing a robust solution for real-world applications.
The proposed SR-CorrNet framework introduces a novel asymmetric encoder-decoder architecture that effectively addresses the limitations of late-split designs in speech separation tasks. By employing a separation-reconstruction strategy and a correlation-to-filter paradigm, the methodology enhances speaker discrimination and robustness in challenging acoustic environments. The incorporation of spatio-spectro-temporal correlations as input features is a significant advancement, allowing the model to leverage temporal and spatial dependencies more effectively. The dynamic split module further enhances the model's adaptability to varying speaker counts, which is crucial for real-world applications.
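As a sketch of what correlation-based input features might look like (the shapes and normalization are assumptions; the paper's spatio-spectro-temporal correlations are richer), one can compute a time-averaged inter-channel correlation matrix per frequency bin from a multichannel STFT:

```python
import numpy as np

def spatial_correlations(stft):
    """Inter-channel correlation features (assumed form): for a multichannel
    STFT of shape (M, F, T), return the unit-diagonal spatial correlation
    matrix per frequency bin, averaged over time frames."""
    M, F, T = stft.shape
    feats = np.empty((F, M, M), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]                     # (M, T) complex frames
        R = X @ X.conj().T / T                # spatial covariance
        d = np.sqrt(np.real(np.diag(R)))
        feats[f] = R / np.outer(d, d)         # normalize to unit diagonal
    return feats

# Two identical channels should be perfectly correlated at every frequency.
rng = np.random.default_rng(0)
mono = rng.standard_normal((3, 8)) + 1j * rng.standard_normal((3, 8))  # (F, T)
stft = np.stack([mono, mono])                 # (M=2, F=3, T=8)
```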
The experiments conducted on multiple datasets (WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS) demonstrate the effectiveness of the proposed method across different conditions, including anechoic, noisy-reverberant, and real-recorded environments. The results show consistent improvements over existing models, indicating the robustness and generalizability of SR-CorrNet. The use of objective metrics like SI-SNRi and SDRi provides a solid basis for evaluating performance, although subjective evaluations could further strengthen the findings.
The paper provides detailed descriptions of the model architecture, training procedures, and datasets used, which are essential for reproducibility. However, the absence of a public code repository or demo URL limits the ability for other researchers to replicate the experiments directly. Clearer documentation or supplementary materials could enhance reproducibility.
One limitation of the study is the lack of subjective evaluation metrics, such as human listening tests, which could provide insights into the perceptual quality of the separated audio. Additionally, while the dynamic split module shows promise, its performance in highly variable acoustic environments needs further validation. The model's complexity may also pose challenges in real-time applications.
The advancements in speech separation technology have significant implications for various applications, including automatic speech recognition, hearing aids, and communication systems in noisy environments. The ability to effectively separate overlapping speech can enhance the user experience in real-world scenarios, making this research highly relevant to both academia and industry.
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), together with an evaluation suite, AHA-Eval, comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap hidden by standard benchmark performance. To mitigate this, we propose AHA-Guard, a 120K-QA post-alignment dataset that reduces attack success rates by up to 49%.
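The attack success rates quoted above reduce to the fraction of attack queries on which the model hallucinates; a minimal sketch (the paper's exact scoring protocol is an assumption here):

```python
def attack_success_rate(attack_outcomes):
    # attack_outcomes: one bool per attack query, True if the model asserted
    # the absent sound (query attack) or the injected event (audio attack)
    return 100.0 * sum(attack_outcomes) / len(attack_outcomes)
```

A high ASR means the model is answering from the question or the injected narration rather than from what is actually audible in the input.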
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park
The paper introduces Audio Hallucination Attacks (AHA), a framework for evaluating audio hallucinations in LALMs through innovative query-based and audio-based attack methodologies. This work is significant as it not only identifies critical vulnerabilities in state-of-the-art models but also proposes effective mitigation strategies, paving the way for more reliable audio-language models in real-world applications.
The methodology is robust, introducing a novel attack suite (AHA-Eval) that effectively evaluates the reliability of Large Audio Language Models (LALMs) through a systematic approach. The dual focus on query-based and audio-based attacks is particularly insightful, allowing for a comprehensive assessment of model vulnerabilities. The data curation and filtering process is well-structured, ensuring high-quality inputs for the evaluation. The use of LLMs for generating hallucinated sounds and the distinction between explicit and implicit queries are innovative contributions that enhance the depth of the analysis.
The experimental setup is thorough, evaluating multiple state-of-the-art LALMs and providing clear metrics for attack success rates. The results demonstrate significant vulnerabilities in these models, with high ASR values indicating a pressing need for improved grounding mechanisms. The comparison of mitigation strategies, particularly the effectiveness of AHA-Guard, is a valuable addition that highlights practical implications for enhancing model reliability.
The paper provides sufficient detail regarding the experimental setup, including model selection and training procedures, which aids reproducibility. However, the absence of publicly accessible datasets or code limits the ease with which other researchers can replicate the study. Future work should consider releasing the datasets and methodologies used for generating AHA-Eval and AHA-Guard.
One limitation is the reliance on specific LALMs for generating hallucinated sounds, which may not generalize across all audio-language models. Additionally, while the evaluation metrics are well-defined, the subjective nature of audio perception may introduce variability in human assessments that are not fully addressed. The paper also does not explore the long-term implications of these vulnerabilities in real-world applications.
The findings have significant implications for the deployment of LALMs in practical applications, particularly in fields such as automated transcription, audio description, and interactive voice response systems. By highlighting the reliability gaps in these models, the research encourages the development of more robust audio grounding techniques, ultimately enhancing the safety and trustworthiness of AI systems in audio processing.
We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
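The abstract does not spell out adaptive projection guidance, but the general projected-guidance idea splits the guidance term into components parallel and orthogonal to the conditional prediction and re-weights them; a hedged numpy sketch, with `eta` as an assumed weight on the parallel component:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # standard classifier-free guidance: extrapolate along the guidance vector
    return eps_uncond + w * (eps_cond - eps_uncond)

def projected_guidance(eps_cond, eps_uncond, w, eta=0.0):
    # decompose the guidance vector into components parallel and orthogonal
    # to the conditional prediction, then down-weight the parallel part
    diff = eps_cond - eps_uncond
    parallel = (np.dot(diff, eps_cond) / np.dot(eps_cond, eps_cond)) * eps_cond
    orthogonal = diff - parallel
    return eps_cond + (w - 1) * (orthogonal + eta * parallel)
```

With `eta=1` this collapses back to standard classifier-free guidance; suppressing the parallel component (small `eta`) is what projected-guidance variants use to curb oversaturation at high guidance scales. Whether LongCat-AudioDiT's variant matches this exact form is not stated in the summary.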
Primary: Meituan LongCat Team
All Institutions: Meituan LongCat Team
LongCat-AudioDiT presents a significant advancement in text-to-speech synthesis through its innovative approach in the waveform latent space and the introduction of adaptive projection guidance. The comprehensive experimental results and the release of code and model weights contribute to its potential impact on the field, although further exploration of its limitations and broader applicability is warranted.
The methodology presented in LongCat-AudioDiT is innovative, particularly in its non-autoregressive diffusion-based approach to text-to-speech synthesis. By operating directly in the waveform latent space rather than relying on intermediate representations like mel-spectrograms, the authors have simplified the TTS pipeline significantly. The introduction of adaptive projection guidance to replace traditional classifier-free guidance is a noteworthy advancement that enhances generation quality. The paper also addresses a critical training-inference mismatch, showcasing a thoughtful approach to improving model performance. Overall, the methodology is robust and well-structured, with clear innovations that set it apart from existing models.
The experimental evaluation is thorough, with the authors providing comprehensive results that demonstrate the effectiveness of LongCat-AudioDiT. The paper reports state-of-the-art performance on the Seed benchmark for zero-shot voice cloning, with significant improvements in speaker similarity scores. The use of ablation studies to validate the proposed modules adds credibility to the findings. However, the absence of high-quality human-annotated datasets may limit the generalizability of the results, although the authors mitigate this by achieving competitive intelligibility.
The authors mention that code and model weights are released, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed implementation guidelines and hyperparameter settings to facilitate easier replication of the results by other researchers.
One limitation identified is the reliance on a single benchmark (Seed) for evaluation, which may not fully capture the model's performance across diverse TTS tasks. Additionally, the findings regarding the Wav-VAE's reconstruction fidelity not correlating with TTS performance could indicate a need for further exploration into the underlying mechanisms affecting performance.
The potential applications of LongCat-AudioDiT are significant, particularly in areas requiring high-fidelity speech synthesis, such as virtual assistants, audiobooks, and voice cloning technologies. The model's ability to operate without complex multi-stage training pipelines could democratize access to high-quality TTS systems, fostering innovation in various industries.
MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F0 RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by 1.6x. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.
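The O(T) conditioning claim rests on the linear recurrence underlying state-space models; the sketch below shows a plain diagonal SSM scan (Mamba additionally makes A and B input-dependent, which is omitted here):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    # minimal diagonal state-space recurrence: O(T) time, O(state) memory
    #   h_t = A * h_{t-1} + B * x_t ,   y_t = <C, h_t>
    h = np.zeros_like(A)
    y = np.empty(len(x))
    for t, xt in enumerate(x):
        h = A * h + B * xt
        y[t] = C @ h
    return y
```

The bounded hidden state `h` is what gives the "bounded activation memory" and finite look-ahead streaming properties claimed in the abstract, in contrast to attention, whose cost and memory grow with sequence length.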
Primary: Chuzhou University
All Institutions: Chuzhou University, Yeshiva University
The main contribution of this paper is the introduction of MambaVoiceCloning, a novel TTS system that utilizes a fully state-space model conditioning path to improve efficiency and stability while eliminating traditional attention mechanisms. This work significantly advances the field of TTS by addressing key limitations in existing architectures, offering a promising direction for future research and applications.
The methodology of MambaVoiceCloning (MVC) is innovative in its approach to eliminate attention and recurrent layers in a text-to-speech (TTS) system, utilizing a fully state-space model (SSM) conditioning path. The integration of gated bidirectional Mamba text encoders, Temporal Bi-Mamba, and Expressive Mamba with AdaLN modulation showcases a significant architectural redesign that emphasizes efficiency and stability. The linear-time complexity and bounded activation memory are particularly noteworthy, as they address common issues in traditional TTS systems, such as memory pressure and drift in long sequences. The paper provides a clear explanation of the architecture and its components, supported by rigorous theoretical grounding.
The experimental evaluation is comprehensive, utilizing multiple datasets including LJSpeech, LibriTTS, VCTK, and CSS10, which allows for a robust assessment of MVC's performance across various conditions. The paper reports both subjective (MOS, CMOS) and objective metrics (F0 RMSE, MCD, WER), demonstrating statistically significant improvements over baseline models. The inclusion of long-form and cross-lingual evaluations further strengthens the findings, showcasing the model's generalization capabilities. However, while the improvements are statistically reliable, they are described as modest, indicating room for further enhancement.
The authors provide detailed implementation and training protocols, ensuring that the methodology can be reproduced. The use of a unified optimization schedule across all models and the provision of code on GitHub enhances reproducibility. However, the paper could benefit from more explicit details regarding hyperparameter tuning and the specific configurations used for each model.
The paper acknowledges limitations such as the focus on conditioning efficiency over fine-grained emotion control, and the model's training solely on English datasets, which may affect its performance in multilingual contexts. Additionally, the diffusion decoder remains the primary latency bottleneck, which could hinder real-time applications.
The MVC framework has potential implications for real-time TTS applications, particularly in scenarios requiring efficient memory usage and low latency. Its architecture could serve as a drop-in replacement for existing TTS systems, enhancing their deployability in resource-constrained environments. The focus on ethical considerations, such as watermarking and speaker consent, is commendable and highlights the responsible deployment of AI technologies.
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations of interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations of current frameworks in replicating the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.
Primary: Unknown
All Institutions: Unknown
This paper presents a significant advancement in multimodal dialogue generation by introducing a comprehensive dataset and evaluation framework that enhances controllability and expressiveness. The methodology and experimental results provide valuable insights into the challenges of replicating human interaction in AI-generated dialogue, paving the way for future research in this area.
The paper introduces a novel multimodal dialogue annotation pipeline that curates dialogues from movies and TV series with fine-grained annotations. This approach is significant as it addresses the limitations of existing datasets in terms of expressiveness and diversity. The methodology for generating the MM-Dia dataset and the MM-Dia-Bench testbed is well-articulated, focusing on both explicit and implicit cross-modal control. However, the paper could benefit from a more detailed explanation of the annotation process and the specific criteria used for dialogue selection.
The experiments conducted demonstrate the effectiveness of the MM-Dia dataset in enhancing controllability in multimodal dialogue generation. The evaluation metrics used, while not explicitly detailed in the abstract, are crucial for assessing the performance of the proposed models. The results indicate that current frameworks struggle to replicate the nuanced expressiveness of human interaction, highlighting an important area for future research. However, the paper could improve by providing more comprehensive quantitative results and comparisons with baseline models.
The paper does not provide sufficient details on the implementation of the models or the datasets used, which raises concerns about reproducibility. Clearer guidelines or links to supplementary materials would enhance the ability of other researchers to replicate the findings.
One significant limitation is the reliance on dialogue from movies and TV series, which may not fully capture the diversity of real-world interactions. Additionally, the paper acknowledges limitations in current frameworks to replicate human expressiveness, suggesting that further work is needed to bridge this gap.
The findings of this research have the potential to significantly impact the field of multimodal dialogue systems, particularly in applications such as virtual assistants, interactive storytelling, and entertainment. By improving controllability and expressiveness in dialogue generation, this work could lead to more engaging and human-like interactions in AI systems.
Deepfake speech detection systems are often limited to binary classification tasks and struggle to generate interpretable reasoning or provide context-rich explanations for their decisions. These models primarily extract latent embeddings for authenticity detection but fail to leverage structured acoustic evidence such as prosodic, spectral, and physiological attributes in a meaningful manner. This paper introduces CoLMbo-DF, a Feature-Guided Audio Language Model that addresses these limitations by integrating robust deepfake detection with explicit acoustic chain-of-thought reasoning. By injecting structured textual representations of low-level acoustic features directly into the model prompt, our approach grounds the model's reasoning in interpretable evidence and improves detection accuracy. To support this framework, we introduce a novel dataset of audio pairs annotated with chain-of-thought reasoning. Experiments show that our method, built on a lightweight open-source language model, significantly outperforms existing audio language model baselines despite its smaller scale, marking a notable advance in explainable deepfake speech detection.
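The abstract's key move, rendering low-level acoustic measurements as text injected into the prompt, can be illustrated with a hypothetical helper (the field names, feature set, and phrasing are illustrative, not the paper's actual schema):

```python
import numpy as np

def acoustic_feature_prompt(f0_hz, spectral_centroid_hz, jitter_pct):
    # hypothetical sketch: summarize low-level acoustic measurements as the
    # kind of structured textual evidence described in the abstract, followed
    # by an instruction eliciting chain-of-thought reasoning
    return (
        f"Acoustic evidence: mean F0 = {np.mean(f0_hz):.1f} Hz; "
        f"spectral centroid = {np.mean(spectral_centroid_hz):.0f} Hz; "
        f"jitter = {jitter_pct:.2f}%.\n"
        "Question: Using the evidence above, reason step by step about "
        "whether this speech is genuine or synthetic."
    )
```

Grounding the prompt in explicit measurements like these is what lets the model cite concrete evidence in its explanation instead of reasoning only over opaque embeddings.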
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of CoLMbo-DF, a novel audio language model that integrates structured acoustic features for improved deepfake speech detection and interpretability. This work represents a significant advancement in the field, addressing critical gaps in existing methodologies and providing a foundation for future research in explainable AI and audio analysis.
The proposed methodology introduces CoLMbo-DF, which innovatively integrates structured acoustic features into a language model framework for deepfake detection. By employing a feature-guided approach that grounds reasoning in explicit acoustic evidence, the authors effectively address the limitations of existing models that primarily rely on latent embeddings. The incorporation of chain-of-thought reasoning adds a layer of interpretability, which is crucial for understanding model decisions in deepfake detection. The methodology is well-structured and demonstrates a clear progression from problem identification to solution development.
The experimental section is robust, showcasing a new dataset of audio pairs with chain-of-thought annotations, which is a significant contribution in itself. The results indicate that CoLMbo-DF outperforms existing baselines, even when trained on a smaller scale model. However, the paper could benefit from a more detailed comparison with a wider range of existing methods and metrics to fully validate the claims of superiority. The evaluation metrics should ideally include both subjective and objective measures to comprehensively assess the model's performance.
The paper lacks detailed implementation specifics that would aid in reproducibility. While the methodology is sound, the absence of code or supplementary materials limits the ability of other researchers to replicate the results. Providing a GitHub repository or supplementary materials with code and data would significantly enhance reproducibility.
One limitation is the reliance on a specific dataset that may not generalize well to all types of deepfake speech. Additionally, while the model improves interpretability, the complexity of integrating structured acoustic features may pose challenges in real-world applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The implications of this research are substantial, particularly in the context of misinformation and digital security. By enhancing deepfake detection systems with interpretable reasoning, the work contributes to the development of more reliable tools for combating audio-based deception. The approach could also be extended to other domains requiring audio analysis and reasoning, such as voice recognition and sentiment analysis.
Speech enhancement in hearing aids remains a difficult task in nonstationary acoustic environments, mainly because current signal processing algorithms rely on fixed, manually tuned parameters that cannot adapt in situ to different users or listening contexts. This paper introduces a unified modular framework that formulates signal processing, learning, and personalization as Bayesian inference with explicit uncertainty tracking. The proposed framework replaces ad hoc algorithm design with a single probabilistic generative model that continuously adapts to changing acoustic conditions and user preferences. It extends spectral subtraction with principled mechanisms for in-situ personalization and adaptation to acoustic context. The system is implemented as an interconnected probabilistic state-space model, and inference is performed via variational message passing in the RxInfer.jl probabilistic programming environment, enabling real-time Bayesian processing under hearing-aid constraints. Proof-of-concept experiments on the VoiceBank+DEMAND corpus show competitive speech quality and noise reduction with 85 effective parameters. The framework provides an interpretable, data-efficient foundation for uncertainty-aware, adaptive hearing-aid processing and points toward devices that learn continuously through probabilistic inference.
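The framework extends spectral subtraction, whose classical point-estimate form is shown below; the paper's contribution is wrapping a rule like this in a probabilistic generative model with uncertainty tracking, which this deterministic sketch does not capture:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=1.0, floor=0.05):
    # classic magnitude-domain spectral subtraction: subtract a scaled noise
    # estimate from the noisy magnitude spectrum, with a spectral floor to
    # avoid negative magnitudes and musical-noise artifacts
    est = noisy_mag - alpha * noise_mag
    return np.maximum(est, floor * noisy_mag)
```

In the Bayesian reformulation, fixed knobs like `alpha` and `floor` become latent quantities inferred in situ from the acoustic context and user feedback, rather than manually tuned constants.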
Primary: Eindhoven University of Technology
All Institutions: Eindhoven University of Technology, Lazy Dynamics B.V., GN Advanced Science
The main contribution of this paper is the introduction of a unified Bayesian framework for speech enhancement that adapts to user preferences and acoustic conditions in real-time. This work represents a meaningful advancement in the field of audio processing, particularly in the context of hearing aids, by providing a robust and interpretable model that can learn from its environment.
The paper presents a novel approach to speech enhancement by framing the problem within a Bayesian inference framework. This methodology allows for real-time adaptation to varying acoustic environments and user preferences, which is a significant improvement over traditional fixed-parameter algorithms. The use of a probabilistic generative model and variational message passing for inference is well-justified, and the modular architecture enhances the system's flexibility. However, the paper could benefit from a more detailed explanation of the underlying assumptions of the Bayesian model and how they impact the performance in diverse scenarios.
The experiments conducted on the VoiceBank+DEMAND corpus demonstrate the effectiveness of the proposed framework in terms of speech quality and noise reduction. The results are promising, showing competitive performance with a relatively small number of parameters (85). However, the paper lacks a comprehensive comparison with state-of-the-art methods and does not provide subjective evaluations (e.g., MOS scores) that would strengthen the claims of improved speech quality.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. While the authors mention the use of the RxInfer.jl environment for inference, there is no link to the code or detailed instructions for replicating the experiments, which is critical for the validation of the proposed methods.
One limitation is the reliance on a specific dataset (VoiceBank+DEMAND), which may not fully represent the diversity of real-world acoustic environments. Additionally, the paper does not address potential computational constraints in real-time applications, particularly in terms of the scalability of the model when deployed in actual hearing aids.
The proposed framework has significant implications for the development of adaptive hearing aids that can personalize user experiences in real-time. By enabling continuous learning and adaptation, this research could lead to improved accessibility for individuals with hearing impairments, enhancing their quality of life in various acoustic settings.
Voice design from natural language aims to generate speaker timbres directly from free-form textual descriptions, allowing users to create voices tailored to specific roles, personalities, and emotions. Such controllable voice creation benefits a wide range of downstream applications-including storytelling, game dubbing, role-play agents, and conversational assistants, making it a significant task for modern Text-to-Speech models. However, existing models are largely trained on carefully recorded studio data, which produces speech that is clean and well-articulated, yet lacks the lived-in qualities of real human voices. To address these limitations, we present MOSS-VoiceGenerator, an open-source instruction-driven voice generation model that creates new timbres directly from natural language prompts. Motivated by the hypothesis that exposure to real-world acoustic variation produces more perceptually natural voices, we train on large-scale expressive speech data sourced from cinematic content. Subjective preference studies demonstrate its superiority in overall performance, instruction-following, and naturalness compared to other voice design models.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
MOSS-VoiceGenerator presents an innovative approach to generating realistic voices from natural language descriptions, addressing key limitations in existing TTS systems. The combination of a diverse training dataset and advanced model architecture positions this work as a significant contribution to the field of audio synthesis and voice generation.
The methodology presented in MOSS-VoiceGenerator is robust and innovative, leveraging a large-scale dataset derived from cinematic content to train a model that generates realistic voices from natural language descriptions without the need for reference audio. The authors employ a comprehensive data collection and annotation process, ensuring a diverse and expressive dataset. The model architecture integrates autoregressive techniques with a discrete framework, which simplifies deployment and enhances instruction-following capabilities. However, the reliance on a specific dataset type (cinematic) may limit generalizability to other domains.
The experimental evaluation is thorough, utilizing both subjective and objective metrics to assess the model's performance. The inclusion of a public benchmark (InstructTTSEval) for objective evaluation adds credibility to the results. The subjective preference studies provide valuable insights into user experience and model performance across different dimensions. The results indicate that MOSS-VoiceGenerator outperforms several existing models, showcasing its effectiveness in generating expressive and natural-sounding speech.
The paper outlines the training strategy and data processing pipeline in detail, which aids in reproducibility. However, the lack of a publicly accessible demo or project URL limits the ability for other researchers to replicate the work easily. Open-sourcing the model and data pipeline would significantly enhance reproducibility and community engagement.
The authors acknowledge several limitations, including the focus on Chinese and English, which restricts language diversity. The English dataset is smaller, potentially affecting performance in English voice generation. Additionally, the denoising process may introduce artifacts, and the model's output can occasionally lack stability. These limitations suggest areas for future improvement and expansion.
MOSS-VoiceGenerator has significant potential applications in various domains such as audiobook narration, game dubbing, and conversational agents, where realistic and expressive voice generation is crucial. The open-source nature of the project could foster further research and development in controllable TTS systems, contributing to advancements in human-computer interaction and accessibility.
We introduce ParaSpeechCLAP, a dual-encoder contrastive model that maps speech and text style captions into a common embedding space, supporting a wide range of intrinsic (speaker-level) and situational (utterance-level) descriptors (such as pitch, texture, and emotion) far beyond the narrow set handled by existing models. We train specialized ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational models alongside a unified ParaSpeechCLAP-Combined model, finding that specialization yields stronger performance on individual style dimensions while the unified model excels on compositional evaluation. We further show that ParaSpeechCLAP-Intrinsic benefits from an additional classification loss and class-balanced training. We demonstrate our models' performance on style caption retrieval, speech attribute classification, and as an inference-time reward model that improves style-prompted TTS without additional training. ParaSpeechCLAP outperforms baselines on most metrics across all three applications. Our models and code are released at https://github.com/ajd12342/paraspeechclap.
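The dual-encoder contrastive objective is presumably the CLIP/CLAP-style symmetric InfoNCE loss over paired speech and caption embeddings; a numpy sketch under that assumption (the temperature value is illustrative):

```python
import numpy as np

def clap_contrastive_loss(speech_emb, text_emb, temperature=0.07):
    # L2-normalize both modalities, build the pairwise similarity matrix,
    # then average cross-entropy in both directions (speech->text, text->speech)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature

    def xent(l):
        # diagonal entries are the matched pairs (the positives)
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this pulls each utterance toward its own style caption and away from the other captions in the batch, which is what makes the shared embedding space usable for retrieval, attribute classification, and as an inference-time reward.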
Primary: New York University
All Institutions: New York University, The University of Texas at Austin
ParaSpeechCLAP introduces a dual-encoder model that effectively maps speech and rich textual style descriptions into a common embedding space, significantly advancing the state of the art in style-prompted speech applications. The comprehensive evaluation of its performance across multiple tasks and the innovative use of a classification loss for intrinsic attributes highlight its potential impact on the field of audio machine learning.
The methodology presented in ParaSpeechCLAP is robust, utilizing a dual-encoder architecture that effectively aligns speech and text style captions in a shared embedding space. The introduction of specialized models (ParaSpeechCLAP-Intrinsic and ParaSpeechCLAP-Situational) alongside a unified model demonstrates a thoughtful approach to handling diverse stylistic attributes. The use of a multitask loss for the intrinsic model enhances its performance by allowing it to predict specific style attributes, which is a significant methodological advancement. However, the paper could benefit from a more detailed explanation of the encoder architecture and the rationale behind the choice of specific models.
The experimental evaluation is thorough, with a clear focus on three applications: style caption retrieval, speech attribute classification, and inference-time guidance for TTS. The results indicate that ParaSpeechCLAP consistently outperforms existing baselines across various metrics, showcasing its effectiveness. The use of multiple datasets and evaluation metrics strengthens the findings, although the paper could enhance clarity by providing more context on the datasets used and their relevance to the tasks.
The paper provides a reasonable level of detail regarding the training setup, including hyperparameters and dataset descriptions. The authors have made their models and code publicly available, which is a positive step towards reproducibility. However, some aspects, such as the specific configurations of the encoder architectures and the training process, could be elaborated further to ensure that other researchers can replicate the results without ambiguity.
One limitation noted is the requirement for selecting the appropriate model variant at inference time, which could complicate deployment in practical applications. Additionally, while the unified model shows promise, it does not outperform specialized models on individual tasks, indicating a potential trade-off in performance that needs to be addressed. The paper also mentions the linear scaling of the best-of-N guidance strategy, which may not be efficient for larger N values.
The implications of this research are significant, particularly in enhancing the capabilities of style-prompted TTS systems and expressive speech retrieval. By supporting a broader range of intrinsic and situational attributes, ParaSpeechCLAP could facilitate advancements in various applications, including expressive spoken dialog systems and personalized speech synthesis. The work also sets a foundation for future research in rich style modeling and evaluation benchmarks, which could further enrich the field.
Project VAANI is an initiative to create an India-representative multi-modal dataset that comprehensively maps India's linguistic diversity, starting with 165 districts across the country in its first two phases. Speech data is collected through a carefully structured process that uses image-based prompts to encourage spontaneous responses. Images are captured through a separate process that encompasses a broad range of topics, gathered from both within and across districts. The collected data undergoes a rigorous multi-stage quality evaluation, including both automated and manual checks, to ensure the highest possible standards in audio quality and transcription accuracy. Following this thorough validation, we have open-sourced around 289K images, approximately 31,270 hours of audio recordings, and around 2,067 hours of transcribed speech, encompassing 112 languages from 165 districts across 31 States and Union Territories. Notably, a significant share of these languages are represented for the first time in a dataset of this scale, making the VAANI project a groundbreaking effort in preserving and promoting linguistic inclusivity. This data can be instrumental in building inclusive speech models for India, and in advancing research and development across speech, image, and multimodal applications.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science, Robotics Technology Park (ARTPARK), Quest Alliance, Google DeepMind
The VAANI project represents a groundbreaking effort to create a comprehensive dataset that captures India's linguistic diversity, with significant implications for inclusive speech technology development. The methodology is innovative, but the paper would benefit from more detailed experimental evaluations and reproducibility guidelines to maximize its impact in the field.
The methodology employed in the VAANI project is commendable, particularly in its structured approach to data collection and quality evaluation. The use of image-based prompts to elicit spontaneous speech responses is innovative and effectively captures the linguistic diversity of India. The multi-stage quality evaluation process, which includes both automated and manual checks, ensures high standards of audio quality and transcription accuracy, which is crucial for the reliability of the dataset. However, the paper could benefit from a more detailed description of the specific algorithms or techniques used in the quality evaluation process.
The paper outlines a substantial dataset comprising 31,270 hours of audio and 2,067 hours of transcribed speech across 112 languages. This extensive dataset is a significant contribution to the field, particularly for underrepresented languages. However, the paper lacks detailed experimental results demonstrating the effectiveness of the dataset in training inclusive speech models. It would be beneficial to include baseline comparisons or performance metrics to illustrate the dataset's impact on model performance.
The paper does not provide sufficient details regarding the implementation of the data collection and quality evaluation processes, which may hinder reproducibility. While the dataset is open-sourced, additional documentation or guidelines for replicating the data collection methodology would enhance reproducibility.
One limitation of the study is the potential bias in data collection due to the selection of districts and topics for image prompts. Additionally, while the dataset aims to represent linguistic diversity, the focus on only 165 districts may not capture the full spectrum of languages and dialects present in India. The paper also does not address the challenges of data privacy and ethical considerations in collecting speech data from individuals.
The VAANI project has the potential to significantly impact the development of inclusive speech technologies in India, promoting linguistic diversity and accessibility. By providing a comprehensive dataset, it can facilitate research in speech recognition, multimodal applications, and language preservation efforts. This initiative could also inspire similar projects in other linguistically diverse regions, contributing to global efforts in digital inclusivity.
Evaluating AI-generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion Scores (MOS) remain the gold standard but are costly and impractical at scale. We present a hierarchical multimodal architecture for perceptually meaningful dubbing evaluation, integrating complementary cues from audio, video, and text. The model captures fine-grained features such as speaker identity, prosody, and content from audio; facial expressions and scene-level cues from video; and semantic context from text, which are progressively fused through intra- and inter-modal layers. Lightweight LoRA adapters enable parameter-efficient fine-tuning across modalities. To overcome limited subjective labels, we derive proxy MOS by aggregating objective metrics with weights optimized via active learning. The proposed architecture was trained on 12k Hindi-English bidirectional dubbed clips, followed by fine-tuning with human MOS. Our approach achieves strong perceptual alignment (PCC > 0.75), providing a scalable solution for automatic evaluation of AI-dubbed content.
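The proxy-MOS step above amounts to mapping a vector of objective metrics to the 1-5 MOS scale via a weighted combination. A minimal sketch, assuming metrics are pre-normalized to [0, 1] and using an illustrative metric set (the paper's exact metrics and its active-learning weight optimization are not reproduced):

```python
import numpy as np

def proxy_mos(metrics, weights):
    """Map objective metrics (e.g. lip-sync score, ASR intelligibility,
    speaker similarity, emotion match -- names illustrative) to a 1-5
    proxy MOS via a convex combination.

    In the paper the weights are optimized with active learning; here
    they are simply supplied by the caller.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # keep the combination convex
    score01 = float(np.dot(np.asarray(metrics, dtype=float), w))
    return 1.0 + 4.0 * score01                        # rescale [0,1] -> MOS [1,5]
```

This proxy label then stands in for scarce human MOS during the first training stage, before fine-tuning on real ratings.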
Primary: unknown
All Institutions: XYZ agency
The paper presents a novel hierarchical multimodal architecture for evaluating AI-dubbed content, addressing the challenges of subjective quality assessment through innovative methodologies and strong experimental validation. The contributions made in this work are significant, providing a foundation for future advancements in automated audio-visual quality assessment.
The proposed hierarchical multimodal architecture is innovative in its integration of audio, video, and text features for evaluating AI-dubbed content. The use of lightweight LoRA adapters for parameter-efficient fine-tuning is a valuable contribution, allowing for effective adaptation across modalities without extensive computational resources. The two-stage training pipeline, which combines active learning for proxy MOS generation with fine-tuning based on human ratings, demonstrates a thoughtful approach to overcoming the challenges of limited subjective labels. However, the methodology could benefit from a more detailed explanation of the active learning process and the specific metrics used to derive Proxy MOS.
The experiments are well-structured, utilizing two publicly available datasets (MELD and M2H2) to validate the proposed model. The use of subjective evaluation with a diverse participant group strengthens the findings, and the reported results show a strong correlation with human ratings, indicating the model's effectiveness. The ablation studies provide insights into the contributions of each modality, although more detailed statistical analysis could enhance the robustness of the claims. The results demonstrate that the hierarchical integration of modalities significantly improves performance, which is a critical finding for the field.
The paper provides a reasonable level of detail regarding the experimental setup, including the datasets, training parameters, and evaluation metrics. However, the absence of a publicly accessible code repository or demo limits the reproducibility of the results. Including such resources would significantly enhance the paper's impact and facilitate further research in this area.
The primary limitation is the reliance on human ratings for fine-tuning, which can introduce bias and variability. Additionally, the model's performance may be influenced by the quality and diversity of the training data, which could affect its generalizability to other languages or dubbing contexts. The paper does not address potential challenges in scaling the model to larger datasets or different languages, which could limit its applicability.
The proposed architecture has significant implications for the field of AI-generated content evaluation, particularly in enhancing the quality of dubbed media. By providing a scalable solution for automatic assessment, this research could facilitate the widespread adoption of AI dubbing technologies in various industries, including entertainment and education. Furthermore, the integration of multimodal cues aligns with current trends in AI, promoting more human-centered approaches to content generation and evaluation.
We present the first systematic Membership Inference Attack (MIA) evaluation of Large Audio Language Models (LALMs). As audio encodes non-semantic information, it induces severe train/test distribution shifts and can lead to spurious MIA performance. Using a multi-modal blind baseline based on textual, spectral, and prosodic features, we demonstrate that common speech datasets exhibit near-perfect train/test separability (AUC ≈ 1.0) even without model inference, and that standard MIA scores strongly correlate with these blind acoustic artifacts (correlation > 0.7). Using this blind baseline, we identify distribution-matched datasets that enable reliable MIA evaluation without distribution-shift confounds. We benchmark multiple MIA methods and conduct modality disentanglement experiments on these datasets. The results reveal that LALM memorization is cross-modal, arising only from binding a speaker's vocal identity with its text. These findings establish a principled standard for auditing LALMs beyond spurious correlations.
Primary: National Taiwan University
All Institutions: National Taiwan University
The main contribution of this paper is the establishment of a principled standard for auditing LALMs through the identification of distribution-matched datasets and the systematic evaluation of MIA methods. This work is significant as it addresses a critical gap in understanding the vulnerabilities of audio models and sets the stage for future research in this area.
The paper introduces a systematic approach to Membership Inference Attacks (MIA) against Large Audio Language Models (LALMs). The authors utilize a multi-modal blind baseline that incorporates textual, spectral, and prosodic features to evaluate MIA performance. This methodology is innovative as it highlights the challenges posed by non-semantic information in audio datasets, which can lead to misleading MIA results. The identification of distribution-matched datasets for reliable MIA evaluation is a significant methodological contribution, as it provides a clearer framework for understanding model vulnerabilities.
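The blind-baseline idea can be sketched directly: score each clip with acoustic features alone (no model queries) and compute ROC-AUC between member and non-member sets. The sketch below uses a single toy feature (the per-clip feature mean) as a stand-in for the paper's textual/spectral/prosodic feature set; an AUC near 1.0 flags a split that is separable by artifacts alone, so any MIA score on it is confounded.

```python
import numpy as np

def blind_auc(member_feats, nonmember_feats):
    """ROC-AUC of a blind (model-free) membership score.

    Each row is one clip's feature vector; the score here is simply the
    feature mean (illustrative only). AUC is computed via the rank-based
    Mann-Whitney U statistic.
    """
    scores = np.r_[member_feats.mean(axis=1), nonmember_feats.mean(axis=1)]
    labels = np.r_[np.ones(len(member_feats)), np.zeros(len(nonmember_feats))]
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Distribution-matched splits should drive this blind AUC toward 0.5, which is the criterion the paper uses to certify that an MIA evaluation is not riding on dataset artifacts.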
The experiments benchmark multiple MIA methods and include modality disentanglement experiments, which are well-structured and provide insightful results. The correlation of standard MIA scores with blind acoustic artifacts is particularly noteworthy, revealing the potential pitfalls in existing evaluation metrics. The authors present a comprehensive analysis of their findings, demonstrating that LALM memorization is cross-modal, which adds depth to the experimental evaluation.
The paper mentions the use of generative AI tools for manuscript preparation and software development, which raises questions about the reproducibility of the experimental results. However, the authors state that the core ideas and analyses are their original work. The lack of a dedicated project or code repository limits the ability for others to reproduce the experiments fully, which is a significant drawback.
One limitation is the reliance on specific datasets, which may not generalize to all audio language models. Additionally, while the paper identifies distribution shifts as a critical factor in MIA evaluation, it does not explore the implications of these shifts in detail. The absence of a demo or project URL further limits the accessibility of the findings.
The findings have significant implications for the auditing of LALMs, particularly in understanding model vulnerabilities and the risks associated with spurious correlations in audio data. This research could influence future work in the field, particularly in the design of more robust models and evaluation frameworks that account for non-semantic information in audio.
Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We call this failure the evidence bottleneck: state-of-the-art systems show larger deficits in evidence extraction than in downstream reasoning, suggesting that the main limitation lies in upstream perception rather than reasoning policy. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that combines Whisper and CED-Base through non-compressive, time-aligned fusion. EvA first aggregates intermediate CED layers to preserve multi-scale acoustic cues, then aligns the aggregated CED features to the Whisper timeline and adds the two streams without changing sequence length. We also build EvA-Perception, a large-scale open-source training set with about 54K event-ordered captions (150 h) and about 500K QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception scores on MMAU, MMAR, and MMSU, and improves over Kimi-Audio-7B on all reported metrics, with the largest gains on perception-heavy splits. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning.
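The non-compressive, time-aligned fusion described above can be sketched as: average the intermediate CED layers, resample the aggregate onto the Whisper frame timeline by linear interpolation, project to Whisper's width, and add. The function and parameter names below are illustrative assumptions, not the EvA implementation.

```python
import numpy as np

def time_aligned_additive_fusion(whisper_feats, ced_layer_feats, proj):
    """Fuse CED features into the Whisper stream without changing length.

    whisper_feats:   (T_w, D_w) Whisper frame features
    ced_layer_feats: (L, T_c, D_c) stack of intermediate CED layers
    proj:            (D_c, D_w) stand-in for the learned projection
    """
    T_w, _ = whisper_feats.shape
    ced = ced_layer_feats.mean(axis=0)           # (T_c, D_c) multi-scale aggregate
    T_c = ced.shape[0]
    # map each Whisper frame index onto the CED timeline
    src = np.linspace(0, T_c - 1, T_w)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T_c - 1)
    frac = (src - lo)[:, None]
    aligned = (1 - frac) * ced[lo] + frac * ced[hi]   # (T_w, D_c)
    return whisper_feats + aligned @ proj             # (T_w, D_w), length preserved
```

Because the CED stream is added rather than pooled or cross-attended into a shorter sequence, fine-grained acoustic evidence survives at every Whisper frame, which is the point of the evidence-first design.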
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the EvA architecture, which effectively addresses the evidence bottleneck in large audio language models through a dual-path approach that preserves acoustic evidence before reasoning. This work significantly advances the state of the art in audio understanding, demonstrating the importance of upstream perception in achieving better performance in complex acoustic scenes.
The proposed EvA architecture introduces a dual-path system that effectively combines speech and non-speech audio processing through hierarchical evidence aggregation and non-compressive, time-aligned fusion. This innovative approach addresses the identified evidence bottleneck in existing LALMs, allowing for improved acoustic evidence retention before reasoning. The methodology is well-structured, with clear explanations of the dual-path architecture and the training process, including the creation of the EvA-Perception dataset. However, while the architecture is novel, it builds on existing frameworks such as Whisper and CED-Base, which may limit the perceived originality of the approach.
The experiments conducted demonstrate the effectiveness of the EvA architecture across multiple benchmarks (MMAU, MMAR, MMSU, and CochlScene), with significant improvements in perception-heavy tasks. The results are compelling, showcasing the model's ability to outperform existing systems, particularly in preserving acoustic evidence. The use of a unified zero-shot protocol for evaluation adds rigor to the experimental design. However, the paper could benefit from more detailed comparisons with a broader range of existing models to contextualize the improvements more effectively.
The paper provides a comprehensive overview of the training strategy, including hyperparameters and the two-stage training process. However, the lack of specific implementation details and code availability may hinder full reproducibility. The authors mention that the EvA model is open-source, which is a positive aspect, but the absence of a direct link to the repository limits accessibility for other researchers.
The paper acknowledges several limitations, including the focus on English-only captions in the EvA-Perception dataset and the need for more systematic multilingual evaluation. Additionally, the temporal reasoning capabilities are constrained by the soft event boundaries in the training data, and the model's performance on music analysis is limited by the lack of expert-level concepts. These limitations suggest areas for future work and improvement.
The advancements made by the EvA architecture have significant implications for audio understanding applications, particularly in complex acoustic environments. By improving the retention of acoustic evidence, the model can enhance various tasks, including audio captioning, event detection, and question answering. The open-source nature of the dataset and model also encourages further research and development in the field, potentially leading to more robust audio processing systems in diverse applications.
Reference Audio-Visual Segmentation (Ref-AVS) aims to segment objects in audible videos based on multimodal cues in reference expressions. Previous methods overlook the explicit recognition of expression difficulty and dominant modality in multimodal cues, over-rely on the quality of the instruction-tuning dataset for object reasoning, and lack reflective validation of segmentation results, leading to erroneous mask predictions. To address these issues, in this paper, we propose a novel training-free Multi-Agent Recognition, Reasoning, and Reflection framework to achieve high-quality Reference Audio-Visual Segmentation, termed MAR3. Incorporating the sociological Delphi theory to achieve robust analysis, a Consensus Multimodal Recognition mechanism is proposed that enables LLM agents to explicitly recognize the difficulty of reference expressions and the dominant modality of multimodal cues. Based on our modality-dominant difficulty rule, we propose an adaptive Collaborative Object Reasoning strategy to reliably reason about the referred object. To further ensure precise mask prediction, we develop a Reflective Learning Segmentation mechanism, in which a check agent examines intermediate segmentation results and iteratively corrects the object text prompt of the segment agent. Experiments demonstrate that MAR3 achieves superior performance (69.2% in J&F) on the Ref-AVSBench dataset, outperforming the prior state of the art by 3.4% absolute.
Primary: Inner Mongolia University
All Institutions: Inner Mongolia University
The paper presents MAR3, a novel multi-agent framework for Reference Audio-Visual Segmentation that effectively addresses key challenges in multimodal integration and reasoning. Its structured methodology and strong experimental results position it as a significant contribution to the field of audio-visual machine learning.
The proposed MAR3 framework introduces a novel approach to Reference Audio-Visual Segmentation (Ref-AVS) by decomposing the task into three distinct phases: Recognition, Reasoning, and Reflection. This multi-agent system leverages the Consensus Multimodal Recognition mechanism, which incorporates the Delphi theory to enhance the understanding of multimodal cues, and the Collaborative Object Reasoning strategy that adapts to the difficulty of reference expressions. The Reflective Learning Segmentation mechanism further improves segmentation accuracy through iterative corrections. This structured approach is innovative and addresses key limitations in existing methods, such as the reliance on high-quality instruction-tuning datasets and the lack of reflective validation.
The experiments conducted on the Ref-AVSBench dataset demonstrate that MAR3 achieves state-of-the-art performance, surpassing previous methods by a notable margin (3.4% improvement in J&F score). The paper provides a comprehensive evaluation, including ablation studies that validate the effectiveness of each component of the framework. The metrics used (Jaccard index and F-score) are appropriate for the task, and the results are presented clearly, showcasing the advantages of the proposed method.
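The J&F metric reported above averages a region score (Jaccard index, J) and an F-measure (F) per mask. A minimal sketch for binary masks, using the pixel-level F-measure as a simplification of the usual boundary F-score:

```python
import numpy as np

def j_and_f(pred, gt, eps=1e-8):
    """Mean of region Jaccard (J) and pixel F-measure (F) for binary masks.

    Note: benchmark F is normally a *boundary* F-score; the pixel-level
    version here is a simplified illustration of the same idea.
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    j = inter / (union + eps)                       # region similarity (IoU)
    prec = inter / (pred.sum() + eps)
    rec = inter / (gt.sum() + eps)
    f = 2 * prec * rec / (prec + rec + eps)         # harmonic mean of P and R
    return (j + f) / 2
```

A perfect mask scores 1.0 on both components; a fully disjoint prediction scores 0, so the reported 69.2% sits on a 0-100 version of this scale.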
While the paper outlines the methodology and experimental setup, it lacks detailed implementation specifics that would facilitate reproduction. Information on the models used for each agent and the exact configurations of the experiments is somewhat limited. Providing code or a more detailed supplementary material would enhance reproducibility.
One notable limitation is that the proposed framework has not been specifically designed to handle cases where the reference expression refers to non-existent objects in the video. This could limit its applicability in real-world scenarios where such ambiguities arise. Additionally, the reliance on multiple agents may introduce complexity in deployment and scalability.
The MAR3 framework has significant implications for applications in film production, video editing, and content creation, where accurate segmentation of audio-visual elements is crucial. By improving the reliability of segmentation in dynamic scenes, this research could enhance user experiences in multimedia applications and contribute to advancements in AI-driven content generation.
Large Language Models (LLMs) are strong decoders for Serialized Output Training (SOT) in two-talker Automatic Speech Recognition (ASR), yet their performance degrades substantially in challenging conditions such as three-talker mixtures. A key limitation is that current systems inject acoustic evidence only through a projected prefix, which can be lossy and imperfectly aligned with the LLM input space, providing insufficient fine-grained grounding during decoding. Addressing this limitation is crucial for robust multi-talker ASR, especially in three-talker mixtures. This paper improves LLM-based multi-talker ASR by explicitly injecting talker-aware acoustic evidence into the decoder. We first revisit Connectionist Temporal Classification (CTC)-derived prefix prompting and compare three variants with increasing acoustic content. The CTC information is obtained using the serialized CTC proposed in our previous works. While acoustic-enriched prompts outperform the SOT-only baseline, prefix-only conditioning remains inadequate for three-talker mixtures. We therefore propose a lightweight gated residual cross-attention adapter and design a two-stage acoustic adaptation framework based on low-rank updates (LoRA). In Stage 1, we insert gated cross-attention adapters after the self-attention sub-layer to stably inject acoustic embeddings as external memory. In Stage 2, we refine both the cross-attention adapters and the pretrained LLM's self-attention projections using parameter-efficient LoRA, improving robustness for large backbones under limited data; the learned updates are merged into the base weights for inference. Experiments on Libri2Mix/Libri3Mix under clean and noisy conditions show consistent gains, with particularly large improvements in three-talker settings.
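The gated residual cross-attention adapter can be sketched as a single-head attention over the acoustic embeddings acting as external memory, with a learnable gate (near zero at initialization, so insertion does not perturb the pretrained LLM) scaling the residual. Weight names are illustrative, not the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_residual_cross_attention(hidden, acoustic, Wq, Wk, Wv, gate):
    """Single-head sketch: LLM hidden states (T_txt, d) query the
    talker-aware acoustic embeddings (T_aud, d) as external memory;
    tanh(gate) scales the residual so gate=0 is an exact identity.
    """
    q = hidden @ Wq
    k = acoustic @ Wk
    v = acoustic @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (T_txt, T_aud)
    return hidden + np.tanh(gate) * (attn @ v)         # gated residual injection
```

The zero-initialized gate is what makes Stage 1 insertion "stable": the adapter starts as a no-op and only gradually admits acoustic evidence as the gate is learned.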
Primary: Kyoto University
All Institutions: Kyoto University, National Institute of Information and Communications Technology
This paper presents a significant advancement in multi-talker ASR by introducing a two-stage acoustic adaptation framework that enhances the integration of acoustic evidence into LLMs. The innovative methodology and promising experimental results position it as a valuable contribution to the field of audio processing and machine learning.
The paper proposes a two-stage acoustic adaptation framework that integrates gated cross-attention adapters into a large language model (LLM) for multi-talker automatic speech recognition (ASR). The methodology is innovative in addressing the limitations of prefix-only conditioning by dynamically injecting talker-aware acoustic evidence during decoding. The use of low-rank updates (LoRA) for parameter-efficient adaptation is particularly noteworthy, as it enhances the model's robustness under limited data conditions. The systematic exploration of CTC-derived prefix prompting variants adds depth to the methodology, although the paper could benefit from a clearer description of the experimental setup and hyperparameter tuning processes.
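The Stage 2 LoRA step, including the merge into base weights for inference, follows the standard low-rank-update algebra. A minimal sketch (scaling convention alpha/r as in common LoRA implementations; names are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Training-time forward: y = x W + (alpha/r) * x A B,
    where A (d_in, r) and B (r, d_out) are the trainable low-rank factors
    and the base weight W stays frozen."""
    return x @ W + (alpha / r) * (x @ A @ B)

def merge_lora(W, A, B, alpha=16, r=4):
    """Fold the learned low-rank update into the base weight so inference
    needs only a single dense matmul, as done after Stage 2."""
    return W + (alpha / r) * (A @ B)
```

Because the merge is exact (same algebra, just factored differently), the deployed model incurs no extra latency from the adaptation.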
The experiments are well-structured, utilizing the Libri2Mix and Libri3Mix datasets to evaluate the proposed methods under both clean and noisy conditions. The results demonstrate consistent performance gains, particularly in three-talker scenarios, validating the effectiveness of the proposed gated cross-attention adapters. However, the paper lacks detailed statistical analysis of the results and comparisons with state-of-the-art methods beyond the baseline systems, which would strengthen the claims of improvement.
The paper provides a comprehensive overview of the model architecture and training procedures, but it lacks specific implementation details that would facilitate reproducibility, such as exact hyperparameter settings and training schedules. The absence of a publicly available code repository further hinders reproducibility efforts.
A notable limitation is the reliance on the performance of the gated cross-attention mechanism, which, while effective, still falls short of the robustness offered by the serialized CTC approach in certain scenarios. Additionally, the paper does not address the potential computational overhead introduced by the proposed adaptations, which may limit practical deployment in real-time systems.
The advancements presented in this paper have significant implications for multi-talker ASR systems, particularly in applications such as conference transcription, voice assistants, and accessibility technologies. By improving the ability of LLMs to handle overlapping speech, the research could enhance communication tools for diverse user groups, including those with hearing impairments.