We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, LongCat-AudioDiT operates directly in the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
Primary: Meituan LongCat Team
All Institutions: Meituan LongCat Team
LongCat-AudioDiT presents a significant advancement in text-to-speech synthesis through its innovative approach in the waveform latent space and the introduction of adaptive projection guidance. The comprehensive experimental results and the release of code and model weights contribute to its potential impact on the field, although further exploration of its limitations and broader applicability is warranted.
The methodology presented in LongCat-AudioDiT is innovative, particularly in its non-autoregressive diffusion-based approach to text-to-speech synthesis. By operating directly in the waveform latent space rather than relying on intermediate representations like mel-spectrograms, the authors have simplified the TTS pipeline significantly. The introduction of adaptive projection guidance to replace traditional classifier-free guidance is a noteworthy advancement that enhances generation quality. The paper also addresses a critical training-inference mismatch, showcasing a thoughtful approach to improving model performance. Overall, the methodology is robust and well-structured, with clear innovations that set it apart from existing models.
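To make the guidance change more concrete, the sketch below illustrates projection-style guidance in general: the classifier-free-guidance direction is decomposed into components parallel and orthogonal to the conditional prediction, and the parallel component is attenuated. This is a rough illustration only, not the paper's exact adaptive projection guidance; the function name, the weighting scheme, and the tensor layout are assumptions.

```python
import torch

def projected_guidance(cond, uncond, scale=2.0, parallel_weight=0.0, eps=1e-8):
    """Combine conditional/unconditional diffusion predictions with a
    projection-based guidance rule (illustrative, not the paper's exact method)."""
    diff = (cond - uncond).flatten(1)            # guidance direction per sample
    ref = cond.flatten(1)                        # reference direction
    # Component of the guidance direction parallel to the conditional output.
    coeff = (diff * ref).sum(dim=1, keepdim=True) / (ref.pow(2).sum(dim=1, keepdim=True) + eps)
    parallel = coeff * ref
    orthogonal = diff - parallel
    # Keep the orthogonal component at full strength; attenuate the parallel one,
    # which is commonly associated with over-saturation at high guidance scales.
    guided = ref + scale * (orthogonal + parallel_weight * parallel)
    return guided.view_as(cond)
```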
The experimental evaluation is thorough, with the authors providing comprehensive results that demonstrate the effectiveness of LongCat-AudioDiT. The paper reports state-of-the-art performance on the Seed benchmark for zero-shot voice cloning, with significant improvements in speaker similarity scores. The use of ablation studies to validate the proposed modules adds credibility to the findings. However, the absence of high-quality human-annotated datasets may limit the generalizability of the results, although the authors mitigate this by achieving competitive intelligibility.
The authors mention that code and model weights are released, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed implementation guidelines and hyperparameter settings to facilitate easier replication of the results by other researchers.
One limitation identified is the reliance on a single benchmark (Seed) for evaluation, which may not fully capture the model's performance across diverse TTS tasks. Additionally, the findings regarding the Wav-VAE's reconstruction fidelity not correlating with TTS performance could indicate a need for further exploration into the underlying mechanisms affecting performance.
The potential applications of LongCat-AudioDiT are significant, particularly in areas requiring high-fidelity speech synthesis, such as virtual assistants, audiobooks, and voice cloning technologies. The model's ability to operate without complex multi-stage training pipelines could democratize access to high-quality TTS systems, fostering innovation in various industries.
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), together with an evaluation suite, AHA-Eval, comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose AHA-Guard, a 120K QA post-alignment dataset that reduces attack success rates by up to 49%.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park
The paper introduces Audio Hallucination Attacks (AHA), a framework for evaluating audio hallucinations in LALMs through innovative query-based and audio-based attack methodologies. This work is significant as it not only identifies critical vulnerabilities in state-of-the-art models but also proposes effective mitigation strategies, paving the way for more reliable audio-language models in real-world applications.
The methodology is robust, introducing a novel attack suite (AHA-Eval) that effectively evaluates the reliability of Large Audio Language Models (LALMs) through a systematic approach. The dual focus on query-based and audio-based attacks is particularly insightful, allowing for a comprehensive assessment of model vulnerabilities. The data curation and filtering process is well-structured, ensuring high-quality inputs for the evaluation. The use of LLMs for generating hallucinated sounds and the distinction between explicit and implicit queries are innovative contributions that enhance the depth of the analysis.
The experimental setup is thorough, evaluating multiple state-of-the-art LALMs and providing clear metrics for attack success rates. The results demonstrate significant vulnerabilities in these models, with high ASR values indicating a pressing need for improved grounding mechanisms. The comparison of mitigation strategies, particularly the effectiveness of AHA-Guard, is a valuable addition that highlights practical implications for enhancing model reliability.
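Since the central metric is the attack success rate over adversarial QA pairs, a minimal sketch of how such a rate could be tallied is shown below. The judging callable is a placeholder assumption; the paper's exact grading protocol is not reproduced here.

```python
def attack_success_rate(qa_pairs, model_answer, is_hallucinated):
    """Fraction of adversarial QA pairs on which the model hallucinates.

    qa_pairs:        list of (audio, question) items
    model_answer:    callable (audio, question) -> answer string
    is_hallucinated: callable (answer, question) -> bool judging whether the
                     answer affirms or describes a sound absent from the audio
    """
    successes = 0
    for audio, question in qa_pairs:
        answer = model_answer(audio, question)
        if is_hallucinated(answer, question):
            successes += 1
    return successes / max(len(qa_pairs), 1)
```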
The paper provides sufficient detail regarding the experimental setup, including model selection and training procedures, which aids reproducibility. However, the absence of publicly accessible datasets or code limits the ease with which other researchers can replicate the study. Future work should consider releasing the datasets and methodologies used for generating AHA-Eval and AHA-Guard.
One limitation is the reliance on specific LALMs for generating hallucinated sounds, which may not generalize across all audio-language models. Additionally, while the evaluation metrics are well-defined, the subjective nature of audio perception may introduce variability in human assessments that are not fully addressed. The paper also does not explore the long-term implications of these vulnerabilities in real-world applications.
The findings have significant implications for the deployment of LALMs in practical applications, particularly in fields such as automated transcription, audio description, and interactive voice response systems. By highlighting the reliability gaps in these models, the research encourages the development of more robust audio grounding techniques, ultimately enhancing the safety and trustworthiness of AI systems in audio processing.
Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.
Primary: Sogang University
All Institutions: Sogang University, Institute of Information and Communications Technology Planning and Evaluation (IITP), National Research Foundation of Korea (NRF)
The main contribution of this work is the introduction of SR-CorrNet, a novel asymmetric encoder-decoder framework that improves speech separation in complex acoustic environments by leveraging spatio-spectro-temporal correlations and a dynamic split module. This research significantly advances the state-of-the-art in speech separation, providing a robust solution for real-world applications.
The proposed SR-CorrNet framework introduces a novel asymmetric encoder-decoder architecture that effectively addresses the limitations of late-split designs in speech separation tasks. By employing a separation-reconstruction strategy and a correlation-to-filter paradigm, the methodology enhances speaker discrimination and robustness in challenging acoustic environments. The incorporation of spatio-spectro-temporal correlations as input features is a significant advancement, allowing the model to leverage temporal and spatial dependencies more effectively. The dynamic split module further enhances the model's adaptability to varying speaker counts, which is crucial for real-world applications.
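The correlation-to-filter formulation estimates deep filters that are applied to the observed time-frequency representation to recover each target. The snippet below is a minimal sketch of applying a causal, per-bin deep filter over a few past frames; the tensor shapes and the filter order are assumptions rather than the paper's configuration.

```python
import torch

def apply_deep_filter(stft_mix, filters):
    """Apply per-TF-bin complex filters over past frames.

    stft_mix: complex tensor (batch, freq, time) -- observed mixture STFT
    filters:  complex tensor (batch, taps, freq, time) -- estimated filter taps
    """
    _, taps, _, _ = filters.shape
    out = torch.zeros_like(stft_mix)
    for k in range(taps):
        if k == 0:
            shifted = stft_mix
        else:
            # Delay the mixture by k frames, zero-padding at the start (causal).
            zeros = torch.zeros_like(stft_mix[..., :k])
            shifted = torch.cat([zeros, stft_mix[..., :-k]], dim=-1)
        out = out + filters[:, k] * shifted
    return out
```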
The experiments conducted on multiple datasets (WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS) demonstrate the effectiveness of the proposed method across different conditions, including anechoic, noisy-reverberant, and real-recorded environments. The results show consistent improvements over existing models, indicating the robustness and generalizability of SR-CorrNet. The use of objective metrics like SI-SNRi and SDRi provides a solid basis for evaluating performance, although subjective evaluations could further strengthen the findings.
The paper provides detailed descriptions of the model architecture, training procedures, and datasets used, which are essential for reproducibility. However, the absence of a public code repository or demo URL limits the ability for other researchers to replicate the experiments directly. Clearer documentation or supplementary materials could enhance reproducibility.
One limitation of the study is the lack of subjective evaluation metrics, such as human listening tests, which could provide insights into the perceptual quality of the separated audio. Additionally, while the dynamic split module shows promise, its performance in highly variable acoustic environments needs further validation. The model's complexity may also pose challenges in real-time applications.
The advancements in speech separation technology have significant implications for various applications, including automatic speech recognition, hearing aids, and communication systems in noisy environments. The ability to effectively separate overlapping speech can enhance the user experience in real-world scenarios, making this research highly relevant to both academia and industry.
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.
Primary: Northwestern Polytechnical University
All Institutions: Nanjing University, Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The paper presents Speaker-Reasoner, an innovative Speech LLM that effectively addresses the challenges of timestamped speaker-attributed ASR through agentic multi-turn reasoning and a speaker-aware cache. This work significantly advances the state of the art in multi-speaker audio understanding, demonstrating substantial improvements over existing models and offering valuable insights for future research in the field.
The methodology presented in the paper is innovative, leveraging an end-to-end Speech LLM architecture that integrates multi-turn temporal reasoning with a speaker-aware context cache. The iterative global-to-local processing approach is a significant departure from traditional single-pass models, addressing the challenges of overlapping speech and rapid turn-taking effectively. The three-stage progressive training strategy is well-conceived, allowing the model to learn complex interactions and maintain speaker consistency across long-form audio. However, the paper could benefit from a more detailed explanation of the training process and the specific mechanisms used for temporal reasoning.
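The speaker-aware cache is described as extending processing beyond the training context window while keeping speaker identities consistent. The class below is a minimal, assumed design for such a cache based on running speaker centroids and cosine matching; the threshold, naming, and update rule are illustrative and not taken from the paper.

```python
import numpy as np

class SpeakerAwareCache:
    """Minimal sketch of a speaker-aware cache (assumed design).

    Keeps a running embedding per speaker so identities stay consistent
    across audio chunks that exceed the model's context window.
    """
    def __init__(self, threshold=0.6):
        self.centroids = {}          # speaker_id -> running mean embedding
        self.counts = {}
        self.threshold = threshold

    def assign(self, embedding):
        embedding = embedding / (np.linalg.norm(embedding) + 1e-8)
        best_id, best_sim = None, -1.0
        for sid, centroid in self.centroids.items():
            sim = float(embedding @ centroid)
            if sim > best_sim:
                best_id, best_sim = sid, sim
        if best_id is None or best_sim < self.threshold:
            # Unseen voice: open a new speaker slot.
            best_id = f"spk{len(self.centroids) + 1}"
            self.centroids[best_id] = embedding
            self.counts[best_id] = 1
        else:
            # Known voice: update its running centroid.
            n = self.counts[best_id]
            merged = (self.centroids[best_id] * n + embedding) / (n + 1)
            self.centroids[best_id] = merged / (np.linalg.norm(merged) + 1e-8)
            self.counts[best_id] = n + 1
        return best_id
```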
The experiments are robust, utilizing two well-defined datasets (AliMeeting and AISHELL-4) that reflect real-world challenges in multi-speaker scenarios. The reported results show consistent improvements over strong baselines, particularly in metrics relevant to speaker attribution and transcription accuracy. The use of multiple evaluation metrics (DER, CER, cpCER) provides a comprehensive view of the model's performance. However, the paper lacks a thorough comparison with other state-of-the-art models beyond the immediate baselines, which would strengthen the claims of superiority.
The paper provides sufficient details regarding the model architecture, training procedures, and datasets, which are crucial for reproducibility. The use of established frameworks (e.g., MS-Swift, Megatron-LM) and the clear description of the training stages contribute positively to reproducibility. However, the absence of publicly available code or a demo limits the ease of replication by other researchers.
One limitation of the proposed model is its reliance on the quality of the training data, which may not generalize well to all multi-speaker environments. Additionally, while the speaker-aware cache is a novel approach, it may introduce complexity in managing speaker identities over long recordings. The performance on long-form audio without manual segmentation could also be a concern, as it may not perform as well in highly dynamic environments.
The implications of this research are significant, particularly for applications in meeting transcription, intelligent assistants, and any domain requiring accurate speaker attribution in multi-speaker contexts. The advancements in handling overlapping speech and rapid turn-taking could enhance the usability of speech recognition systems in real-world scenarios, leading to improved accessibility and communication tools.
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
Primary: Ben Gurion University, Be'er Sheva, Israel
All Institutions: Ben Gurion University, University of Haifa
The paper presents a novel split-and-conquer framework for detecting partial deepfake speech, significantly advancing the field of audio deepfake detection through improved localization and classification methodologies. The comprehensive evaluation of the proposed method demonstrates its potential to enhance security in voice-based systems while addressing the challenges posed by partial manipulations in speech.
The proposed split-and-conquer framework effectively decomposes the complex task of partial deepfake speech detection into two distinct stages: boundary detection and segment-level classification. This separation allows for a more focused learning objective, enhancing the model's ability to localize manipulated regions accurately. The use of a dedicated boundary detector to identify transition points is a significant methodological innovation, as it reduces the ambiguity and noise typically associated with joint localization and classification tasks. The introduction of a reflection-based multi-length training strategy is also noteworthy, as it generates diverse feature-space representations, improving robustness and performance across various temporal resolutions.
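The reflection-based multi-length training strategy converts variable-duration segments into several fixed input lengths. A minimal sketch of one plausible reflection-padding scheme is shown below; the exact mirroring rule and the target lengths are assumptions for illustration.

```python
import numpy as np

def reflect_to_length(segment, target_len):
    """Grow a 1-D audio segment to target_len samples by mirroring its content."""
    out = np.asarray(segment)
    assert len(out) > 0, "segment must be non-empty"
    while len(out) < target_len:
        needed = target_len - len(out)
        # Append a reversed copy (reflection) until the target length is reached.
        out = np.concatenate([out, out[::-1][:needed]])
    return out[:target_len]

# A single variable-length segment can then feed classifiers at several fixed lengths:
# views = [reflect_to_length(seg, n) for n in (16000, 32000, 64000)]
```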
The experiments conducted on the PartialSpoof and Half-Truth datasets demonstrate state-of-the-art performance, showcasing the effectiveness of the proposed method. The results indicate substantial improvements in both detection accuracy and localization capabilities, particularly at stricter evaluation criteria. The comprehensive evaluation across multiple configurations, feature extractors, and augmentation strategies provides a robust assessment of the method's performance, highlighting its generalization capabilities and robustness to boundary estimation errors.
The paper provides detailed descriptions of the experimental setup, including model architectures, training procedures, and evaluation metrics, which enhances reproducibility. The availability of a project repository on GitHub further supports reproducibility efforts, allowing other researchers to replicate the experiments and build upon the proposed framework.
Despite the strengths of the proposed method, there are notable limitations. The reliance on boundary prediction can introduce errors that propagate through the classification stage, particularly in challenging transition regions. Additionally, the assumption that manipulated content can be approximated by piecewise-uniform segments may not fully capture more gradual or subtle manipulations, which could limit the method's applicability in real-world scenarios.
The implications of this research are significant, particularly in the context of security-critical systems that rely on voice-based authentication and speaker verification. The ability to detect partial deepfake speech can enhance the integrity of communication systems and mitigate risks associated with audio deepfakes. Furthermore, the methodological advancements presented in this work may inspire further research in audio forensics and anti-spoofing technologies.
Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.
Primary: Martha Stewart Enterprises
All Institutions: Martha Stewart Enterprises, Allied Widgets Research
The main contribution of this paper is the introduction of DynFOA, a novel framework that synthesizes first-order ambisonics from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. This work significantly advances the state of spatial audio generation, addressing critical challenges in modeling complex acoustic environments.
The methodology presented in DynFOA is robust and innovative, integrating conditional diffusion modeling with 3D scene reconstruction to generate first-order ambisonics (FOA) from 360-degree videos. The approach effectively combines sound source localization, depth estimation, semantic segmentation, and material property extraction, which are critical for accurately modeling complex acoustic environments. The use of 3D Gaussian Splatting (3DGS) for scene reconstruction is a notable strength, as it allows for a detailed representation of the environment that informs the audio generation process. The conditional diffusion generator is well-structured, leveraging multimodal features for improved audio synthesis, which is a significant advancement over previous methods that lacked physical grounding.
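Because the target output is first-order ambisonics, it may help to recall how a mono source at a known direction maps to the four FOA channels. The sketch below uses a common SN3D-style convention (FuMa additionally scales W by 1/sqrt(2)); it is background context, not part of DynFOA itself.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics channels (W, X, Y, Z).

    Angles are in radians; normalization conventions differ across toolkits
    (here W is unscaled as in SN3D; FuMa would use W / sqrt(2)).
    """
    mono = np.asarray(mono, dtype=float)
    w = mono
    x = mono * np.cos(azimuth) * np.cos(elevation)
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    return np.stack([w, x, y, z])
```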
The experimental evaluation is thorough, with the introduction of the M2G-360 dataset specifically designed to test the model under challenging acoustic conditions. The paper presents a comprehensive set of experiments that demonstrate the superiority of DynFOA over existing methods in terms of spatial accuracy, acoustic fidelity, and user perception metrics. The results are compelling, showing significant improvements in performance metrics such as Direction of Arrival (DOA) estimation and Signal-to-Noise Ratio (SNR), which are critical for validating the model's effectiveness in real-world scenarios.
The paper provides detailed implementation specifics, including the architecture of the model, training protocols, and the datasets used. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. The reliance on a distributed computing cluster for training may also pose challenges for researchers with limited resources.
One limitation of the study is the reliance on a fixed set of HRTFs for binaural rendering, which may not account for individual differences in hearing or head-related transfer functions. Additionally, while the M2G-360 dataset is a significant contribution, it may still not encompass all possible acoustic environments, particularly outdoor settings or highly variable conditions. The model's performance in such scenarios remains to be evaluated.
The implications of this research are substantial, particularly for the fields of virtual reality and immersive media. By enabling the generation of high-fidelity spatial audio that accurately reflects complex acoustic environments, DynFOA has the potential to enhance user experiences in gaming, film, and virtual environments. The methodology could also inspire future research in audio synthesis and multimodal learning, paving the way for more advanced audio-visual integration techniques.
Personalized or target speech extraction (TSE) typically requires a clean enrollment utterance, which is hard to obtain in real-world crowded environments. We remove the need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings on standard clustering metrics. Conditioning these embeddings into multiple extraction back-ends consistently improves objective quality and intelligibility, and the approach generalizes to real DNS-Challenge recordings.
Primary: & Science University
All Institutions: & Science University, University of Michigan
This paper introduces a novel embedding-first approach to target speech extraction that eliminates the need for enrollment utterances, significantly enhancing the practicality of TSE systems in real-world environments. The methodology is innovative and well-executed, with promising experimental results that demonstrate its potential impact on the field of audio processing.
The paper presents a novel approach to target speech extraction (TSE) by eliminating the need for enrollment utterances, which is a significant limitation in practical applications. The authors propose a multi-speaker embedding encoder that directly maps noisy mixtures to a set of candidate speaker embeddings. This method utilizes permutation-invariant teacher supervision to ensure that the embeddings align with a single-speaker embedding space, thus maintaining structural integrity in the presence of noise and overlapping speech. The methodology is well-structured, leveraging existing frameworks like WavLM while innovating on the embedding extraction process. The use of a teacher-student model for training the embeddings is particularly noteworthy, as it enhances the robustness of the embeddings against noise.
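The permutation-invariant teacher supervision can be illustrated with a small cosine-distance loss that searches over assignments between predicted candidate embeddings and teacher embeddings. The cosine formulation and the brute-force permutation search below are assumptions made for clarity, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F
from itertools import permutations

def pit_embedding_loss(pred, teacher):
    """Permutation-invariant alignment loss between candidate and teacher embeddings.

    pred:    (num_candidates, dim) embeddings predicted from the mixture
    teacher: (num_speakers, dim) single-speaker teacher embeddings,
             with num_speakers <= num_candidates
    """
    pred = F.normalize(pred, dim=-1)
    teacher = F.normalize(teacher, dim=-1)
    best = None
    for perm in permutations(range(pred.shape[0]), teacher.shape[0]):
        # Cosine distance between each teacher embedding and its assigned candidate.
        loss = (1.0 - (pred[list(perm)] * teacher).sum(dim=-1)).mean()
        if best is None or loss < best:
            best = loss
    return best
```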
The experimental setup is thorough, utilizing both synthetic datasets (LibriMix) and real-world recordings (DNS Challenge) to evaluate the proposed method. The authors provide a comprehensive set of metrics for assessing the quality of the embeddings and the performance of the TSE systems, including clustering accuracy and standard speech enhancement metrics (SI-SDR, PESQ, STOI). The results demonstrate that the proposed embeddings significantly improve TSE performance compared to traditional methods, indicating the effectiveness of the approach. However, the paper could benefit from more detailed comparisons with a broader range of existing methods to contextualize its contributions further.
The paper outlines the architecture and training procedures in sufficient detail, allowing for reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the results fully. Including a link to a GitHub repository or similar would enhance reproducibility and facilitate further research in this area.
One limitation of the study is the focus on a maximum of three speakers, which may not generalize well to environments with a higher number of overlapping speakers. Additionally, while the paper discusses the robustness of the embeddings, it does not extensively address potential failure cases, such as when speakers have similar voice characteristics or when the background noise is particularly challenging.
The proposed method has significant implications for real-world applications in personal audio devices, such as hearing aids and smart speakers, where the ability to isolate a target speaker in noisy environments is crucial. By removing the need for enrollment, the approach enhances usability and accessibility, making it easier for users to interact with technology in everyday situations. The research could also inspire further innovations in multi-speaker systems and applications in areas such as teleconferencing and assistive technologies.
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To address these problems, we propose FastTurn, a unified framework for low-latency and robust turn detection. To reduce latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
Primary: QualiaLabs
All Institutions: QualiaLabs
FastTurn presents a unified framework for low-latency and robust turn detection in full-duplex dialogue systems. The technical contributions, particularly in integrating acoustic and semantic cues, represent a meaningful advancement in the field of audio processing and dialogue systems, with potential applications in various real-time communication scenarios.
The methodology presented in FastTurn is innovative, combining streaming CTC decoding with acoustic features to enhance turn detection in full-duplex dialogue systems. The architecture is well-structured, comprising three main components that progressively integrate semantic and acoustic cues. The use of a four-stage training pipeline is commendable, as it stabilizes the optimization process and aligns speech and text modalities effectively. However, the reliance on CTC for initial transcription raises concerns about potential error propagation in noisy environments.
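FastTurn's semantic cue comes from streaming CTC decoding over partial audio. The helper below shows the standard greedy collapse rule (drop blanks, merge repeats) applied chunk by chunk, which is one common way to obtain a partial hypothesis with low latency; it is generic background, not the paper's decoder.

```python
import numpy as np

def greedy_ctc_stream(posteriors_chunk, prev_token, blank_id=0):
    """Collapse one chunk of frame-level CTC posteriors into newly emitted tokens.

    posteriors_chunk: (frames, vocab) array for the newest audio chunk
    prev_token:       last frame-level token from the previous chunk (carry-over)
    Returns (emitted_token_ids, new_prev_token).
    """
    emitted = []
    for frame in posteriors_chunk:
        tok = int(np.argmax(frame))
        # Emit only when the token is not blank and differs from the previous frame.
        if tok != blank_id and tok != prev_token:
            emitted.append(tok)
        prev_token = tok
    return emitted, prev_token
```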
The experiments are thorough, utilizing a diverse set of datasets and a comprehensive evaluation framework. The introduction of a new test set with realistic human dialogue scenarios is a significant contribution, allowing for better assessment of the model's performance in practical applications. The results demonstrate that FastTurn outperforms existing baselines in terms of accuracy and latency, underscoring its effectiveness. However, the paper could benefit from additional comparisons with more recent models in the field to contextualize its performance.
The paper provides sufficient details regarding the model architecture, training strategy, and evaluation metrics, which aids in reproducibility. However, the absence of publicly available code or a demo could hinder independent verification of results. Clear instructions for reproducing the experiments would enhance the paper's impact.
One limitation is the potential sensitivity of the model to CTC errors, especially in overlapping speech scenarios. Additionally, while the model shows robustness in various conditions, the performance on English datasets did not meet expectations, indicating a need for further optimization. The paper also does not address the computational resources required for training and inference, which could be a barrier for broader adoption.
The FastTurn framework has significant implications for real-time spoken dialogue systems, particularly in applications requiring low-latency interaction, such as virtual assistants and customer service bots. By improving turn detection, it can enhance user experience and facilitate more natural conversations. The release of the new dataset also opens avenues for future research in dialogue systems, potentially leading to advancements in multimodal interaction technologies.
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the waveform via a neural vocoder, along with a predictive branch that performs spectrogram-domain enhancement, providing complementary cues. Outputs from both branches are fused by a post-processing module, which also performs bandwidth extension to generate the enhanced waveform at 48 kHz, later downsampled to the original sampling rate. This generative-predictive fusion improves robustness and perceptual quality, achieving top performance in the blind-test phase and ranking 1st in the objective evaluation. Audio examples are available at https://xiaobin-rong.github.io/gap-urgenet_demo.
Primary: Nanjing University
All Institutions: Nanjing University
The main contribution of this paper is the introduction of GAP-URGENet, a novel generative-predictive fusion framework for universal speech enhancement that demonstrates state-of-the-art performance in the ICASSP 2026 URGENT Challenge. This work significantly advances the field of speech enhancement by effectively integrating generative and predictive methodologies, providing a comprehensive solution to improve speech quality across diverse conditions.
The methodology presented in GAP-URGENet is innovative, combining generative and predictive models to enhance speech quality effectively. The generative branch focuses on full-stack speech restoration using self-supervised learning, while the predictive branch enhances the spectrogram domain, allowing for complementary improvements. The fusion of outputs from both branches through a post-processing module is a significant contribution, particularly the bandwidth extension to achieve high-quality waveforms. The architecture is well-structured, leveraging existing models like DeWavLM and TF-GridNet, which indicates a thoughtful integration of prior work with novel enhancements.
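The post-processing module fuses the generative and predictive branch outputs and performs bandwidth extension at 48 kHz before downsampling to the original rate. The toy function below only illustrates the blend-then-resample flow; the fixed blend weight and simple length alignment stand in for what is, in the paper, a learned module.

```python
import numpy as np
from scipy.signal import resample_poly

def fuse_and_resample(gen_wav, pred_wav, orig_sr, work_sr=48000, alpha=0.5):
    """Blend two branch outputs at the 48 kHz working rate, then downsample."""
    n = min(len(gen_wav), len(pred_wav))
    fused = alpha * np.asarray(gen_wav[:n]) + (1.0 - alpha) * np.asarray(pred_wav[:n])
    # Polyphase resampling from the working rate back to the original rate.
    return resample_poly(fused, orig_sr, work_sr)
```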
The experimental setup is robust, utilizing comprehensive datasets from the URGENT Challenge, which enhances the credibility of the results. The paper reports substantial improvements over baseline models, with detailed metrics provided for various objective evaluations (DNSMOS, NISQA, UTMOS, etc.), showcasing the effectiveness of the proposed framework. The results indicate that GAP-URGENet achieves superior performance in both objective and subjective evaluations, validating the proposed approach.
The paper provides sufficient details regarding the architecture, training process, and datasets used, which facilitates reproducibility. However, the absence of a public code repository limits the ease of reproduction for other researchers. Including a link to the code or detailed implementation instructions would enhance reproducibility significantly.
While the paper demonstrates impressive results, it does not address potential limitations such as the computational cost of the model, the need for extensive training data, or the model's performance in real-world applications outside the challenge context. Additionally, the reliance on specific architectures may limit generalizability to other tasks or domains.
The implications of this research extend to various applications in speech enhancement, including telecommunications, assistive technologies for the hearing impaired, and voice recognition systems. By improving speech quality in challenging conditions, the framework can enhance user experience across multiple platforms, making it a valuable contribution to the field of audio processing.
Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet's interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.
Primary: National University of Singapore
All Institutions: National University of Singapore, Shenzhen Loop Area Institute, Nanjing University, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong
The paper presents PhiNet, a self-interpretable speaker verification network that enhances transparency in decision-making by leveraging phonetic evidence. This contribution is significant as it addresses the critical need for interpretability in automatic speaker verification systems, bridging the gap between ASV and forensic speaker comparison.
The proposed PhiNet framework introduces a novel approach to speaker verification by integrating phonetic interpretability into the decision-making process. The architecture is designed to provide both local and global interpretability, allowing users to understand the contribution of individual phonemes to the verification score. This is achieved through a phonetic trait extractor and a decision layer that weights phonetic contributions based on their distinctiveness. The methodology is well-structured, leveraging existing neural network techniques while innovatively adapting them to enhance interpretability in ASV systems.
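To make the phoneme-level weighting concrete, the sketch below aggregates per-phoneme cosine similarities into an utterance score while exposing each phoneme's contribution. The dictionary interface and the default weighting are assumptions, not the paper's implementation.

```python
import numpy as np

def phonetic_verification_score(traits_a, traits_b, weights):
    """Aggregate per-phoneme similarities into an utterance-level score.

    traits_a / traits_b: dicts mapping phoneme -> embedding for two utterances
    weights:             dict mapping phoneme -> distinctiveness weight
    Only phonemes observed in both utterances contribute.
    """
    shared = set(traits_a) & set(traits_b)
    num, den = 0.0, 0.0
    contributions = {}
    for ph in shared:
        a, b = traits_a[ph], traits_b[ph]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        w = weights.get(ph, 1.0)
        contributions[ph] = sim          # local, per-phoneme explanation
        num += w * sim
        den += w
    score = num / den if den > 0 else 0.0
    return score, contributions
```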
The experiments conducted on benchmark datasets such as VoxCeleb, SITW, and LibriSpeech demonstrate that PhiNet achieves competitive performance compared to traditional black-box ASV models. The evaluation metrics, including equal error rate (EER) and minimum detection cost function (minDCF), provide a solid basis for performance comparison. Additionally, the paper includes qualitative assessments of interpretability through visualizations and leave-ith-phoneme-out experiments, which substantiate the claims of enhanced interpretability.
The authors provide a GitHub repository with the code for PhiNet, which is essential for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameter settings and data preprocessing steps, to facilitate easier reproduction of results by other researchers.
One limitation is the potential for cognitive bias in phoneme weighting, which could affect the model's interpretability and robustness. Additionally, while the framework shows promise, the reliance on phonetic traits may limit its generalizability to diverse speaker populations or languages not represented in the training data. The paper also does not address the computational complexity of the model, which may hinder real-time applications.
The integration of phonetic interpretability into ASV systems has significant implications for high-accountability applications, such as forensic analysis and security. By providing interpretable results, PhiNet can enhance user trust in automated systems and facilitate error tracing in speaker verification tasks. This work could pave the way for more transparent AI systems in sensitive applications, contributing positively to the field of machine learning and audio processing.
Recent ECG-language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical leads or temporal segments may be missing due to electrode detachment, motion artifacts, or signal corruption, causing severe degradation of cross-modal semantic alignment. In this paper, we propose SCAR, a robust ECG-language pretraining framework for Semantic Compensation via Adversarial Removal. SCAR improves robustness by explicitly training the model to remain semantically aligned under semantically critical missingness and to recover diagnostic meaning from the remaining visible evidence. Specifically, we introduce a differentiable adversarial masker to remove the most alignment-critical spatio-temporal ECG tokens during training, forcing the ECG encoder to learn representations that remain semantically aligned with clinical text even when primary diagnostic evidence is missing. Under such adversarial corruption, we equip the ECG encoder with a semantically supervised adaptive selector that learns to reweight the remaining visible tokens and compensate with secondary yet diagnostically informative morphological cues. To evaluate robustness beyond classification accuracy, we further introduce the Counterfactual Missingness Resolution Score (CMRS), which quantifies how well features preserve diagnostic semantics under missingness. Experiments on six datasets show that SCAR consistently improves semantic robustness under joint lead and temporal missingness, with particularly clear advantages in harder cases where primary diagnostic evidence is unavailable, while also yielding stronger linear-probing transferability.
Primary: University of Science and Technology Beijing
All Institutions: School of Intelligence Science and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing
The paper presents SCAR, a robust ECG--language pretraining framework that enhances zero-shot ECG diagnosis by explicitly addressing the challenges posed by missing data through innovative adversarial techniques. The methodology and results contribute meaningfully to the field of machine learning in healthcare, particularly in improving the robustness of diagnostic models under real-world conditions.
The proposed SCAR framework introduces a novel approach to address the challenge of missing ECG data during zero-shot diagnosis by employing adversarial masking to force the model to learn robust representations. The methodology is well-structured, utilizing a differentiable adversarial masker and a semantically supervised adaptive selector, which collectively enhance the model's ability to maintain semantic alignment even under partial observation. The introduction of the Counterfactual Missingness Resolution Score (CMRS) as a metric for evaluating robustness adds significant value to the methodology, allowing for a more nuanced assessment of performance under missingness.
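The adversarial masker removes the spatio-temporal ECG tokens most critical for alignment during training. The snippet below is a hard top-k stand-in for that idea, masking tokens with the highest alignment-importance scores; the scoring signal and the mask ratio are assumptions, and the paper's masker is differentiable rather than a hard top-k.

```python
import torch

def adversarial_topk_mask(token_scores, mask_ratio=0.2):
    """Build a binary mask that removes the tokens most critical for ECG-text alignment.

    token_scores: (batch, num_tokens) alignment-importance scores, e.g.
                  gradients of the contrastive similarity w.r.t. each token.
    Returns a (batch, num_tokens) mask with 0 at removed positions.
    """
    _, num_tokens = token_scores.shape
    k = max(1, int(mask_ratio * num_tokens))
    topk = token_scores.topk(k, dim=-1).indices
    mask = torch.ones_like(token_scores)
    mask.scatter_(-1, topk, 0.0)   # zero out the most alignment-critical tokens
    return mask
```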
The experiments are comprehensive, utilizing six datasets to validate the effectiveness of SCAR against existing baselines. The results demonstrate significant improvements in both zero-shot classification performance and robustness under various missingness scenarios, particularly highlighting the advantages of the proposed methods in harder cases where primary diagnostic evidence is absent. The ablation studies effectively illustrate the contributions of each component of the framework, reinforcing the robustness of the findings.
The paper provides sufficient implementation details, including training protocols, dataset descriptions, and evaluation metrics, which support reproducibility. However, the absence of a publicly available code repository or demo limits the ease of reproduction for external researchers.
One limitation is the reliance on specific datasets for training and evaluation, which may affect the generalizability of the results to other ECG datasets or clinical settings. Additionally, while the proposed methods show improvements, the paper does not extensively discuss the computational costs associated with the adversarial masking and adaptive selection processes during training.
The implications of this work are significant for clinical practice, as it addresses a common issue in ECG analysis: missing data due to various artifacts. The ability to maintain diagnostic accuracy under such conditions can enhance the reliability of ECG-based diagnoses in real-world scenarios, potentially leading to better patient outcomes. The framework could also inspire further research in robust multimodal learning across other medical domains.
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target speech length. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese, T5Gemma-TTS achieves a statistically significant speaker-similarity gain on Japanese over XTTSv2 (0.677 vs. 0.622; non-overlapping 95% confidence intervals) and the highest numerical Korean speaker similarity (0.747) despite Korean not being included in training, although this margin over XTTSv2 (0.741) is not statistically conclusive. It also attains the lowest numerical Japanese character error rate among five baselines (0.126), though this ranking should be interpreted cautiously because of partial confidence-interval overlap with Kokoro. English results on LibriSpeech should be viewed as an upper-bound estimate because LibriHeavy is a superset of LibriSpeech. Using the same checkpoint, disabling PM-RoPE at inference causes near-complete synthesis failure: CER degrades from 0.129 to 0.982 and duration accuracy drops from 79% to 46%. Code and weights are available at https://github.com/Aratako/T5Gemma-TTS.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Graduate School of Engineering, Third Intelligence, Matsuo Institute, Department of Technology Management for Innovation
The main contribution of this work is the development of T5Gemma-TTS, a novel encoder-decoder model that enhances multilingual zero-shot text-to-speech synthesis through innovative architectural improvements and rigorous experimental validation. This research represents a meaningful advancement in the field of speech synthesis, addressing key challenges and setting a foundation for future exploration in multilingual and cross-lingual applications.
The paper introduces T5Gemma-TTS, an encoder-decoder model that effectively addresses the limitations of autoregressive decoder-only architectures by maintaining persistent text conditioning through cross-attention mechanisms. The integration of Progress-Monitoring Rotary Position Embedding (PM-RoPE) is a significant methodological advancement, allowing for improved duration control during speech synthesis. The model's architecture is well-founded on the T5Gemma pretrained backbone, which enhances its linguistic capabilities without requiring phoneme conversion. The methodology is robust and clearly articulated, demonstrating a thoughtful approach to overcoming existing challenges in zero-shot TTS.
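PM-RoPE is described as injecting normalized progress signals into the cross-attention rotary embeddings so the decoder can track how far it is through the target speech. The sketch below only shows how a decoder step might be mapped to a normalized progress position before the usual RoPE angle computation; the binning, scaling, and where the signal enters the attention are assumptions, not the paper's specification.

```python
import torch

def progress_position(step, target_len, num_bins=1024):
    """Map decoder steps to normalized-progress positions in [0, num_bins)."""
    progress = torch.clamp(step.float() / float(max(target_len, 1)), 0.0, 1.0)
    return (progress * (num_bins - 1)).long()

def rope_angles(positions, dim, base=10000.0):
    """Standard rotary-embedding angles computed from (progress) positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions.to(torch.float32)[..., None] * inv_freq   # (..., dim // 2)
```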
The experimental evaluation is comprehensive, involving a substantial training dataset of 170,000 hours of multilingual speech. The results indicate statistically significant improvements in speaker similarity and character error rates compared to existing models. The paper provides detailed comparisons against multiple baselines, showcasing the model's effectiveness across different languages, including Japanese, Chinese, and Korean. The use of confidence intervals adds rigor to the statistical claims, although some results should be interpreted cautiously due to overlapping intervals.
The authors have made the model weights and code publicly available, which is a positive step towards reproducibility. However, the paper would benefit from more detailed implementation specifics and hyperparameter settings to facilitate easier replication of results by other researchers.
The paper acknowledges several limitations, including higher word error rates on unseen European languages and a real-time factor that may not meet the demands of real-time applications. Additionally, the authors note that the model's performance on certain metrics may be influenced by the codec's limitations, indicating areas for future improvement.
The potential for misuse of zero-shot voice cloning technology is a significant concern, as highlighted by the authors. They emphasize the need for ethical considerations and safeguards in deploying such technologies, which is crucial given the implications for privacy and security. The authors advocate for responsible use and further research into detection methods for synthetic speech.
Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initially validated on Italian, to investigate these dimensions using a Chinese Mandarin dataset with Electroencephalography (EEG) recordings. We systematically fuse read speech with spontaneous speech across different emotional valences (positive, neutral, negative) to investigate whether emotional arousal is a more critical factor than valence polarity in enhancing detection performance in speech. Additionally, we establish the first neurophysiological validation for a speech-based depression model by correlating its predictions with neural oscillatory patterns during emotional face processing. Our results demonstrate strong cross-linguistic generalizability of the CDMA framework, achieving state-of-the-art performance (F1-score up to 89.6%) on the Chinese dataset, which is comparable to the previous Italian validation. Critically, emotionally valenced speech (both positive and negative) significantly outperformed neutral speech. This comparable performance between positive and negative tasks supports the emotional arousal hypothesis. Most importantly, EEG analysis revealed significant correlations between the model's speech-derived depression estimates and neural oscillatory patterns (theta and alpha bands), demonstrating alignment with established neural markers of emotional dysregulation in depression. This alignment, combined with the model's cross-linguistic robustness, not only supports the CDMA framework as a universally applicable and neurobiologically validated strategy but also establishes a novel paradigm for the neurophysiological validation of computational mental health models.
Primary: Zhejiang University
All Institutions: Zhejiang University, Università della Campania "Luigi Vanvitelli", UKRI, EPSRC, National Natural Science Foundation of China, State Key Laboratory of Brain-Machine Intelligence
This study provides a novel approach to depression detection by integrating speech analysis and neurophysiological validation, demonstrating the critical role of emotional arousal over valence in enhancing detection performance. The methodology and results contribute significantly to the field of computational mental health, offering a framework that is both innovative and applicable across linguistic boundaries.
The paper employs a robust methodology by extending the Cross-Data Multilevel Attention (CDMA) framework to a new linguistic context (Chinese Mandarin) and integrating EEG data for neurophysiological validation. The fusion of read and spontaneous speech across emotional valences is a significant methodological advancement, allowing for a nuanced understanding of emotional arousal in depression detection. The attention mechanisms used are well-justified and effectively enhance the model's performance.
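To make the neurophysiological validation step concrete, the sketch below shows one conventional way to correlate per-subject model outputs with EEG band power (theta or alpha), using Welch PSD estimates and a Pearson correlation. This reflects standard practice rather than the paper's exact analysis pipeline; the array layouts and the two-second Welch window are assumptions.

```python
import numpy as np
from scipy.signal import welch
from scipy.stats import pearsonr

def band_power(eeg: np.ndarray, fs: float, lo: float, hi: float) -> float:
    """Average Welch PSD of one EEG channel inside the [lo, hi] Hz band."""
    f, pxx = welch(eeg, fs=fs, nperseg=int(2 * fs))
    mask = (f >= lo) & (f <= hi)
    return float(pxx[mask].mean())

def correlate_scores_with_band(depression_scores, eeg_trials, fs, band=(4.0, 8.0)):
    """Pearson correlation between per-subject model scores and band power.

    band=(4, 8) targets theta; alpha would be roughly (8, 13). Assumes the
    i-th score and the i-th EEG trial belong to the same subject/condition.
    """
    powers = [band_power(trial, fs, *band) for trial in eeg_trials]
    return pearsonr(depression_scores, powers)
```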
The experiments are comprehensive, utilizing a well-defined dataset (MODMA) and employing rigorous cross-validation techniques. The reported F1-scores (up to 89.6%) demonstrate state-of-the-art performance, and the inclusion of EEG analysis adds a layer of validation that strengthens the findings. The statistical comparisons between different emotional contexts and their impact on detection performance are well-articulated.
The paper provides detailed descriptions of the data acquisition, preprocessing, and model training processes, which supports reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
The study acknowledges limitations such as the modest sample size for EEG recordings and the correlational nature of the findings, which precludes causal inferences. Additionally, the lack of information regarding participants' medication status and comorbidities could influence the results.
The findings have significant implications for clinical practices in mental health, particularly in developing objective diagnostic tools for depression that can be applied across different languages. The neurophysiological validation of speech-based models could pave the way for more interpretable and trustworthy AI systems in mental health assessment.
Audio-Visual Navigation (AVN) requires an embodied agent to navigate toward a sound source by utilizing both vision and binaural audio. A core challenge arises in complex acoustic environments, where binaural cues become intermittently unreliable, particularly when generalizing to previously unheard sound categories. To address this, we propose RAVN (Reliability-Aware Audio-Visual Navigation), a framework that conditions cross-modal fusion on audio-derived reliability cues, dynamically calibrating the integration of audio and visual inputs. RAVN introduces an Acoustic Geometry Reasoner (AGR) that is trained with geometric proxy supervision. Using a heteroscedastic Gaussian NLL objective, AGR learns observation-dependent dispersion as a practical reliability cue, eliminating the need for geometric labels during inference. Additionally, we introduce Reliability-Aware Geometric Modulation (RAGM), which converts the learned cue into a soft gate to modulate visual features, thereby mitigating cross-modal conflicts. We evaluate RAVN on SoundSpaces using both Replica and Matterport3D environments, and the results show consistent improvements in navigation performance, with notable robustness in the challenging unheard sound setting.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The paper presents a significant advancement in audio-visual navigation through the introduction of RAVN, a reliability-aware framework that enhances navigation performance in complex acoustic environments. The methodology is innovative, and the empirical results demonstrate its effectiveness, marking a meaningful contribution to the field of embodied AI and multimodal integration.
The paper introduces a novel framework, RAVN, that effectively integrates reliability-aware geometric fusion for audio-visual navigation. The methodology is well-structured, leveraging an Acoustic Geometry Reasoner (AGR) to derive reliability cues from audio inputs and a Reliability-Aware Geometric Modulation (RAGM) mechanism to adaptively gate visual features based on these cues. This approach is innovative in its use of heteroscedastic Gaussian NLL objectives to model uncertainty, which is a significant advancement over traditional static fusion methods. The design is theoretically sound and aligns well with human-like decision-making processes in ambiguous auditory environments.
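The abstract names the training objective explicitly (a heteroscedastic Gaussian NLL), so a minimal sketch of that loss and of one plausible way to turn the predicted dispersion into a soft reliability gate is given below; the gating function and temperature are illustrative assumptions, not the paper's RAGM parameterization.

```python
import numpy as np

def hetero_gaussian_nll(pred_mean: np.ndarray, pred_logvar: np.ndarray,
                        target: np.ndarray) -> float:
    """Heteroscedastic Gaussian NLL: the reasoner predicts a geometric proxy
    (e.g., source range/bearing) plus its log-variance, so frames with
    unreliable binaural cues can be absorbed by a larger predicted variance."""
    var = np.exp(pred_logvar)
    return float(0.5 * (pred_logvar + (target - pred_mean) ** 2 / var).mean())

def reliability_gate(pred_logvar: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Turn predicted dispersion into a soft gate in (0, 1): low variance means
    the audio-derived geometry is trusted, high variance pushes the gate toward
    zero so visual features dominate (an assumed parameterization, not RAGM's)."""
    return 1.0 / (1.0 + np.exp(pred_logvar / temperature))
```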
The experimental setup is robust, utilizing two well-known datasets (Replica and Matterport3D) that provide a comprehensive evaluation of the proposed method's performance. The results demonstrate significant improvements in navigation success rates, particularly in challenging scenarios involving unheard sounds. The quantitative metrics (Success Rate, Success weighted by Path Length, and Success weighted by Number of Actions) are appropriate and effectively illustrate the advantages of the proposed method over existing baselines. Qualitative results further support the claims, showing improved trajectory following and decision-making stability.
The paper provides sufficient detail regarding the experimental setup, including training protocols, dataset descriptions, and evaluation metrics. However, the absence of a publicly available code repository limits full reproducibility. Future work should consider releasing the code and models to facilitate further research and validation of the findings.
One limitation is the reliance on simulated environments, which may not fully capture the complexities of real-world acoustic conditions. Additionally, while the framework shows promise, its performance in extremely noisy or dynamic environments remains untested. The paper also does not address potential computational overhead introduced by the reliability-aware mechanisms, which could affect real-time applications.
The proposed framework has significant implications for the development of more robust embodied agents capable of navigating complex environments. Applications could extend to robotics, autonomous vehicles, and assistive technologies, enhancing their ability to operate in real-world scenarios where audio-visual cues are unreliable. The focus on reliability-aware fusion could lead to advancements in human-robot interaction and improve the safety and efficiency of autonomous systems.
While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary multi-objective score fusion framework that jointly minimizes detection error and system complexity. We explore two encodings optimized by NSGA-II: binary-coded detector selection for score averaging and a real-valued scheme that optimizes detector weights for a weighted sum. Experiments on the ASVspoof 5 dataset with 36 SSL-based detectors show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines. The real-valued variant achieves 2.37% EER (0.0684 minDCF) and identifies configurations that match state-of-the-art performance while significantly reducing system complexity, requiring only half the parameters. Our method also provides a diverse set of trade-off solutions, enabling deployment choices that balance accuracy and computational cost.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Czech Science Foundation, e-INFRA CZ project, Ministry of Education, Youth and Sports of the Czech Republic
The paper presents a novel multi-objective evolutionary framework for fusing deepfake speech detectors, achieving state-of-the-art performance while significantly reducing system complexity. This work is a substantial contribution to the field of audio machine learning, providing a comprehensive approach to tackle the challenges posed by deepfake technologies.
The paper introduces an innovative multi-objective evolutionary framework for fusing deepfake speech detectors using NSGA-II, addressing the critical balance between detection accuracy and system complexity. It explores two encoding strategies (binary-coded detector selection and real-valued weight optimization), demonstrating a systematic approach to ensemble learning that is both effective and efficient. The methodology is well-structured, leveraging evolutionary algorithms to navigate the trade-offs inherent in deepfake detection.
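As a concrete picture of what the real-valued variant optimizes, the sketch below defines the two objectives such a multi-objective search would evaluate for one candidate weight vector: a threshold-sweep approximation of the EER of the weighted-sum fusion, and the parameter count of detectors that receive non-negligible weight. The 1e-3 pruning threshold and the EER approximation are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """Threshold-sweep approximation of the equal error rate
    (labels: 1 = bona fide, 0 = spoof)."""
    best = 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # spoofs accepted
        frr = np.mean(scores[labels == 1] < t)    # bona fide rejected
        best = min(best, max(far, frr))
    return float(best)

def fusion_objectives(weights: np.ndarray, detector_scores: np.ndarray,
                      labels: np.ndarray, detector_params: np.ndarray):
    """The two objectives a multi-objective optimizer (e.g., NSGA-II) minimizes
    for one candidate weight vector: error of the weighted-sum fusion and the
    total size of the detectors that actually contribute to it."""
    fused = detector_scores @ weights                  # (N,) fused scores
    active = weights > 1e-3                            # prune negligible weights
    return eer(fused, labels), float(detector_params[active].sum())
```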
The authors conduct extensive experiments on the ASVspoof 5 dataset, utilizing a diverse pool of 36 SSL-based detectors. The results are robust, showcasing the superiority of the proposed methods over traditional fusion techniques, including simple averaging and logistic regression. The achieved EER of 2.37% indicates a significant performance improvement while reducing system complexity, underscoring the effectiveness of the proposed approach.
The paper provides detailed implementation information, including parameter settings, computational resources, and the use of a GitHub repository for code access. The thoroughness of the experimental setup and the availability of the code enhance the reproducibility of the results, allowing other researchers to validate and build upon this work.
While the proposed method is effective, it is limited by its reliance on score-level fusion, which may overlook deeper interactions that could be exploited through joint fine-tuning of the models. Additionally, the performance is constrained by the quality of the underlying detectors, suggesting that optimizing these base models could further enhance the fusion outcomes.
This research has significant implications for the field of deepfake detection, particularly in enhancing the robustness and efficiency of voice biometric systems. The ability to balance performance and complexity in detector fusion can lead to more practical applications in security and authentication, addressing the growing concerns surrounding deepfake technology.
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an interactive mobile system for real-time soundscape mediation that selectively attenuates bothersome sounds while preserving desired audio. Sona is built on a target-conditioned neural pipeline that supports simultaneous attenuation of multiple overlapping sound sources, overcoming the single-target limitation of prior systems. It runs in real time on-device and supports user-extensible sound classes through in-situ audio examples, without retraining. Sona is informed by a formative study with 68 noise-sensitive individuals. Through technical benchmarking and an in-situ study with 10 participants, we show that Sona achieves low-latency, multi-target attenuation suitable for live listening, and enables meaningful reductions in bothersome sounds while maintaining awareness of surroundings. These results point toward a new class of personal AI systems that support comfort and social participation by mediating real-world acoustic environments.
Primary: University of Michigan
All Institutions: University of Michigan, University of California, Irvine
The main contribution of this paper is the development of Sona, an interactive mobile system that enables real-time, multi-target sound attenuation for individuals with noise sensitivity. This work represents a meaningful advancement in audio processing and accessibility technology, with the potential to significantly improve the daily experiences of users in noisy environments.
The methodology employed in Sona is innovative, utilizing a target-conditioned neural pipeline that allows for real-time attenuation of multiple overlapping sound sources. This is a significant advancement over existing systems that typically focus on single-target noise cancellation. The incorporation of user-extensible sound classes through in-situ examples without the need for retraining is a notable feature that enhances user personalization and adaptability. The formative study involving 68 noise-sensitive individuals provides a solid foundation for understanding user needs and preferences, which is crucial for the design of the system.
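The selective-attenuation idea can be pictured as a re-mixing step downstream of a target-conditioned separator: each recognized sound class gets a user-chosen gain and the streams are summed back together. The dictionary interface and class names below are purely hypothetical, since Sona's actual API is not described in detail.

```python
import numpy as np

def mediate_soundscape(source_estimates: dict, gains: dict) -> np.ndarray:
    """Re-mix separated sources with per-class user gains.

    source_estimates: {"dog_bark": waveform, "speech": waveform, ...} produced
    by a target-conditioned separator; gains: user-chosen attenuation in [0, 1]
    per class. Both the class names and this dict interface are hypothetical.
    """
    mixed = None
    for name, wav in source_estimates.items():
        g = gains.get(name, 1.0)                 # untouched classes pass through
        mixed = g * wav if mixed is None else mixed + g * wav
    return mixed

# Toy usage: attenuate barking while keeping speech audible.
t = np.linspace(0, 1, 16000)
sources = {"speech": np.sin(2 * np.pi * 220 * t), "dog_bark": np.sin(2 * np.pi * 700 * t)}
out = mediate_soundscape(sources, {"dog_bark": 0.1})
```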
The experimental evaluation is robust, featuring both technical benchmarking and an in-situ study with 10 participants. The results demonstrate low-latency performance and effective sound attenuation while preserving desired audio, which is critical for maintaining situational awareness. The use of subjective measures to assess user comfort and soundscape mediation effectiveness adds credibility to the findings. However, the small sample size in the in-situ study may limit the generalizability of the results.
The paper does not provide explicit details regarding the implementation or access to the code, which raises concerns about reproducibility. While the methodology is described, without a publicly available implementation or detailed algorithmic descriptions, it may be challenging for other researchers to replicate the results or build upon this work.
One limitation is the small participant size in the in-situ study, which may not adequately represent the broader population of noise-sensitive individuals. Additionally, while the system allows for user-defined sound classes, the effectiveness of the system in highly dynamic or complex sound environments remains to be fully evaluated. There may also be challenges in the real-world application of the technology, such as varying user preferences and environmental conditions.
The potential applications of Sona are significant, particularly for individuals with noise sensitivity, including those with neurodivergent conditions. By enabling users to manage their auditory environments, Sona could enhance comfort and social participation, leading to improved quality of life. The implications extend beyond personal use, as the technology could be adapted for various settings, including workplaces, educational environments, and public spaces.
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
Primary: unknown
All Institutions: unknown
FineLAP presents a novel training paradigm that effectively combines heterogeneous supervision for fine-grained audio-language pretraining. The comprehensive methodology and robust experimental validation position it as a significant contribution to the field of audio understanding, with potential applications across diverse domains.
The methodology presented in FineLAP is innovative, addressing the challenge of heterogeneous supervision in audio-language models. The introduction of a dual-stream sigmoid loss and a decoupled audio projector allows for effective learning from both clip- and frame-level annotations. This approach is well-justified, as it leverages the strengths of existing models while introducing novel components that enhance performance across various tasks. The use of cluster-based sampling for negative phrases is particularly noteworthy, as it mitigates the scarcity of frame-level annotations and improves the model's ability to generalize.
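The dual-stream sigmoid loss is not specified in detail above. As a rough sketch of the kind of objective each stream could use, the snippet below implements a SigLIP-style pairwise sigmoid loss over a batch of audio and text embeddings, where the match matrix encodes either clip-level caption pairs or frame-level event-phrase pairs. The scale and bias values are illustrative.

```python
import numpy as np

def sigmoid_contrastive_loss(audio_emb: np.ndarray, text_emb: np.ndarray,
                             match: np.ndarray, scale: float = 10.0,
                             bias: float = -5.0) -> float:
    """SigLIP-style pairwise sigmoid loss over an audio/text batch.

    audio_emb: (Ba, D) and text_emb: (Bt, D) L2-normalized embeddings;
    match[i, j] = 1 if text j describes audio i (a clip caption or a
    frame-level event phrase), else 0.
    """
    logits = scale * audio_emb @ text_emb.T + bias        # (Ba, Bt)
    z = np.where(match > 0, 1.0, -1.0)
    # -log(sigmoid(z * logits)) written stably as log(1 + exp(-z * logits))
    return float(np.logaddexp(0.0, -z * logits).mean())
```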
The experiments conducted are extensive and demonstrate the effectiveness of FineLAP across multiple audio understanding tasks, achieving state-of-the-art results. The evaluation includes a variety of benchmarks, and the ablation studies provide clear insights into the contributions of each component of the model. The results are compelling, showing significant improvements over existing methods, particularly in sound event detection and audio-text retrieval.
The paper provides sufficient implementation details, including training parameters and dataset descriptions, which are crucial for reproducibility. The authors also commit to releasing the code and dataset, which enhances the potential for other researchers to replicate and build upon their work.
Despite its strengths, FineLAP has limitations, such as its inability to handle variable-length audio inputs, which restricts its applicability in scenarios requiring long-form audio processing. Additionally, the focus on sound event detection may overlook other temporally grounded tasks, indicating areas for future exploration.
The advancements made in FineLAP have significant implications for audio understanding and multimodal learning, particularly in applications such as automated audio captioning, sound event detection, and audio editing. The model's ability to leverage heterogeneous data could lead to more robust and flexible audio-language systems, potentially benefiting various industries, including entertainment, accessibility, and security.
Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. In PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. In LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
Primary: College of Innovation and Technology, University of Michigan-Flint
All Institutions: College of Innovation and Technology, University of Michigan-Flint
The main contribution of this paper is the introduction of TRACE, a training-free framework for detecting partial audio deepfakes by analyzing the dynamics of speech foundation model embeddings. This work represents a significant advancement in audio forensics, offering a novel methodology that challenges traditional supervised detection approaches and opens new avenues for research in the field.
The proposed TRACE framework introduces a novel approach to detecting partial audio deepfakes without the need for training or labeled data. By analyzing the first-order dynamics of frozen speech foundation model representations, the methodology cleverly leverages the inherent properties of genuine speech versus manipulated audio. This is a significant departure from traditional supervised methods, showcasing a fresh perspective on audio forensics. However, the paper could benefit from a more detailed explanation of the embedding trajectory analysis and its computational efficiency.
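The core signal TRACE relies on, abrupt jumps in frame-level embedding trajectories, is simple enough to sketch directly: compute first-order differences of the frozen foundation-model features and score the utterance by its largest normalized jump. The z-score statistic below is an assumption; the paper's exact scoring function may differ.

```python
import numpy as np

def trajectory_disruption_score(embeddings: np.ndarray) -> float:
    """Score an utterance by how abruptly its frame embeddings move.

    embeddings: (T, D) frame-level features from a frozen speech foundation
    model. Genuine speech tends to drift smoothly, while a splice boundary
    shows up as an outlier jump in the first-order differences.
    """
    deltas = np.linalg.norm(np.diff(embeddings, axis=0), axis=1)   # (T-1,)
    z = (deltas - deltas.mean()) / (deltas.std() + 1e-8)           # per-utterance normalization
    return float(z.max())   # higher means more likely to contain a splice
```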
The experiments are well-structured, evaluating TRACE on four benchmarks across two languages and using six different speech foundation models. The results demonstrate competitive performance against fine-tuned supervised baselines, particularly in challenging scenarios like LlamaPartialSpoof. However, the paper lacks comprehensive details on the datasets used, such as their sizes and the specific characteristics of the audio samples, which would enhance the understanding of the evaluation's robustness.
The paper does not provide sufficient details regarding the implementation of TRACE, such as the specific configurations of the speech foundation models used or the exact procedures for embedding trajectory analysis. This lack of detail may hinder reproducibility, as other researchers may struggle to replicate the results without clear guidelines or code availability.
One limitation is the reliance on the performance of existing speech foundation models, which may vary in quality and robustness. Additionally, while the training-free approach is innovative, it may not generalize well to all forms of audio manipulation beyond the tested benchmarks. The paper also does not address potential adversarial attacks against the proposed detection method.
The implications of TRACE are significant for the field of audio forensics, particularly in combating misinformation and enhancing the integrity of audio content. The training-free nature of the method could facilitate its adoption in real-world applications where rapid detection is critical, such as in media verification and security. However, further exploration of its applicability across diverse audio manipulation techniques is necessary.
Speech separation in realistic acoustic environments remains challenging because overlapping speakers, background noise, and reverberation must be resolved simultaneously. Although recent time-frequency (TF) domain models have shown strong performance, most still rely on late-split architectures, where speaker disentanglement is deferred to the final stage, creating an information bottleneck and weakening discriminability under adverse conditions. To address this issue, we propose SR-CorrNet, an asymmetric encoder-decoder framework that introduces the separation-reconstruction (SepRe) strategy into a TF dual-path backbone. The encoder performs coarse separation from mixture observations, while the weight-shared decoder progressively reconstructs speaker-discriminative features with cross-speaker interaction, enabling stage-wise refinement. To complement this architecture, we formulate speech separation as a structured correlation-to-filter problem: spatio-spectro-temporal correlations computed from the observations are used as input features, and the corresponding deep filters are estimated to recover target signals. We further incorporate an attractor-based dynamic split module to adapt the number of output streams to the actual speaker configuration. Experimental results on WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS demonstrate consistent improvements across anechoic, noisy-reverberant, and real-recorded conditions in both single- and multi-channel settings, highlighting the effectiveness of TF-domain SepRe with correlation-based filter estimation for speech separation.
Primary: Sogang University
All Institutions: Sogang University, Institute of Information and Communications Technology Planning and Evaluation (IITP), National Research Foundation of Korea (NRF)
The main contribution of this work is the introduction of SR-CorrNet, a novel asymmetric encoder-decoder framework that improves speech separation in complex acoustic environments by leveraging spatio-spectro-temporal correlations and a dynamic split module. This research significantly advances the state-of-the-art in speech separation, providing a robust solution for real-world applications.
The proposed SR-CorrNet framework introduces a novel asymmetric encoder-decoder architecture that effectively addresses the limitations of late-split designs in speech separation tasks. By employing a separation-reconstruction strategy and a correlation-to-filter paradigm, the methodology enhances speaker discrimination and robustness in challenging acoustic environments. The incorporation of spatio-spectro-temporal correlations as input features is a significant advancement, allowing the model to leverage temporal and spatial dependencies more effectively. The dynamic split module further enhances the model's adaptability to varying speaker counts, which is crucial for real-world applications.
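The correlation-to-filter formulation ends in a deep-filtering step, where estimated complex taps are applied over a local time-frequency neighborhood of the mixture. The sketch below shows that application step for a purely temporal context; the context size, tap layout, and variable names are assumptions rather than the paper's exact design.

```python
import numpy as np

def apply_deep_filter(mixture_stft: np.ndarray, filters: np.ndarray,
                      context: int = 2) -> np.ndarray:
    """Apply estimated complex filter taps over a local time context.

    mixture_stft: (T, F) complex spectrogram; filters: (T, F, 2*context+1)
    complex taps predicted per target speaker. Each output bin is a weighted
    sum of the mixture at neighboring frames.
    """
    T, F = mixture_stft.shape
    padded = np.pad(mixture_stft, ((context, context), (0, 0)))
    out = np.zeros((T, F), dtype=complex)
    for k in range(2 * context + 1):          # k indexes the time offset k - context
        out += filters[:, :, k] * padded[k:k + T, :]
    return out
```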
The experiments conducted on multiple datasets (WSJ0-2/3/4/5Mix, WHAMR!, and LibriCSS) demonstrate the effectiveness of the proposed method across different conditions, including anechoic, noisy-reverberant, and real-recorded environments. The results show consistent improvements over existing models, indicating the robustness and generalizability of SR-CorrNet. The use of objective metrics like SI-SNRi and SDRi provides a solid basis for evaluating performance, although subjective evaluations could further strengthen the findings.
The paper provides detailed descriptions of the model architecture, training procedures, and datasets used, which are essential for reproducibility. However, the absence of a public code repository or demo URL limits the ability for other researchers to replicate the experiments directly. Clearer documentation or supplementary materials could enhance reproducibility.
One limitation of the study is the lack of subjective evaluation metrics, such as human listening tests, which could provide insights into the perceptual quality of the separated audio. Additionally, while the dynamic split module shows promise, its performance in highly variable acoustic environments needs further validation. The model's complexity may also pose challenges in real-time applications.
The advancements in speech separation technology have significant implications for various applications, including automatic speech recognition, hearing aids, and communication systems in noisy environments. The ability to effectively separate overlapping speech can enhance the user experience in real-world scenarios, making this research highly relevant to both academia and industry.
Large Audio Language Models (LALMs) achieve strong performance on audio-language tasks; however, their reliability in real-world settings remains underexplored. We introduce Audio Hallucination Attacks (AHA), together with an evaluation suite, AHA-Eval, comprising 6.5K QA pairs designed to test whether LALMs genuinely ground their responses in the audio input. AHA targets two attack surfaces: (i) query-based attacks, which exploit question structure to induce hallucinations about absent sounds, and (ii) audio-based attacks, which inject synthetic speech describing non-existent events into the audio stream. Evaluating state-of-the-art LALMs, including Audio Flamingo 3 and Gemini 3 Pro, we observe high attack success rates of 95.35% and 79.65%, respectively, revealing a reliability gap that is hidden by standard benchmark performance. To mitigate this, we propose a 120K QA post-alignment dataset, AHA-Guard, which successfully reduces attack success rates by up to 49%.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park
The paper introduces Audio Hallucination Attacks (AHA), a framework for evaluating audio hallucinations in LALMs through innovative query-based and audio-based attack methodologies. This work is significant as it not only identifies critical vulnerabilities in state-of-the-art models but also proposes effective mitigation strategies, paving the way for more reliable audio-language models in real-world applications.
The methodology is robust, introducing a novel attack suite (AHA-Eval) that effectively evaluates the reliability of Large Audio Language Models (LALMs) through a systematic approach. The dual focus on query-based and audio-based attacks is particularly insightful, allowing for a comprehensive assessment of model vulnerabilities. The data curation and filtering process is well-structured, ensuring high-quality inputs for the evaluation. The use of LLMs for generating hallucinated sounds and the distinction between explicit and implicit queries are innovative contributions that enhance the depth of the analysis.
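The headline metric is the attack success rate; a simplified proxy for it is sketched below, under the assumption that the benchmark records which hallucinated event each question or injected utterance refers to. Substring matching stands in for the stronger judging procedure (e.g., an LLM judge) the paper likely uses.

```python
def attack_success_rate(responses, hallucinated_events):
    """Fraction of model answers that affirm an event absent from the audio.

    responses: one model answer per attacked question; hallucinated_events:
    for each question, the phrases naming the injected/non-existent sound.
    Simple substring matching stands in for a proper judge here.
    """
    hits = sum(
        any(ev.lower() in resp.lower() for ev in events)
        for resp, events in zip(responses, hallucinated_events)
    )
    return hits / len(responses)
```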
The experimental setup is thorough, evaluating multiple state-of-the-art LALMs and providing clear metrics for attack success rates. The results demonstrate significant vulnerabilities in these models, with high ASR values indicating a pressing need for improved grounding mechanisms. The comparison of mitigation strategies, particularly the effectiveness of AHA-Guard, is a valuable addition that highlights practical implications for enhancing model reliability.
The paper provides sufficient detail regarding the experimental setup, including model selection and training procedures, which aids reproducibility. However, the absence of publicly accessible datasets or code limits the ease with which other researchers can replicate the study. Future work should consider releasing the datasets and methodologies used for generating AHA-Eval and AHA-Guard.
One limitation is the reliance on specific LALMs for generating hallucinated sounds, which may not generalize across all audio-language models. Additionally, while the evaluation metrics are well-defined, the subjective nature of audio perception may introduce variability in human assessments that are not fully addressed. The paper also does not explore the long-term implications of these vulnerabilities in real-world applications.
The findings have significant implications for the deployment of LALMs in practical applications, particularly in fields such as automated transcription, audio description, and interactive voice response systems. By highlighting the reliability gaps in these models, the research encourages the development of more robust audio grounding techniques, ultimately enhancing the safety and trustworthiness of AI systems in audio processing.
We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
Primary: Meituan LongCat Team
All Institutions: Meituan LongCat Team
LongCat-AudioDiT presents a significant advancement in text-to-speech synthesis through its innovative approach in the waveform latent space and the introduction of adaptive projection guidance. The comprehensive experimental results and the release of code and model weights contribute to its potential impact on the field, although further exploration of its limitations and broader applicability is warranted.
The methodology presented in LongCat-AudioDiT is innovative, particularly in its non-autoregressive diffusion-based approach to text-to-speech synthesis. By operating directly in the waveform latent space rather than relying on intermediate representations like mel-spectrograms, the authors have simplified the TTS pipeline significantly. The introduction of adaptive projection guidance to replace traditional classifier-free guidance is a noteworthy advancement that enhances generation quality. The paper also addresses a critical training-inference mismatch, showcasing a thoughtful approach to improving model performance. Overall, the methodology is robust and well-structured, with clear innovations that set it apart from existing models.
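The paper's adaptive projection guidance is not reproduced here; the sketch below contrasts plain classifier-free guidance with a generic projected-guidance variant from the diffusion literature, which splits the guidance update into components parallel and orthogonal to the conditional prediction and rescales the parallel part. The parallel_weight knob and the specific decomposition are assumptions, not LongCat-AudioDiT's formulation.

```python
import numpy as np

def cfg(eps_cond: np.ndarray, eps_uncond: np.ndarray, w: float) -> np.ndarray:
    """Plain classifier-free guidance."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def projected_guidance(eps_cond: np.ndarray, eps_uncond: np.ndarray, w: float,
                       parallel_weight: float = 0.0) -> np.ndarray:
    """Generic projected guidance: decompose the guidance update into parts
    parallel and orthogonal to the conditional prediction and rescale the
    parallel part (parallel_weight=0 drops it entirely). Illustrative only."""
    diff = eps_cond - eps_uncond
    v = eps_cond.ravel()
    parallel = (np.dot(diff.ravel(), v) / (np.dot(v, v) + 1e-8)) * eps_cond
    orthogonal = diff - parallel
    return eps_cond + (w - 1.0) * (orthogonal + parallel_weight * parallel)
```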
The experimental evaluation is thorough, with the authors providing comprehensive results that demonstrate the effectiveness of LongCat-AudioDiT. The paper reports state-of-the-art performance on the Seed benchmark for zero-shot voice cloning, with significant improvements in speaker similarity scores. The use of ablation studies to validate the proposed modules adds credibility to the findings. However, the absence of high-quality human-annotated datasets may limit the generalizability of the results, although the authors mitigate this by achieving competitive intelligibility.
The authors mention that code and model weights are released, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed implementation guidelines and hyperparameter settings to facilitate easier replication of the results by other researchers.
One limitation identified is the reliance on a single benchmark (Seed) for evaluation, which may not fully capture the model's performance across diverse TTS tasks. Additionally, the findings regarding the Wav-VAE's reconstruction fidelity not correlating with TTS performance could indicate a need for further exploration into the underlying mechanisms affecting performance.
The potential applications of LongCat-AudioDiT are significant, particularly in areas requiring high-fidelity speech synthesis, such as virtual assistants, audiobooks, and voice cloning technologies. The model's ability to operate without complex multi-stage training pipelines could democratize access to high-quality TTS systems, fostering innovation in various industries.
MambaVoiceCloning (MVC) asks whether the conditioning path of diffusion-based TTS can be made fully SSM-only at inference, removing all attention and explicit RNN-style recurrence layers across text, rhythm, and prosody, while preserving or improving quality under controlled conditions. MVC combines a gated bidirectional Mamba text encoder, a Temporal Bi-Mamba supervised by a lightweight alignment teacher discarded after training, and an Expressive Mamba with AdaLN modulation, yielding linear-time O(T) conditioning with bounded activation memory and practical finite look-ahead streaming. Unlike prior Mamba-TTS systems that remain hybrid at inference, MVC removes attention-based duration and style modules under a fixed StyleTTS2 mel-diffusion-vocoder backbone. Trained on LJSpeech/LibriTTS and evaluated on VCTK, CSS10 (ES/DE/FR), and long-form Gutenberg passages, MVC achieves modest but statistically reliable gains over StyleTTS2, VITS, and Mamba-attention hybrids in MOS/CMOS, F0 RMSE, MCD, and WER, while reducing encoder parameters to 21M and improving throughput by 1.6x. Diffusion remains the dominant latency source, but SSM-only conditioning improves memory footprint, stability, and deployability.
Primary: Chuzhou University
All Institutions: Chuzhou University, Yeshiva University
The main contribution of this paper is the introduction of MambaVoiceCloning, a novel TTS system that utilizes a fully state-space model conditioning path to improve efficiency and stability while eliminating traditional attention mechanisms. This work significantly advances the field of TTS by addressing key limitations in existing architectures, offering a promising direction for future research and applications.
The methodology of MambaVoiceCloning (MVC) is innovative in its approach to eliminate attention and recurrent layers in a text-to-speech (TTS) system, utilizing a fully state-space model (SSM) conditioning path. The integration of gated bidirectional Mamba text encoders, Temporal Bi-Mamba, and Expressive Mamba with AdaLN modulation showcases a significant architectural redesign that emphasizes efficiency and stability. The linear-time complexity and bounded activation memory are particularly noteworthy, as they address common issues in traditional TTS systems, such as memory pressure and drift in long sequences. The paper provides a clear explanation of the architecture and its components, supported by rigorous theoretical grounding.
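Of the components listed above, the AdaLN modulation in the Expressive Mamba block is the most standard; a minimal sketch is shown below, where a style vector is projected to per-channel scale and shift applied after layer normalization. The projection weights and shapes are hypothetical.

```python
import numpy as np

def ada_layer_norm(x: np.ndarray, style: np.ndarray, w_scale: np.ndarray,
                   w_shift: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Adaptive LayerNorm: normalize per frame, then modulate with a scale and
    shift projected from a style/prosody vector (hypothetical projections).

    x: (T, D) hidden states; style: (S,); w_scale, w_shift: (S, D).
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    x_hat = (x - mu) / (sigma + eps)
    gamma = style @ w_scale        # (D,) per-channel scale offset
    beta = style @ w_shift         # (D,) per-channel shift
    return (1.0 + gamma) * x_hat + beta
```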
The experimental evaluation is comprehensive, utilizing multiple datasets including LJSpeech, LibriTTS, VCTK, and CSS10, which allows for a robust assessment of MVC's performance across various conditions. The paper reports both subjective (MOS, CMOS) and objective metrics (F0 RMSE, MCD, WER), demonstrating statistically significant improvements over baseline models. The inclusion of long-form and cross-lingual evaluations further strengthens the findings, showcasing the model's generalization capabilities. However, while the improvements are statistically reliable, they are described as modest, indicating room for further enhancement.
The authors provide detailed implementation and training protocols, ensuring that the methodology can be reproduced. The use of a unified optimization schedule across all models and the provision of code on GitHub enhances reproducibility. However, the paper could benefit from more explicit details regarding hyperparameter tuning and the specific configurations used for each model.
The paper acknowledges limitations such as the focus on conditioning efficiency over fine-grained emotion control, and the model's training solely on English datasets, which may affect its performance in multilingual contexts. Additionally, the diffusion decoder remains the primary latency bottleneck, which could hinder real-time applications.
The MVC framework has potential implications for real-time TTS applications, particularly in scenarios requiring efficient memory usage and low latency. Its architecture could serve as a drop-in replacement for existing TTS systems, enhancing their deployability in resource-constrained environments. The focus on ethical considerations, such as watermarking and speaker consent, is commendable and highlights the responsible deployment of AI technologies.
RuASD (Russian AntiSpoofing Dataset) is a dedicated, reproducible benchmark for Russian-language speech anti-spoofing designed to evaluate both in-domain discrimination and robustness to deployment-style distribution shifts. It combines a large spoof subset synthesized using 37 modern Russian-capable TTS and voice-cloning systems with a bona fide subset curated from multiple heterogeneous open Russian speech corpora, enabling systematic evaluation across diverse data sources. To emulate typical dissemination and channel effects in a controlled and reproducible manner, RuASD includes configurable simulations of platform and transmission distortions, including room reverberation, additive noise/music, and a range of speech-codec transcodings implemented via a unified processing chain. We benchmark a diverse set of publicly available anti-spoofing countermeasures spanning lightweight supervised architectures, graph-attention models, SSL-based detectors, and large-scale pretrained systems, and report reference results on both clean and simulated conditions to characterize robustness under realistic perturbation pipelines. The dataset is publicly available on Hugging Face (https://huggingface.co/datasets/MTUCI/RuASD) and ModelScope (https://modelscope.cn/datasets/lab260/RuASD).
Primary: Moscow Technical University of Communications and Informatics
All Institutions: Moscow Technical University of Communications and Informatics
The main contribution of this paper is the introduction of the RuASD dataset and benchmark for evaluating Russian-language speech anti-spoofing systems. This work is significant as it addresses a critical gap in the literature by providing a comprehensive and reproducible framework for assessing the robustness of anti-spoofing technologies in a language that is often underrepresented in the research community.
The paper presents a comprehensive methodology for evaluating anti-spoofing systems in the Russian language, which includes the creation of the RuASD dataset. This dataset is notable for its large size and diversity, combining synthetic and authentic speech samples, and simulating various channel distortions. The inclusion of configurable simulations for platform and transmission distortions is a significant methodological advancement, allowing for more realistic testing conditions. The benchmarking of various anti-spoofing models, including lightweight architectures and large-scale pretrained systems, is well-structured and provides a solid foundation for comparison.
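One link of the simulated channel chain, additive noise at a controlled SNR, is easy to sketch; reverberation (convolution with a room impulse response) and codec transcoding would be chained similarly. The resize-based noise tiling below is an implementation convenience, not necessarily what RuASD's pipeline does.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix additive noise at a target SNR (one stage of a channel-simulation chain)."""
    noise = np.resize(noise, speech.shape)            # tile/crop noise to speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```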
The experimental setup is thorough, with a clear delineation of the conditions under which the anti-spoofing systems were evaluated. The results are reported on both clean and simulated conditions, which is crucial for understanding the robustness of the models. The paper effectively characterizes the performance of different models under realistic perturbation pipelines, providing valuable insights into their generalization capabilities.
The authors have made the RuASD dataset publicly available, which is a positive step towards reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameters and specific configurations used for each model. This would enhance the ability of other researchers to replicate the results.
One limitation is the focus on Russian-language speech, which may restrict the applicability of the findings to other languages and dialects. Additionally, while the dataset includes a variety of distortions, it may not encompass all possible real-world scenarios, potentially limiting the generalizability of the results. The paper does not address the computational resources required for training the evaluated models, which could be a barrier for some researchers.
The RuASD initiative has the potential to significantly impact the field of speech anti-spoofing, particularly in Russian-speaking regions. By providing a robust benchmark and dataset, it encourages further research and development of more effective anti-spoofing technologies. The methods and findings could also inform practices in related fields such as voice recognition and security systems.
The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in the context of multimodal dialogue. While current methods impressively generate realistic dialogue in isolated modalities like speech or vision, challenges remain in controllable Multimodal Dialogue Generation (MDG). This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming for expressive dialogue generation through multimodal conditional control. To address the insufficient richness and diversity of dialogue expressiveness in existing datasets, we introduce a novel multimodal dialogue annotation pipeline to curate dialogues from movies and TV series with fine-grained annotations of interactional characteristics. The resulting MM-Dia dataset (360+ hours, 54,700 dialogues) facilitates explicitly controlled MDG, specifically through style-controllable dialogue speech synthesis. In parallel, MM-Dia-Bench (309 highly expressive dialogues with visible single-/dual-speaker scenes) serves as a rigorous testbed for implicit cross-modal MDG control, evaluating audio-visual style consistency across modalities. Extensive experiments demonstrate that training on MM-Dia significantly enhances fine-grained controllability, while evaluations on MM-Dia-Bench reveal limitations in current frameworks to replicate the nuanced expressiveness of human interaction. These findings provide new insights and challenges for multimodal conditional dialogue generation.
Primary: Unknown
All Institutions: Unknown
This paper presents a significant advancement in multimodal dialogue generation by introducing a comprehensive dataset and evaluation framework that enhances controllability and expressiveness. The methodology and experimental results provide valuable insights into the challenges of replicating human interaction in AI-generated dialogue, paving the way for future research in this area.
The paper introduces a novel multimodal dialogue annotation pipeline that curates dialogues from movies and TV series with fine-grained annotations. This approach is significant as it addresses the limitations of existing datasets in terms of expressiveness and diversity. The methodology for generating the MM-Dia dataset and the MM-Dia-Bench testbed is well-articulated, focusing on both explicit and implicit cross-modal control. However, the paper could benefit from a more detailed explanation of the annotation process and the specific criteria used for dialogue selection.
The experiments conducted demonstrate the effectiveness of the MM-Dia dataset in enhancing controllability in multimodal dialogue generation. The evaluation metrics used, while not explicitly detailed in the abstract, are crucial for assessing the performance of the proposed models. The results indicate that current frameworks struggle to replicate the nuanced expressiveness of human interaction, highlighting an important area for future research. However, the paper could improve by providing more comprehensive quantitative results and comparisons with baseline models.
The paper does not provide sufficient details on the implementation of the models or the datasets used, which raises concerns about reproducibility. Clearer guidelines or links to supplementary materials would enhance the ability of other researchers to replicate the findings.
One significant limitation is the reliance on dialogue from movies and TV series, which may not fully capture the diversity of real-world interactions. Additionally, the paper acknowledges limitations in current frameworks to replicate human expressiveness, suggesting that further work is needed to bridge this gap.
The findings of this research have the potential to significantly impact the field of multimodal dialogue systems, particularly in applications such as virtual assistants, interactive storytelling, and entertainment. By improving controllability and expressiveness in dialogue generation, this work could lead to more engaging and human-like interactions in AI systems.