Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment with the necessary tools to address equitable training data selection and front-end fine-tuning.
Primary: German Research Center for Artificial Intelligence (DFKI)
All Institutions: German Research Center for Artificial Intelligence (DFKI), University of Stuttgart, National Institute of Informatics, Technical University of Berlin
The main contribution of this paper is the introduction of DeepFense, a comprehensive, modular, and extensible framework for robust deepfake audio detection that facilitates reproducible research and addresses critical biases in model performance. This work significantly advances the field by providing a standardized toolkit that enhances the ability to benchmark and compare deepfake detection models effectively.
The methodology presented in DeepFense is robust and well-structured, focusing on creating a modular and extensible framework for deepfake audio detection. The use of a configuration-driven design allows for easy experimentation and reproducibility, which is a significant advancement in the field. The large-scale evaluation of more than 400 models and the provision of over 100 recipes enhance the toolkit's utility for researchers. The modular architecture facilitates the isolation of algorithmic innovations from implementation artifacts, which is critical for accurate benchmarking.
The experimental evaluation is extensive, covering a large-scale comparison of 400 models across 13 datasets, which is a notable strength of the paper. The results provide valuable insights into the impact of front-end feature extractors, back-end architectures, and training datasets on model performance. The findings regarding biases in model performance based on audio quality, speaker gender, and language are particularly important for ensuring equitable AI systems.
The paper emphasizes reproducibility through its open-source nature and the provision of a comprehensive toolkit that allows other researchers to replicate experiments easily. The use of a single YAML file for experiment configuration is a strong point, as it simplifies the process of sharing and reproducing results.
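The configuration-driven, registry-based design described above can be sketched in a few lines. Everything below is an illustrative placeholder, not DeepFense's actual API: the registry keys, `build_from_config`, and the config shape are hypothetical, and a parsed YAML file is stood in for by a plain dict (e.g. the output of `yaml.safe_load`).

```python
# Hypothetical sketch of a config-driven experiment setup; not DeepFense's API.
MODEL_REGISTRY = {}

def register_model(name):
    """Decorator that adds a model builder to the registry under a string key."""
    def wrapper(fn):
        MODEL_REGISTRY[name] = fn
        return fn
    return wrapper

@register_model("aasist")
def build_aasist(hidden_dim=64):
    # Stand-in for constructing a real architecture.
    return {"arch": "aasist", "hidden_dim": hidden_dim}

@register_model("lcnn")
def build_lcnn(hidden_dim=32):
    return {"arch": "lcnn", "hidden_dim": hidden_dim}

def build_from_config(config):
    """Instantiate a model from a parsed config dict (what yaml.safe_load would return)."""
    spec = config["model"]
    builder = MODEL_REGISTRY[spec["name"]]
    return builder(**spec.get("params", {}))

# A single YAML experiment file reduces to a plain dict like this:
config = {"model": {"name": "aasist", "params": {"hidden_dim": 128}}}
model = build_from_config(config)
print(model)  # → {'arch': 'aasist', 'hidden_dim': 128}
```

The point of such a design is that swapping architectures, losses, or augmentations is a one-line config change rather than a code change, which is what makes recipes shareable and results reproducible.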
While the paper presents a significant advancement, it acknowledges limitations such as the lack of a multi-dataset training pipeline and the focus solely on detection tasks. These limitations suggest areas for future research, including the need for more comprehensive training strategies that can mitigate biases.
The implications of this work are substantial, particularly in the context of increasing concerns about deepfake technology and its potential misuse. By providing a standardized toolkit for deepfake detection, DeepFense can help improve the robustness of systems used in real-world applications, thereby enhancing security and trust in voice biometric systems.
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
Primary: CUHK MMLab
All Institutions: CUHK MMLab
The main contribution of this paper is the development of AURA, a novel framework that enables continuous video stream processing for real-time question answering and proactive interaction. This work significantly advances the field of VideoLLMs by addressing key limitations of existing systems and providing a robust platform for future research and applications.
The AURA framework presents a comprehensive end-to-end approach for real-time video understanding and interaction. It effectively integrates context management and data construction, which are crucial for maintaining continuity in long-horizon interactions. The methodology is well-structured, addressing the limitations of existing VideoLLMs by providing a unified model that supports both real-time question answering and proactive responses. The incorporation of ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems at a reasonable frame rate demonstrates a practical application of the proposed methods.
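The unified design described above, in which the model itself decides when to speak rather than relying on a decoupled trigger-response pipeline, can be illustrated with a toy sketch. Everything here is hypothetical and greatly simplified relative to AURA's actual architecture; `run_stream` and the toy "model" exist only for illustration.

```python
# Toy illustration of a unified streaming loop (not AURA's implementation):
# every frame updates the context, and the same model both understands the
# stream and decides whether a proactive response is warranted.
def run_stream(frames, model):
    context, transcript = [], []
    for t, frame in enumerate(frames):
        context.append(frame)
        reply = model(context)       # None means "stay silent this frame"
        if reply is not None:
            transcript.append((t, reply))
    return transcript

# Toy 'model': speaks proactively when a 'door' frame appears in the stream.
toy = lambda ctx: "someone is at the door" if ctx[-1] == "door" else None
print(run_stream(["hall", "hall", "door", "hall"], toy))  # → [(2, 'someone is at the door')]
```

The contrast with a decoupled pipeline is that there is no separate trigger module deciding *when* to invoke the model; the per-frame forward pass produces both the decision and the content.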
The experiments conducted show that AURA achieves state-of-the-art performance on relevant streaming benchmarks, which is a significant accomplishment. A comprehensive assessment would ideally combine both subjective and objective evaluation measures. The paper could also benefit from a more detailed breakdown of the datasets used and their characteristics, as well as comparisons with other contemporary systems.
The paper mentions the release of the AURA model and a real-time inference framework, which is a positive step towards reproducibility. However, further details regarding the training process, hyperparameters, and the specific configurations used in experiments would enhance reproducibility efforts. Clear documentation and access to code would be essential for other researchers to replicate the findings.
One limitation is the reliance on specific hardware (80G accelerators) for achieving the reported performance, which may not be accessible to all researchers. Additionally, while the system is designed for real-time interaction, the practical implications of latency and response times in diverse real-world scenarios are not fully explored. The paper could also discuss potential biases in the data or limitations in the model's understanding of complex interactions.
AURA has significant potential applications in various fields, including education, healthcare, and entertainment, where real-time video interaction is valuable. By enabling continuous observation and interaction, it could enhance user experiences in virtual environments and assistive technologies. The release of the model and framework could foster further research and development in real-time video understanding systems.
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
Primary: National Taiwan University
All Institutions: National Taiwan University, NVIDIA Taiwan
The main contribution of this paper is the introduction of a joint fullband-subband modeling framework for high-resolution SingFake detection, which significantly enhances detection performance by leveraging the unique characteristics of singing voice audio. The methodology is innovative and addresses a pressing need in the field of audio forensics, making it a valuable addition to the literature.
The paper introduces a novel joint fullband-subband modeling framework, Sing-HiResNet, which effectively captures both global and localized spectral features for high-resolution SingFake detection. The methodology is well-structured, employing a two-phase approach that integrates fullband and subband models, and explores various fusion strategies to enhance detection performance. The use of high-resolution audio (44.1 kHz) is a significant advancement over conventional methods, and the systematic evaluation of subband contributions adds depth to the methodology. However, the paper could benefit from clearer explanations of the fusion strategies and their implications.
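The subband-splitting and score-fusion ideas discussed above can be sketched abstractly. This is a toy illustration of the general technique, not Sing-HiResNet's implementation: the spectrogram shape, the contiguous-band split, and the averaging fusion are all made-up stand-ins for whatever the paper actually uses.

```python
# Toy sketch of fullband-subband processing (illustrative, not the paper's code).
def split_subbands(spec, n_bands):
    """Split a (freq_bins x frames) magnitude 'spectrogram' into contiguous
    frequency subbands; each subband would feed its own expert branch."""
    step = len(spec) // n_bands
    return [spec[i * step:(i + 1) * step] for i in range(n_bands)]

def fuse(scores, weights=None):
    """Late fusion of per-branch detection scores by (weighted) averaging,
    one simple instance of the fusion strategies such a framework compares."""
    weights = weights or [1 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))

spec = [[0.1] * 4 for _ in range(8)]   # toy: 8 frequency bins x 4 frames
bands = split_subbands(spec, 4)
print(len(bands), len(bands[0]))       # → 4 2  (four subbands of two bins each)
print(fuse([0.9, 0.2, 0.4, 0.7]))      # unweighted average of branch scores
```

The intuition matches the paper's claim: synthesis artifacts are unevenly distributed across the spectrum, so per-band experts can specialize before their evidence is fused with the fullband view.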
The experiments are robust, utilizing the WildSVDD dataset to benchmark the proposed method against existing state-of-the-art systems. The results demonstrate a significant performance improvement over traditional 16 kHz models, achieving a state-of-the-art EER of 1.58%. The comparative analysis of different fusion strategies provides valuable insights into the effectiveness of the proposed approach. However, the paper lacks detailed statistical analysis of the results, which would strengthen the findings.
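For context on the reported EER figure, here is a minimal pure-Python sketch of how the equal error rate is conventionally computed from detection scores. It assumes higher scores mean "more likely bona fide" and is a textbook approximation (nearest crossing of the two error rates), not the paper's evaluation code.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER: the operating point where the false acceptance
    rate (spoofs scored at/above threshold) equals the false rejection rate
    (bona fide scored below threshold)."""
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = (1.0, None)  # (|FAR - FRR|, mean of the two rates at that point)
    for t in thresholds:
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2)
    return best[1]

bona = [0.9, 0.8, 0.7, 0.6]
spoof = [0.5, 0.4, 0.3, 0.65]
print(equal_error_rate(bona, spoof))  # → 0.25
```

An EER of 1.58% therefore means that at the balanced operating point, only about 1.58% of spoofed clips are accepted and 1.58% of genuine clips are rejected.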
The paper provides a comprehensive description of the experimental setup, including dataset preparation, model architecture, and training procedures. However, it lacks a public code repository or demo URL, which would enhance reproducibility. The absence of shared resources limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a single dataset (WildSVDD), which may not fully capture the diversity of real-world singing voice deepfakes. Additionally, while the paper discusses various fusion strategies, it does not explore the computational efficiency of these methods, which could be a concern for real-time applications. The authors could also provide more insights into the potential impact of noise and other artifacts in the audio data.
The research addresses a critical issue in the realm of audio synthesis and deepfake detection, with implications for copyright protection, content authenticity, and the broader field of audio forensics. The findings could inform future developments in anti-spoofing technologies and contribute to the establishment of standards for audio quality evaluation in deepfake detection.
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.
Primary: Communication University of China
All Institutions: Communication University of China, Ant Group, Chinese Academy of Sciences, Beijing Institute of Technology, Shanghai Jiao Tong University
The main contribution of this paper is the establishment of the AT-ADD Grand Challenge, which aims to enhance the robustness and generalization of audio deepfake detection systems across various audio types, thereby addressing critical gaps in current methodologies. This initiative is significant for its potential to drive forward research in audio forensics and improve the reliability of detection technologies in real-world applications.
The paper proposes a comprehensive evaluation framework for audio deepfake detection that includes two distinct tracks focusing on robustness in speech detection and generalization across all audio types. The methodology is well-structured, providing detailed descriptions of datasets, evaluation metrics, and baseline models, which are essential for fostering competitive research in the field. The emphasis on real-world applicability and the inclusion of diverse audio types are notable strengths.
The experimental design is robust, with extensive datasets constructed for both tracks, ensuring a fair evaluation of detection methods. The paper outlines the composition of training, development, and evaluation datasets, as well as the metrics used for performance assessment, which enhances the credibility of the challenge. However, specific results from preliminary experiments are not presented, which could have strengthened the evaluation.
The paper emphasizes reproducibility by providing standardized datasets and baseline models, along with clear rules for participation in the challenge. However, the lack of detailed implementation specifics for the proposed models limits the ability for external researchers to replicate results fully.
The paper does not address potential biases in the datasets or the limitations of the proposed methods in handling extreme variations in audio quality or types beyond those specified. Additionally, the challenge's closed setting may restrict innovation by limiting the use of external data.
The proposed AT-ADD challenge has the potential to significantly advance the field of audio deepfake detection by encouraging the development of more robust and generalizable detection systems. This is crucial in an era where synthetic audio poses increasing security and trust challenges. The focus on diverse audio types also opens avenues for research in multimedia forensics and secure communication.
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Hello Group Inc.
The paper presents CapTalk, a novel framework for voice design that significantly enhances dialogue speech generation capabilities. The comprehensive evaluation and innovative methodologies contribute meaningfully to the field of controllable speech synthesis, addressing key challenges in expressive and context-aware voice generation.
The paper introduces CapTalk, a unified caption-conditioned text-audio autoregressive framework that innovatively extends voice design to both single-utterance and dialogue settings. The methodology includes a hierarchical variational conditioning module that effectively balances timbre preservation and contextual adaptation, which is a significant advancement over existing methods that primarily focus on single-utterance generation. The use of CoT control sequences for turn-level expressive control in dialogue is particularly noteworthy, as it allows for dynamic adjustments based on conversational context.
The experiments are comprehensive, demonstrating state-of-the-art performance on benchmarks for single-utterance voice design and improved expression controllability in dialogue settings. The authors employ both automatic and human evaluations, which adds robustness to their findings. The detailed evaluation protocol for dialogue generation is a valuable contribution, addressing gaps in existing benchmarks.
The paper lacks detailed implementation specifics that would enhance reproducibility, such as hyperparameters, training procedures, and data preprocessing steps. While the architecture is described, additional details on the training setup would be beneficial for other researchers looking to replicate or build upon this work.
The reliance on the quality of captions generated by Qwen3-Omni could introduce biases or inaccuracies, affecting the overall performance of the model. Additionally, the training data's focus on conversational speech may limit the model's expressive range compared to acted-style speech, which could be addressed in future work.
The advancements in voice design through CapTalk have the potential to significantly enhance human-computer interaction, making conversational agents more expressive and context-aware. This could lead to more natural and engaging user experiences in applications such as virtual assistants, gaming, and interactive storytelling.
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxa, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.
Primary: Muséum National d'Histoire Naturelle
All Institutions: Muséum National d'Histoire Naturelle, Sebitoli Chimpanzee Project, Uganda Wildlife Authority, Nitidae Association, Centre d'Ecologie et des Sciences de la Conservation, Institut de Systématique, Evolution, Biodiversité
The main contribution of this paper is the development of DeepForestSound (DFS), a multi-species automatic detection model that significantly enhances the capabilities of passive acoustic monitoring in African tropical forests. The innovative use of semi-supervised learning and LoRA-based fine-tuning addresses critical challenges in biodiversity monitoring, particularly for underrepresented taxa, thereby advancing the field of ecoacoustics and conservation technology.
The methodology employed in this study is robust and innovative, leveraging a semi-supervised clustering approach to generate labeled datasets from unannotated acoustic recordings. The use of Low-Rank Adaptation (LoRA) for fine-tuning the Audio Spectrogram Transformer (AST) is particularly noteworthy, as it allows for efficient adaptation to the specific acoustic characteristics of the target taxa in a data-scarce environment. The detailed description of the data collection process, including ethical considerations and the integration of multiple datasets, enhances the credibility of the study. However, the absence of a systematic sensitivity analysis for hyperparameters and the lack of an ablation study to isolate the contributions of different components are notable gaps.
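The low-rank adaptation mechanism mentioned above can be illustrated framework-free. The toy code below shows only the core LoRA idea (a frozen weight plus a scaled low-rank update, with the standard zero-initialized up-projection so the adapted layer starts identical to the frozen one); it is a sketch under those assumptions, not the authors' AST fine-tuning code.

```python
# Minimal framework-free sketch of a LoRA-adapted linear layer (illustrative).
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(x, W, A, B, alpha=8, r=2):
    """y = W x + (alpha/r) * B (A x): frozen base projection W plus a
    trainable low-rank update. Only A (r x d_in) and B (d_out x r) are
    trained, so the parameter count scales with r, not with d_out * d_in."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    s = alpha / r
    return [b + s * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]        # frozen 2x2 base weight (toy)
A = [[0.1, 0.2], [0.3, 0.4]]        # trainable down-projection, rank r=2
B_zero = [[0.0, 0.0], [0.0, 0.0]]   # standard zero init for the up-projection
x = [2.0, 3.0]
print(lora_forward(x, W, A, B_zero))  # → [2.0, 3.0]: identical to the frozen layer at init
```

This zero-init property is why LoRA fine-tuning starts from exactly the pretrained model's behavior, which matters in data-scarce settings like the one described here.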
The experimental evaluation is comprehensive, with a clear focus on assessing the model's performance across various taxa, particularly in the context of non-avian species where existing models typically underperform. The results demonstrate that DFS outperforms baseline models, particularly for primates and elephants, which are often neglected in general-purpose ecoacoustic models. The use of Average Precision (AP) and best F1 scores as evaluation metrics is appropriate for the task. However, the evaluation is limited to a single ecosystem, which may affect the generalizability of the findings.
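The Average Precision metric used in the evaluation has a simple textbook form, sketched here in pure Python for reference (illustrative, with ties broken by input order; the authors presumably use a library implementation):

```python
def average_precision(scores, labels):
    """AP as the mean of the precision values taken at the rank of each
    positive example, with examples sorted by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
print(average_precision(scores, labels))  # ≈ 0.833 (mean of 1/1 and 2/3)
```

An AP near 0.96, as reported for primates and elephants, thus indicates that detections are ranked almost perfectly ahead of non-detections across operating thresholds.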
The paper provides sufficient details regarding the training process, data preprocessing, and model architecture, which supports reproducibility. The code and pretrained models are made publicly available, which is a positive aspect for the research community. However, the specific configurations for hyperparameters and augmentation strategies could benefit from clearer documentation to facilitate replication.
The study has several limitations, including the focus on a single geographical area, which may restrict the applicability of the model to other tropical forest ecosystems. Additionally, while the model shows strong performance for the selected taxa, its ability to generalize to other species or soundscapes remains untested. The reliance on manual validation for the semi-supervised pipeline may introduce biases or inconsistencies in the labeled data.
The potential applications of DFS are significant, particularly in conservation efforts aimed at monitoring endangered species in tropical forests. By providing a practical tool for biodiversity assessment, DFS could facilitate more effective conservation strategies and contribute to the understanding of ecosystem dynamics. The study highlights the importance of tailored machine learning approaches in addressing specific ecological challenges, which could inspire further research in similar contexts.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.
Primary: NIO
All Institutions: NIO
The paper presents a principled framework for optimizing LLM-based ASR systems through entropy allocation, significantly enhancing performance while mitigating hallucinations. The comprehensive methodology and robust experimental results position this work as a meaningful advancement in the intersection of speech recognition and language modeling.
The paper introduces a novel perspective on entropy allocation in LLM-based ASR systems, proposing a multi-stage training paradigm that emphasizes capability-boundary awareness. The methodology is well-grounded in theoretical insights, particularly the use of entropy metrics (NSE, PAI, CSAI) to diagnose and optimize the interaction between speech encoders and LLMs. The iterative asynchronous SFT (IA-SFT) stage is a significant innovation that mitigates representation drift and enhances model robustness against hallucinations. The approach is systematic and addresses key challenges in ASR, such as efficiency and accuracy, making it a valuable contribution to the field.
The experiments are comprehensive, covering multiple benchmarks in both Mandarin and English, and demonstrate competitive performance against state-of-the-art models with significantly fewer parameters. The evaluation metrics (CER and WER) are appropriate for the tasks, and the results indicate that the proposed method not only achieves high accuracy but also effectively reduces hallucination rates. The empirical analysis of metric dynamics throughout the training stages provides strong evidence supporting the claims made about the benefits of the proposed methodology.
The paper provides detailed training setups, including data statistics, model architectures, and training configurations. However, the lack of publicly available code or a demo URL limits reproducibility. Future work could benefit from sharing the model and training scripts to allow other researchers to validate and build upon these findings.
While the paper presents a robust framework, it does not address the scalability of the proposed methods to larger datasets or more complex ASR tasks beyond the benchmarks used. Additionally, the reliance on specific metrics for entropy allocation may not capture all nuances of model performance in diverse real-world scenarios.
The findings have significant implications for the deployment of LLM-based ASR systems in industrial applications, particularly in enhancing efficiency and reducing operational costs. The approach could lead to more reliable speech recognition systems that are better suited for real-time applications, thereby improving user experiences across various domains such as customer service, transcription services, and accessibility technologies.
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, AISpeech Ltd, Nanjing University
The main contribution of this paper is the introduction of TASU2, a controllable CTC simulation framework that enhances speech LLM post-training through improved alignment and low-resource adaptation. This work presents a significant advancement in the field of speech recognition, particularly for scenarios lacking extensive audio-text pairs, and demonstrates the potential for more effective and efficient training methodologies.
The methodology of TASU2 is well-structured, introducing a novel framework for generating controllable CTC posteriors from text. The use of WER conditioning to simulate CTC posteriors is innovative, addressing the limitations of previous methods like TASU. The architecture leverages a Transformer-based model, which is appropriate for the task, and the training process is clearly defined, including the use of distribution-level supervision. The paper effectively bridges the gap between text-derived supervision and acoustic decoding, enhancing the potential for low-resource adaptation.
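The WER-conditioned supervision described above can be illustrated with a minimal sketch. This is a hypothetical toy version of only the error-injection step (not TASU2's actual CTC posterior simulation): the error budget is split evenly across substitutions, deletions, and insertions so that the expected word error rate of the corrupted transcript matches a target value. All names and ratios are illustrative.

```python
import random

def corrupt_transcript(tokens, target_wer, seed=0):
    """Inject substitutions, deletions, and insertions at a target WER.

    Hypothetical sketch: the error budget is split evenly across the three
    error types, so the expected word error rate is roughly `target_wer`.
    """
    rng = random.Random(seed)
    vocab = ["<err%d>" % i for i in range(100)]  # placeholder error tokens
    out = []
    for tok in tokens:
        r = rng.random()
        if r < target_wer / 3:            # substitution
            out.append(rng.choice(vocab))
        elif r < 2 * target_wer / 3:      # deletion
            pass
        else:
            out.append(tok)
            if r > 1 - target_wer / 3:    # insertion
                out.append(rng.choice(vocab))
    return out

clean = ["the", "cat", "sat"] * 100
noisy = corrupt_transcript(clean, target_wer=0.3)
```

A curriculum in this spirit would sweep `target_wer` from low to high to smoothly vary supervision difficulty, which is the controllability the paper emphasizes.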
The experiments are comprehensive, covering various adaptation scenarios and demonstrating clear improvements over the baseline TASU and other strong methods. The evaluation metrics are appropriate, and the results are presented clearly, showing consistent gains in both in-domain and out-of-domain recognition. However, the paper could benefit from more detailed comparisons with additional state-of-the-art methods to further validate the effectiveness of TASU2.
The paper provides sufficient details regarding the architecture and training process, which should allow for reproducibility. However, the absence of a publicly available code repository or demo limits the practical reproducibility of the results. Including a link to a GitHub repository or similar would enhance the paper's impact.
One limitation is the reliance on a single dataset (LibriSpeech) for training and evaluation, which may affect the generalizability of the results. Additionally, while the WER conditioning is a significant improvement, the method may still struggle with extreme cases of noise or distortion in real-world applications. The paper does not address potential scalability issues when applied to larger datasets or more complex tasks.
The proposed TASU2 framework has significant implications for low-resource speech recognition, particularly in domains where paired audio-text data is scarce. The ability to simulate CTC posteriors with controlled error rates could facilitate advancements in speech technology for various applications, including medical transcription and assistive technologies for individuals with speech impairments. This work could lead to more robust and adaptable speech recognition systems in real-world scenarios.
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
Primary: Unknown
All Institutions: Unknown
The paper presents a novel Teacher-Guided Dual-Path framework for audio-visual representation learning, significantly improving state-of-the-art performance in zero-shot retrieval tasks. The comprehensive methodology and experimental validation highlight its potential impact on the field, addressing critical challenges in cross-modal alignment and semantic noise reduction.
The proposed TG-DP framework effectively decouples the objectives of masked reconstruction and contrastive learning into separate optimization paths. This dual-path approach allows for tailored visibility patterns that enhance cross-modal alignment while mitigating semantic noise and optimization interference. The introduction of a teacher-student mechanism further enriches the training process by providing structured guidance, which is a noteworthy advancement in the field. The methodology is well-structured and addresses existing challenges in audio-visual representation learning.
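The disentangled masking regimes at the heart of this dual-path design can be sketched in a few lines. This is a hypothetical illustration, not TG-DP's implementation: each branch draws its own independent visibility mask over the patch tokens, with a sparse MAE-style keep ratio for reconstruction and a denser one for the contrastive path. The keep ratios and function names are assumptions.

```python
import numpy as np

def branch_masks(num_tokens, recon_keep=0.25, contrast_keep=0.5, seed=0):
    """Draw independent visibility masks for the two optimization paths.

    Hypothetical sketch of decoupled masking: the reconstruction branch
    keeps a sparse random subset of patches, while the contrastive branch
    draws its own, denser visibility pattern. Ratios are illustrative.
    """
    rng = np.random.default_rng(seed)

    def mask(keep):
        m = np.zeros(num_tokens, dtype=bool)
        visible = rng.choice(num_tokens, int(num_tokens * keep), replace=False)
        m[visible] = True
        return m

    return mask(recon_keep), mask(contrast_keep)

# e.g. a 14x14 ViT patch grid yields 196 tokens
recon, contrast = branch_masks(196)
assert int(recon.sum()) == 49 and int(contrast.sum()) == 98
```

Because the two masks are sampled independently, the contrastive branch is no longer tied to patches chosen for reconstruction, which is precisely the coupling the paper identifies as a source of semantic noise.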
The experiments are comprehensive, utilizing large-scale datasets such as AudioSet-2M and VGGSound. The results demonstrate significant improvements in zero-shot retrieval performance, achieving state-of-the-art results across various metrics. The ablation studies provide valuable insights into the effectiveness of the proposed components, such as the dual-path structure and teacher-guided masking strategy. However, the paper could benefit from more detailed comparisons with additional baselines to further validate the claims.
The paper provides a clear description of the methodology and experimental setup, including hyperparameters and data preprocessing steps. The availability of code on GitHub enhances reproducibility. However, the lack of detailed information on the training environment and specific configurations may pose challenges for complete replication.
The primary limitation is the unknown primary institution and the lack of citation context, which may hinder the paper's visibility and impact in the academic community. Additionally, the performance improvements, while significant, may still be context-dependent and require further validation across diverse tasks and datasets.
The advancements in audio-visual representation learning have the potential to enhance various applications, including multimedia retrieval, content-based recommendation systems, and interactive AI systems. The proposed framework could lead to more robust models that understand and integrate audio-visual information, paving the way for future research and applications in multimodal AI.
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
Primary: Shanghai Jiao Tong University
All Institutions: EPIC Lab, Shanghai Jiao Tong University, Xidian University, HKUST (GZ)
The main contribution of this paper is the introduction of AudioKV, a novel framework for efficient KV cache management in audio-language models, which significantly enhances performance while reducing memory usage. This work represents a meaningful advancement in the field of audio processing, demonstrating innovative methodologies that address critical challenges in deploying large-scale models effectively.
The proposed methodology, AudioKV, innovatively addresses the inefficiencies of Key-Value (KV) cache management in Large Audio-Language Models (LALMs) by introducing a dual mechanism: audio-aware KV cache allocation and Spectral Score Smoothing (SSS). The former identifies and prioritizes audio-critical attention heads based on their relevance to acoustic modeling, while the latter employs a frequency-domain approach to stabilize importance score estimation. This dual approach is particularly effective in the audio domain, where temporal continuity is crucial, showcasing a thoughtful adaptation of existing techniques to a new modality.
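The frequency-domain smoothing idea can be made concrete with a small sketch. This is a hypothetical illustration of low-pass filtering importance scores via the FFT, in the spirit of Spectral Score Smoothing but not the paper's actual implementation; the cutoff ratio and names are assumptions.

```python
import numpy as np

def spectral_score_smoothing(scores, keep_ratio=0.1):
    """Low-pass filter per-token importance scores in the frequency domain.

    Hypothetical sketch: zero out high-frequency FFT bins to suppress noise
    and recover the smooth global trend before KV-cache token selection.
    """
    spectrum = np.fft.rfft(scores)
    cutoff = max(1, int(len(spectrum) * keep_ratio))
    spectrum[cutoff:] = 0.0  # discard high-frequency components
    return np.fft.irfft(spectrum, n=len(scores))

# noisy importance scores riding on a slow underlying trend
t = np.linspace(0, 1, 256)
trend = np.sin(2 * np.pi * t)
noisy = trend + 0.3 * np.random.default_rng(0).normal(size=t.size)
smooth = spectral_score_smoothing(noisy)
# the filtered scores track the trend more closely than the raw ones
assert abs(smooth - trend).mean() < abs(noisy - trend).mean()
```

Token selection driven by the smoothed scores would then be less dominated by isolated attention spikes, which is the "balanced token selection" behavior the paper attributes to SSS.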
The experiments conducted across various benchmarks, including Automatic Speech Recognition (ASR) and Speech Translation (ST), demonstrate that AudioKV significantly outperforms existing methods, particularly under aggressive compression scenarios. The results indicate not only improved accuracy but also enhanced robustness against performance degradation, which is critical for practical applications. The use of diverse datasets strengthens the validity of the findings, although the paper could benefit from more extensive comparisons with a broader range of state-of-the-art methods.
The paper mentions that the code will be released upon acceptance, which is a positive aspect for reproducibility. However, the lack of a demo URL or a project repository at this stage limits immediate access to the implementation details. The methodology is described in sufficient detail to allow for replication, but actual code availability will be crucial for broader adoption and validation of the results.
One limitation is the potential for overfitting to specific datasets, as the performance improvements are primarily demonstrated on selected benchmarks. Additionally, while the method shows promise in maintaining accuracy at high compression ratios, the paper does not thoroughly explore the trade-offs involved in different compression strategies or the impact on latency and real-time processing capabilities.
The implications of this work extend to various applications in speech processing and multimodal AI systems, where efficient inference is paramount. By improving the efficiency of LALMs, this research could facilitate the deployment of advanced audio processing systems in resource-constrained environments, such as mobile devices or real-time applications.
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
Primary: University of California San Diego
All Institutions: University of California San Diego
The main contribution of this paper is the development of a real-time human-AI musical co-performance system that effectively generates instrumental accompaniment using latent diffusion models, addressing critical latency challenges while maintaining musical coherence and quality. This work significantly advances the field of AI-driven music generation by providing a practical solution for live performance contexts, showcasing the potential for AI to enhance creative collaboration in music.
The paper introduces a novel framework for real-time human-AI musical co-performance using latent diffusion models (LDMs) integrated with MAX/MSP for low-latency audio processing. The methodology is well-articulated, detailing a sliding-window look-ahead protocol that enables the model to generate audio segments ahead of playback, thus addressing the latency challenges inherent in real-time music generation. The use of consistency distillation to enhance inference speed is particularly noteworthy, as it allows the model to maintain real-time capabilities while generating high-quality audio. The integration of a model-agnostic MAX/MSP external and a ready-to-use performance patch further enhances the practical applicability of the research.
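The latency constraint in such a sliding-window protocol reduces to a simple scheduling check, sketched below. This is a hypothetical simplification, not the paper's system: the model must finish sampling a window before the playback head reaches the point that window covers, and must sustain that rate window after window. Window and look-ahead values are illustrative; the 5.4x speedup figure is the one reported in the abstract.

```python
def realtime_feasible(window_s, lookahead_s, sampling_time_s):
    """Check the timing budget of a sliding-window accompaniment loop.

    Hypothetical sketch: generated audio starting `lookahead_s` ahead of
    the playback head must be sampled within that look-ahead budget, and
    each `window_s`-long window must be sampled in under `window_s` of
    wall-clock time to keep up with playback.
    """
    return sampling_time_s <= lookahead_s and sampling_time_s <= window_s

# an illustrative 2.7 s sampler misses a 1 s look-ahead budget; after the
# reported 5.4x consistency-distillation speedup it fits comfortably
base = 2.7
distilled = base / 5.4  # 0.5 s
assert not realtime_feasible(window_s=2.0, lookahead_s=1.0, sampling_time_s=base)
assert realtime_feasible(window_s=2.0, lookahead_s=1.0, sampling_time_s=distilled)
```

This framing also makes the paper's central trade-off explicit: a deeper look-ahead buys more sampling time but forces the model to predict further from its context, degrading generation quality.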
The experiments are thorough, utilizing the Slakh2100 dataset and comparing the proposed models against established baselines such as StreamMusicGen. The evaluation metrics employed, including COCOLA scores for musical coherence, Beat F1 scores for rhythmic alignment, and Fréchet Audio Distance (FAD) for audio quality, provide a comprehensive assessment of the models' performance across different look-ahead configurations. The results demonstrate that the proposed models perform competitively, especially in the Look-ahead regime, indicating their effectiveness in real-time scenarios.
The authors provide detailed implementation information, including model architecture, training procedures, and evaluation metrics, which enhances reproducibility. The availability of code repositories and pre-trained model checkpoints further supports this aspect, allowing other researchers to replicate the study and build upon the findings.
One limitation is the reliance on a specific dataset (Slakh2100), which may not fully represent the diversity of musical styles and contexts encountered in real-world applications. Additionally, while the paper addresses latency effectively, the trade-offs between look-ahead depth and generation quality may still pose challenges in more complex musical scenarios. The subjective evaluation of generated music quality could also benefit from more extensive human listener studies.
The framework developed in this paper has significant implications for the future of human-AI collaboration in music performance, potentially transforming how musicians interact with AI systems in live settings. By bridging the gap between advanced generative models and real-time performance environments, this research opens avenues for innovative musical expressions and collaborative practices.
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recent success of attention models, this study introduces a dual-path attention module in the bottleneck layer of a concurrent speech enhancement network. Our study proposes an attention-based dual-path RNN (DAT-RNN), which, when combined with the modified complex-valued frequency transformation network (CFTNet), forms the DAT-CFTNet. This attention mechanism allows for precise differentiation between speech and noise in time-frequency (T-F) regions of spectrograms, optimizing both local and global context information processing in the CFTNet. Our experiments suggest that the DAT-CFTNet leads to consistently improved performance over the existing models, including CFTNet and DCCRN, in terms of speech intelligibility and quality. Moreover, the proposed model exhibits superior performance in enhancing speech intelligibility for cochlear implant (CI) recipients, who are known to have severely limited T-F hearing restoration (e.g., >10%). CI listener studies in noisy settings show the proposed solution is capable of suppressing non-stationary noise, avoiding the musical artifacts often seen in traditional speech enhancement methods. The implementation of the proposed model will be publicly available.
Primary: Chittagong University of Engineering and Technology
All Institutions: Chittagong University of Engineering and Technology
The main contribution of this research is the introduction of the DAT-CFTNet, which effectively enhances speech intelligibility for cochlear implant users through an innovative dual-path attention mechanism. This work represents a significant step forward in speech enhancement technologies, particularly in challenging acoustic environments.
The proposed methodology introduces a novel dual-path attention mechanism integrated into a complex-valued frequency transformation network (CFTNet), which is a significant advancement in the field of speech enhancement, particularly for cochlear implant users. The combination of intra-chunk and inter-chunk RNNs with attention modules allows for enhanced modeling of speech and noise dynamics in time-frequency representations. The detailed architecture and the rationale behind the design choices are well articulated, showcasing a thoughtful approach to addressing the limitations of existing models.
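The intra-chunk/inter-chunk structure underlying the dual-path design can be sketched with a simple reshaping step. This is a hypothetical illustration, not the DAT-RNN implementation: the feature sequence is padded and split into fixed-size chunks, after which an intra-chunk RNN would model local context along one axis and an inter-chunk RNN global context along the other. Chunk size and names are assumptions.

```python
import numpy as np

def dual_path_chunks(x, chunk):
    """Split a [T, F] feature sequence for dual-path (intra/inter) processing.

    Hypothetical sketch: pad T to a multiple of `chunk`, then reshape into
    [num_chunks, chunk, F]. An intra-chunk RNN runs along axis 1 (local
    context within a chunk); an inter-chunk RNN runs along axis 0 (global
    context across chunks).
    """
    T, F = x.shape
    pad = (-T) % chunk
    x = np.pad(x, ((0, pad), (0, 0)))
    return x.reshape(-1, chunk, F)

feats = np.random.default_rng(2).normal(size=(250, 64))  # T=250 frames
chunks = dual_path_chunks(feats, chunk=32)
assert chunks.shape == (8, 32, 64)  # 250 frames padded to 256 = 8 x 32
```

The attention modules described in the paper would then operate within and across these chunks, which is what lets the bottleneck jointly capture local and global T-F structure.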
The experiments are robust, employing a comprehensive dataset that includes various noise conditions and SNR levels. The evaluation metrics used (STOI, PESQ, SISDR) are appropriate for assessing speech intelligibility and quality. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and a discussion on the statistical significance of the results.
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of a specific dataset and the architecture of the model, there are no code repositories or links to a demo that would allow other researchers to replicate the findings. Providing access to the model and training scripts would greatly enhance reproducibility.
One limitation is the reliance on objective metrics without a thorough subjective evaluation involving human listeners. While objective scores are important, subjective assessments are crucial for applications in speech enhancement, especially for cochlear implant users. Additionally, the model's complexity may limit its applicability in real-time scenarios, which is a critical factor for practical implementations.
The proposed DAT-CFTNet has the potential to significantly improve the quality of life for cochlear implant recipients by enhancing speech intelligibility in noisy environments. This advancement could lead to better communication and social interactions for individuals with hearing impairments. The public availability of the model also encourages further research and development in the field.
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
Primary: Georgia Institute of Technology
All Institutions: Georgia Institute of Technology, University of Amsterdam
This paper presents a novel automatic framework for detecting speaker drift in synthesized speech, bridging geometric signal analysis with LLM-based perceptual reasoning. The comprehensive methodology, combined with strong experimental validation, positions this work as a significant contribution to the field of audio and speech synthesis, addressing a critical challenge in TTS systems.
The proposed methodology introduces a novel framework for detecting speaker drift in synthesized speech by formulating it as a binary classification task. The use of cosine similarity to assess speaker identity consistency is theoretically grounded, and the integration of large language models (LLMs) for perceptual reasoning is innovative. The construction of a synthetic benchmark dataset with human-validated annotations further strengthens the methodology, allowing for systematic evaluation of the proposed approach. However, the reliance on synthetic data may limit the generalizability of the findings.
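The cosine-based criterion described above can be illustrated with a minimal sketch. This is a hypothetical simplification of the pipeline, not the paper's method: segment-level speaker embeddings are L2-normalized onto the unit sphere, each segment is compared against the first (anchor) segment, and the utterance is flagged as drifting if any similarity falls below a threshold. The embedding dimension, threshold, and anchor-based comparison are assumptions.

```python
import numpy as np

def detect_drift(embeddings, threshold=0.75):
    """Flag speaker drift from a sequence of segment-level speaker embeddings.

    Hypothetical sketch: normalize embeddings onto the unit sphere, compute
    cosine similarity of every segment against the first (anchor) segment,
    and label the utterance as drifting if any similarity drops below
    `threshold`. Returns (is_drifting, similarities).
    """
    emb = np.asarray(embeddings, dtype=float)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb[1:] @ emb[0]  # cosine similarities to the anchor segment
    return bool((sims < threshold).any()), sims

rng = np.random.default_rng(1)
anchor = rng.normal(size=192)
stable = [anchor + 0.05 * rng.normal(size=192) for _ in range(5)]
drifted = stable[:3] + [-anchor]  # a mismatched identity appears mid-utterance
assert detect_drift(stable)[0] is False
assert detect_drift(drifted)[0] is True
```

In the paper's full pipeline these similarity profiles are then passed, as structured representations, to an LLM for perceptual reasoning rather than being thresholded directly.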
The experimental setup is robust, utilizing a well-defined dataset and comparing the proposed method against fixed-threshold and PCA-based baselines. The results demonstrate a significant improvement in performance metrics (F1 score) when using the LLM-driven approach, indicating the effectiveness of the proposed method. The ablation studies provide valuable insights into the impact of different design choices on performance, reinforcing the validity of the findings.
While the paper provides a detailed description of the methodology and experimental setup, the absence of a publicly available code repository or dataset limits reproducibility. Future work should include making the dataset and code accessible to facilitate further research in this area.
One notable limitation is the reliance on synthetic data for training and evaluation, which may not fully capture the complexities of real-world speaker drift scenarios. Additionally, the framework's performance may vary with different TTS models, and further validation on diverse datasets is needed to establish its robustness.
The detection of speaker drift has significant implications for improving the quality and coherence of synthesized speech in various applications, including virtual assistants and interactive dialogue systems. By addressing this underexplored issue, the work contributes to enhancing user experience in TTS systems, paving the way for more reliable and natural-sounding synthetic speech.
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University, Fudan University, Wheatland Culture and Media Ltd.
This paper presents a significant advancement in controllable singing style conversion through innovative methodologies that address key challenges in the field. The combination of a boundary-aware semantic bottleneck, explicit technique control, and high-frequency band completion strategies demonstrates a comprehensive approach to improving the quality and fidelity of singing voice conversion systems.
The proposed methodology introduces a boundary-aware semantic bottleneck that effectively mitigates style leakage in singing voice conversion, which is a significant challenge in the field. The explicit frame-level technique matrix enhances control over dynamic styles, while the high-frequency band completion strategy addresses data scarcity issues. The integration of these components demonstrates a thoughtful approach to improving the quality and fidelity of converted singing voices, making the methodology both innovative and practical.
The experimental evaluation is robust, utilizing subjective metrics such as Mean Opinion Score (MOS) to assess naturalness and similarity, which are critical for audio applications. The results indicate that the proposed system outperforms other submissions in naturalness while maintaining competitive performance in speaker similarity and technique control. The ablation studies further validate the effectiveness of the proposed methods, providing a clear understanding of their contributions.
The paper includes sufficient implementation details and provides a GitHub repository for code access, which enhances reproducibility. The use of standard datasets and well-defined training protocols also supports the replicability of the results.
One limitation is the reliance on the official SVCC2025 dataset, which may not generalize well to other datasets or real-world applications. Additionally, while the system achieves high naturalness, there is a noted gap in identity similarity compared to top-performing systems that utilized larger external datasets.
The advancements in controllable singing style conversion have significant implications for music production, voice synthesis, and entertainment industries. The ability to manipulate singing styles with high fidelity can enhance creative expression and provide new tools for artists and producers.
In biometric systems, it is common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks have been recognized as potential security risks for biometric systems. However, most research on morph attacks has focused on biometric modalities that operate within the image domain, such as the face, fingerprints, and iris. In this work, we introduce Time-domain Voice Identity Morphing (TD-VIM), a novel approach for voice-based biometric morphing. This method blends the voice characteristics of two distinct identities at the signal level, creating morphed samples that pose a high vulnerability to speaker verification systems. Leveraging the Multilingual Audio-Visual Smartphone (MAVS) database, we created four distinct morphed signals at different morphing factors and evaluated their effectiveness through a comprehensive vulnerability analysis. To assess the security impact of TD-VIM, we benchmarked our approach using the Generalized Morphing Attack Potential (G-MAP) metric, measuring attack success across two deep-learning-based Speaker Verification Systems (SVS) and one commercial system, Verispeak. Our findings indicate that the morphed voice samples achieved a high attack success rate, with G-MAP values reaching 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios, at a false match rate of 0.1%.
Primary: IIT Kharagpur
All Institutions: IIT Kharagpur
This work introduces a novel morphing technique for voice biometrics that significantly enhances the potential for attacks on speaker verification systems. The comprehensive evaluation of the TD-VIM method across various devices and languages demonstrates its effectiveness and raises critical security concerns in the field of biometric authentication.
The proposed Time-Domain Voice Identity Morphing (TD-VIM) method innovatively performs morphing at the signal level, circumventing the limitations of previous feature-based approaches. By selecting portions of voice signals and averaging them, the method achieves a language and backbone independence that enhances its applicability across diverse speaker verification systems. The methodology is well-structured, with clear steps outlined for signal selection, preprocessing, and morphing, although the paper could benefit from more detailed mathematical formulations and justifications for the choices made during these processes.
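The signal-level blending described above can be illustrated with a minimal sketch. This is not the authors' TD-VIM implementation (which also involves signal selection and preprocessing steps); it only shows the core idea of weighting two time-domain signals by a hypothetical morphing factor `alpha`:

```python
import numpy as np

def morph_waveforms(sig_a: np.ndarray, sig_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Blend two time-domain signals with morphing factor alpha.

    alpha = 0.0 returns sig_a; alpha = 1.0 returns sig_b.
    Signals are truncated to the shorter length before blending.
    """
    n = min(len(sig_a), len(sig_b))
    return (1.0 - alpha) * sig_a[:n] + alpha * sig_b[:n]

# Toy example: sine tones stand in for two speakers' voice signals
sr = 16000
t = np.arange(sr) / sr
voice_a = np.sin(2 * np.pi * 120 * t)  # lower-pitched "speaker A"
voice_b = np.sin(2 * np.pi * 210 * t)  # higher-pitched "speaker B"
morphed = morph_waveforms(voice_a, voice_b, alpha=0.5)
```

Varying `alpha` produces the family of morphed signals evaluated in the paper's vulnerability analysis.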
The experiments are comprehensive, utilizing a robust dataset (MAVS) and multiple speaker verification systems (SVS) to evaluate the effectiveness of the TD-VIM approach. The use of the Generalized Morph Attack Potential (G-MAP) metric provides a solid framework for quantifying the vulnerability of SVS to morphing attacks. Results indicate high attack success rates across different devices and languages, demonstrating the method's effectiveness. However, the paper could improve by including more comparative analyses with existing methods to highlight its advantages.
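As a rough illustration of the success criterion behind metrics like G-MAP, the sketch below counts a morph as successful only if it verifies against both contributing identities at a threshold calibrated to the target false match rate. The full G-MAP definition additionally aggregates over attempts and systems; the scores and threshold here are toy values:

```python
import numpy as np

def attack_success_rate(scores_a, scores_b, threshold):
    """Fraction of morphs whose verification score clears the decision
    threshold against BOTH contributing identities.

    scores_a[i], scores_b[i]: similarity of morph i to identity A / B.
    threshold: calibrated to a target false match rate (e.g. 0.1%).
    """
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    return float(((scores_a >= threshold) & (scores_b >= threshold)).mean())

# Toy scores: 3 of 4 morphs fool the verifier against both identities
rate = attack_success_rate([0.9, 0.8, 0.7, 0.4], [0.85, 0.9, 0.75, 0.9], threshold=0.6)
# rate == 0.75
```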
The authors provide access to the source code and morphed samples upon request, which is a positive aspect for reproducibility. However, the paper lacks detailed instructions on how to replicate the experiments fully, such as specific configurations and parameter settings used during the experiments.
One limitation is the reliance on a specific dataset (MAVS), which may not generalize to all voice biometric systems. Additionally, the paper does not address potential ethical concerns related to the misuse of morphing techniques in biometric systems. The impact of different environmental factors on the morphing effectiveness is also not explored, which could affect real-world applications.
The findings of this research have significant implications for the security of voice biometric systems, particularly in sensitive applications like banking and finance. By highlighting vulnerabilities, the work encourages the development of more robust verification systems and raises awareness about the potential for morphing attacks. The proposed method could lead to advancements in biometric security measures, prompting further research into countermeasures against such vulnerabilities.
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the severe error accumulation inherent to autoregressive models, leading to poor music quality and structural integrity. In this paper, we propose the Anchored Cyclic Generation (ACG) paradigm, which relies on anchor features from already generated music to guide subsequent generation during the autoregressive process, effectively mitigating error accumulation. Based on the ACG paradigm, we further propose the Hierarchical Anchored Cyclic Generation (Hi-ACG) framework, which employs a systematic global-to-local generation strategy and is highly compatible with our specifically designed piano token, an efficient musical representation. Experimental results demonstrate that, compared to traditional autoregressive models, the ACG paradigm reduces the cosine distance between predicted feature vectors and ground-truth semantic vectors by an average of 34.7%. In long-sequence symbolic music generation tasks, the Hi-ACG framework significantly outperforms existing mainstream methods in both subjective and objective evaluations. Furthermore, the framework exhibits excellent task generalization, achieving superior performance on related tasks such as music completion.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to long-sequence symbolic music generation through the Anchored Cyclic Generation paradigm, demonstrating significant improvements in quality and structural integrity. The methodology is innovative and well-supported by experimental results, marking a meaningful contribution to the field of machine learning in music generation.
The paper introduces the Anchored Cyclic Generation (ACG) paradigm, which effectively addresses the error accumulation problem in autoregressive models for long-sequence symbolic music generation. The methodology is well-structured, employing a hierarchical approach through the Hi-ACG framework that combines global and local generation strategies. The use of a novel piano token representation enhances efficiency and interpretability. The proposed methods are theoretically sound, supported by mathematical analysis, and demonstrate a clear innovation in the field of music generation.
The experimental evaluation is robust, utilizing both objective and subjective metrics to assess the performance of the proposed models against established baselines. The datasets used (MuseScore and POP909) are appropriate for the task, and the results indicate significant improvements in generation quality, as evidenced by a 34.7% reduction in cosine distance between predicted and ground-truth features. The comprehensive evaluation strategy enhances the credibility of the findings.
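The cosine-distance metric cited above can be computed as follows; the vectors here are toy stand-ins for the paper's predicted and ground-truth semantic feature vectors:

```python
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    """1 minus cosine similarity between two feature vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

pred = np.array([1.0, 0.0])  # toy predicted feature vector
gt = np.array([1.0, 1.0])    # toy ground-truth semantic vector
d = cosine_distance(pred, gt)  # 1 - 1/sqrt(2), about 0.293
```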
The paper provides sufficient details regarding the experimental setup, including model architecture, training procedures, and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing these resources to facilitate validation of results.
The paper acknowledges limitations in fine-grained control during generation and the potential loss of subtle timing nuances in the piano token representation. Additionally, the focus on piano music may restrict the applicability of the framework to other musical contexts. Future research should address these limitations by integrating more expressive tokens and extending the framework to multi-track music generation.
The proposed ACG paradigm has the potential to significantly advance the field of symbolic music generation, offering new avenues for creating high-quality, structurally coherent music. Its principles could be adapted to other long-sequence generation tasks beyond music, such as text generation and structured content synthesis, thereby broadening its impact across various domains.
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
Primary: National Taiwan University
All Institutions: National Taiwan University, NVIDIA Taiwan
The main contribution of this paper is the introduction of a joint fullband-subband modeling framework for high-resolution SingFake detection, which significantly enhances detection performance by leveraging the unique characteristics of singing voice audio. The methodology is innovative and addresses a pressing need in the field of audio forensics, making it a valuable addition to the literature.
The paper introduces a novel joint fullband-subband modeling framework, Sing-HiResNet, which effectively captures both global and localized spectral features for high-resolution SingFake detection. The methodology is well-structured, employing a two-phase approach that integrates fullband and subband models, and explores various fusion strategies to enhance detection performance. The use of high-resolution audio (44.1 kHz) is a significant advancement over conventional methods, and the systematic evaluation of subband contributions adds depth to the methodology. However, the paper could benefit from clearer explanations of the fusion strategies and their implications.
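A minimal sketch of the subband decomposition idea, assuming contiguous, equal-width frequency bands over an STFT magnitude spectrogram; the paper's actual band layout and per-band expert models are not reproduced here:

```python
import numpy as np

def split_subbands(spec: np.ndarray, n_bands: int):
    """Split a magnitude spectrogram (freq_bins x frames) into contiguous
    frequency subbands, one per band-specific expert."""
    edges = np.linspace(0, spec.shape[0], n_bands + 1, dtype=int)
    return [spec[edges[i]:edges[i + 1]] for i in range(n_bands)]

# Toy: 1025 frequency bins (2048-point FFT) split into 4 bands; at 44.1 kHz
# sampling, the top band covers content absent from 16 kHz-sampled input
spec = np.random.rand(1025, 100)
bands = split_subbands(spec, 4)
```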
The experiments are robust, utilizing the WildSVDD dataset to benchmark the proposed method against existing state-of-the-art systems. The results demonstrate a significant performance improvement over traditional 16 kHz models, achieving a state-of-the-art EER of 1.58%. The comparative analysis of different fusion strategies provides valuable insights into the effectiveness of the proposed approach. However, the paper lacks detailed statistical analysis of the results, which would strengthen the findings.
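The EER reported above is the operating point where false-acceptance and false-rejection rates are equal. A simple threshold-sweep approximation, using toy score lists rather than the paper's detector outputs:

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping candidate thresholds and taking the
    point where false-acceptance and false-rejection rates meet."""
    bona = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    best = 1.0
    for th in np.sort(np.concatenate([bona, spoof])):
        far = float(np.mean(spoof >= th))  # spoofs accepted as bona fide
        frr = float(np.mean(bona < th))    # bona fide wrongly rejected
        best = min(best, max(far, frr))
    return best

# Perfectly separated toy scores give EER 0; overlapping scores do not
print(equal_error_rate([0.9, 0.8], [0.1, 0.2]))
```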
The paper provides a comprehensive description of the experimental setup, including dataset preparation, model architecture, and training procedures. However, it lacks a public code repository or demo URL, which would enhance reproducibility. The absence of shared resources limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a single dataset (WildSVDD), which may not fully capture the diversity of real-world singing voice deepfakes. Additionally, while the paper discusses various fusion strategies, it does not explore the computational efficiency of these methods, which could be a concern for real-time applications. The authors could also provide more insights into the potential impact of noise and other artifacts in the audio data.
The research addresses a critical issue in the realm of audio synthesis and deepfake detection, with implications for copyright protection, content authenticity, and the broader field of audio forensics. The findings could inform future developments in anti-spoofing technologies and contribute to the establishment of standards for audio quality evaluation in deepfake detection.
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the \textbf{Binaural Difference Attention with Action Transition Prediction (BDATP)} framework, which jointly optimizes perception and policy. Specifically, the \textbf{Binaural Difference Attention (BDA)} module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the \textbf{Action Transition Prediction (ATP)} task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's superior generalization capability and its robustness across diverse navigation architectures.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The paper presents a novel framework for enhancing generalization in Audio-Visual Navigation through innovative attention mechanisms and action prediction strategies. The technical contributions are significant, addressing key challenges in the field and demonstrating strong empirical results, though improvements in reproducibility and application scope could further enhance its impact.
The proposed BDATP framework introduces two innovative components: the Binaural Difference Attention (BDA) module, which enhances spatial audio perception by focusing on interaural differences, and the Action Transition Prediction (ATP) task, which regularizes policy learning to improve generalization across unseen environments. This dual approach effectively addresses the limitations of existing AVN methods, particularly their tendency to overfit to specific training conditions. The methodology is well-structured, with clear explanations of how each component contributes to the overall framework.
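The interaural difference that the BDA module attends over is, at its core, the classical interaural level difference. The sketch below computes a per-bin log-energy difference between left- and right-channel spectrograms; the attention mechanism itself is not reproduced, and the array shapes are purely illustrative:

```python
import numpy as np

def interaural_level_difference(left: np.ndarray, right: np.ndarray,
                                eps: float = 1e-8) -> np.ndarray:
    """Per-bin log-energy difference (dB) between left- and right-channel
    magnitude spectrograms; positive values suggest a source to the left."""
    return 10.0 * np.log10((left ** 2 + eps) / (right ** 2 + eps))

# Toy spectrograms: left channel twice the magnitude of the right
left = np.full((3, 3), 2.0)
right = np.ones((3, 3))
ild = interaural_level_difference(left, right)  # ~6.02 dB everywhere
```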
The experiments are comprehensive, utilizing two well-known datasets (Replica and Matterport3D) to evaluate the effectiveness of BDATP. The authors provide a thorough comparison against several state-of-the-art baselines, demonstrating significant performance improvements in both heard and unheard sound categories. The metrics used (Success Rate, Success weighted by Path Length, and Success weighted by Number of Actions) are appropriate for the task and provide a clear picture of the framework's capabilities.
The paper lacks explicit details on the implementation, such as hyperparameters, training procedures, and code availability, which could hinder reproducibility. While the methodology is described in detail, providing access to the code and models would greatly enhance the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world environments. Additionally, while the proposed methods show strong performance in zero-shot settings, the paper does not address how the framework would perform in dynamic environments with moving sound sources or in multi-agent scenarios.
The BDATP framework has the potential to significantly advance the field of audio-visual navigation, particularly in applications involving robotics and autonomous systems. Its focus on generalization could lead to more robust navigation systems in real-world scenarios, enhancing the capabilities of embodied agents in complex environments.
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations -- GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper$\rightarrow$GPT-4o$\rightarrow$TTS) -- across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5\%); Gemini Live 3.1 achieves the fastest latency (4.25~s) but the lowest turn-take rate (78.0\%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12~s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
Primary: unknown
All Institutions: unknown
The paper introduces Full-Duplex-Bench-v3, a benchmark for evaluating real-time voice agents on multi-step tool execution using natural human speech. This work significantly contributes to the field by addressing the challenges of disfluency handling and tool use in voice interactions, paving the way for more effective and responsive AI systems.
The methodology is robust, introducing a novel benchmark (FDB-v3) that evaluates spoken language models under realistic conditions, utilizing real human audio annotated for disfluencies. The design incorporates multi-step tool use across various domains, which is a significant advancement over previous benchmarks that relied on synthetic data or single-step tasks. The systematic approach to scenario formulation and audio collection enhances the validity of the evaluation.
The experiments are comprehensive, evaluating six different model configurations across multiple dimensions such as accuracy, latency, and turn-taking dynamics. The results are well-presented, showing clear performance differences among models and highlighting specific strengths and weaknesses, particularly in handling disfluencies and multi-step reasoning. The use of deterministic mock APIs for evaluation is a strong point, ensuring that the results are not confounded by external factors.
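The Pass@1 metric used above is simply the fraction of scenarios solved correctly on the first attempt. A toy computation, with invented 0/1 outcomes chosen to reproduce a 0.6 score like the one reported for GPT-Realtime:

```python
def pass_at_1(first_attempt_ok):
    """Pass@1: fraction of scenarios whose first attempt executed the
    full API chain correctly (1 = success, 0 = failure)."""
    return sum(first_attempt_ok) / len(first_attempt_ok)

# Toy outcomes for 10 scenarios; 6 succeed on the first try
rate = pass_at_1([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # 0.6
```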
The paper provides sufficient detail regarding the experimental setup, including the models evaluated and the evaluation metrics used. However, the lack of specific implementation details or code availability limits reproducibility. The benchmark is open and reproducible, which is a positive aspect, but without access to the models, full replication of results may be challenging.
The study acknowledges limitations, such as the fixed server region for cloud-based evaluations and the lack of robustness testing against real-world network anomalies. Additionally, the dataset is relatively small (100 recordings), which may affect generalizability. The focus on specific disfluency categories may also overlook other potential challenges in real-world interactions.
This work has significant implications for the development of real-time voice agents, particularly in enhancing their ability to handle natural speech disfluencies and multi-step tasks. The findings suggest directions for future research, emphasizing the need for models that can balance speed and accuracy in dynamic conversational contexts. The benchmark itself could facilitate further advancements in the field by providing a standardized evaluation framework.
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of OmniSonic, a novel framework for generating comprehensive auditory scenes from video and text inputs, addressing previous limitations in audio generation models. This work significantly advances the field of audio synthesis by integrating multiple modalities and establishing a new benchmark for future research.
The proposed OmniSonic framework introduces a flow-matching-based diffusion model that effectively integrates video and text to generate comprehensive auditory scenes. The TriAttn-DiT architecture is a notable innovation, allowing simultaneous processing of on-screen environmental sounds, off-screen sounds, and speech conditions. The use of a Mixture-of-Experts (MoE) gating mechanism is a sophisticated approach that enhances the model's adaptability during audio generation. This methodology is well-structured and addresses the limitations of previous models, particularly in generating human speech alongside environmental sounds.
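The MoE gating described here can be sketched as a softmax-weighted mixture over the three cross-attention streams. This is an illustrative reconstruction, not the TriAttn-DiT implementation; the expert outputs and `gate_logits` are toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_mixture(expert_outputs, gate_logits):
    """Softmax-gated mixture of expert outputs, e.g. the on-screen,
    off-screen, and speech cross-attention streams."""
    w = softmax(np.asarray(gate_logits, dtype=float))
    return sum(wi * out for wi, out in zip(w, expert_outputs))

# Toy expert outputs (constant vectors) and gate logits
experts = [np.full(2, 1.0), np.full(2, 2.0), np.full(2, 3.0)]
balanced = gated_mixture(experts, [0.0, 0.0, 0.0])      # equal weights -> mean
dominant = gated_mixture(experts, [10.0, 0.0, 0.0])     # first expert dominates
```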
The authors present extensive experiments that demonstrate the superiority of OmniSonic over existing state-of-the-art methods. The creation of the UniHAGen-Bench benchmark, which includes over a thousand samples across diverse scenarios, is a significant contribution that facilitates fair evaluation and comparison in the field. The combination of objective metrics and human evaluations provides a robust assessment of the model's performance, although specific metrics used for evaluation could be elaborated further for clarity.
The paper provides a project page with a URL, but lacks detailed implementation specifics in the text that would enhance reproducibility. While the methodology is sound, the absence of code or detailed experimental setups may hinder other researchers from replicating the results.
One limitation is the lack of detailed discussion on the computational resources required for training the OmniSonic model, which could be a barrier for some researchers. Additionally, while the model excels in generating audio from video and text, its performance in more nuanced or complex auditory environments remains to be fully explored.
The ability to generate holistic audio from multimodal inputs has significant implications for various applications, including film and video production, virtual reality, and assistive technologies for the hearing impaired. The advancements in audio generation could lead to more immersive experiences in entertainment and education, making this research highly relevant to both academic and industry stakeholders.
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
Primary: CUHK MMLab
All Institutions: CUHK MMLab
The main contribution of this paper is the development of AURA, a novel framework that enables continuous video stream processing for real-time question answering and proactive interaction. This work significantly advances the field of VideoLLMs by addressing key limitations of existing systems and providing a robust platform for future research and applications.
The AURA framework presents a comprehensive end-to-end approach for real-time video understanding and interaction. It effectively integrates context management and data construction, which are crucial for maintaining continuity in long-horizon interactions. The methodology is well-structured, addressing the limitations of existing VideoLLMs by providing a unified model that supports both real-time question answering and proactive responses. The incorporation of ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems at a reasonable frame rate demonstrates a practical application of the proposed methods.
The experiments conducted show that AURA achieves state-of-the-art performance on relevant streaming benchmarks, which is a significant accomplishment. The evaluation metrics used to assess performance should ideally include both subjective and objective measures to provide a comprehensive view of the model's capabilities. However, the paper could benefit from a more detailed breakdown of the datasets used and their characteristics, as well as comparisons with other contemporary systems.
The paper mentions the release of the AURA model and a real-time inference framework, which is a positive step towards reproducibility. However, further details regarding the training process, hyperparameters, and the specific configurations used in experiments would enhance reproducibility efforts. Clear documentation and access to code would be essential for other researchers to replicate the findings.
One limitation is the reliance on specific hardware (80G accelerators) for achieving the reported performance, which may not be accessible to all researchers. Additionally, while the system is designed for real-time interaction, the practical implications of latency and response times in diverse real-world scenarios are not fully explored. The paper could also discuss potential biases in the data or limitations in the model's understanding of complex interactions.
AURA has significant potential applications in various fields, including education, healthcare, and entertainment, where real-time video interaction is valuable. By enabling continuous observation and interaction, it could enhance user experiences in virtual environments and assistive technologies. The release of the model and framework could foster further research and development in real-time video understanding systems.
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset aligned with reliable and fine-grained natural language annotations. To tackle this, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized across six complementary dimensions, including sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
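The six annotation dimensions named in the abstract can be pictured as one structured record per utterance. The sketch below is a hypothetical schema for such a record; the field names and types are assumptions for illustration, not the dataset's actual format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AffectAnnotation:
    """Hypothetical per-utterance record mirroring the six AffectSpeech
    dimensions; field names/types are illustrative assumptions."""
    sentiment_polarity: str         # e.g. "positive", "negative", "neutral"
    emotion_caption: str            # open-vocabulary natural-language description
    intensity_level: int            # e.g. 1 (mild) .. 5 (intense)
    prosodic_attributes: List[str]  # e.g. ["fast tempo", "rising pitch"]
    prominent_segments: List[str]   # emotionally salient spans of the utterance
    semantic_content: str           # transcript of what was said

ann = AffectAnnotation(
    sentiment_polarity="negative",
    emotion_caption="frustrated, with an undertone of weariness",
    intensity_level=4,
    prosodic_attributes=["tense voice", "clipped phrasing"],
    prominent_segments=["I told you already"],
    semantic_content="I told you already, the meeting was moved.",
)
```

Keeping the categorical, free-text, and span-level fields in one record is what enables the multi-granular modeling the abstract describes.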
Primary: Southeast University
All Institutions: Southeast University, Shenzhen Loop Area Institute, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Technical University of Munich, Imperial College London
The paper presents AffectSpeech, a large-scale emotional speech dataset with fine-grained textual descriptions, addressing the limitations of traditional emotion representation methods. The innovative methodology and comprehensive evaluation underscore its potential to advance research in speech emotion recognition and synthesis, making it a valuable resource for the community.
The paper introduces a novel human-LLM collaborative annotation pipeline that enhances the quality and richness of emotional speech data. By integrating algorithmic pre-labeling, multi-LLM description generation, and human verification, the authors effectively address the challenges of annotation scalability and reliability. The dataset's multi-dimensional annotations across sentiment polarity, emotional intensity, prosodic attributes, and semantic content are well-structured, enabling comprehensive modeling of emotional speech. The methodology is innovative and well-articulated, contributing significantly to the field of speech emotion recognition and synthesis.
The experimental results demonstrate the effectiveness of the AffectSpeech dataset in improving the performance of speech emotion captioning and synthesis models. The authors provide thorough evaluations using both objective metrics (e.g., emotion accuracy, prosody accuracy) and subjective assessments (e.g., human preference tests). The results consistently show that models trained on AffectSpeech outperform those trained on existing datasets, validating the dataset's utility. The comprehensive evaluation across multiple models and tasks strengthens the paper's claims about the dataset's impact.
The paper provides detailed descriptions of the dataset construction, annotation process, and experimental setup, which facilitates reproducibility. However, the actual implementation details, such as specific model architectures and training configurations, could be more explicitly outlined to enhance reproducibility further. The availability of the dataset and demo on GitHub is a positive aspect for researchers looking to replicate the study.
While the dataset is extensive and well-annotated, potential limitations include the reliance on human annotators, which may introduce variability in the quality of annotations. Additionally, the dataset is currently limited to English, which may restrict its applicability in multilingual contexts. Future work should consider expanding the dataset to include diverse languages and dialects.
The AffectSpeech dataset has significant implications for various applications, including empathetic conversational agents, affect-aware human-computer interaction systems, and emotional speech synthesis in entertainment and education. By providing a more nuanced representation of emotional speech, it can enhance user experiences in interactive systems and contribute to advancements in affective computing.
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and cross-modal pairs include spurious co-occurrences. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
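The local-level objective above, affinity-weighted soft top-k InfoNCE with teacher-mined soft positives, can be sketched numerically as follows. This is an illustrative NumPy reconstruction from the abstract's description, not the authors' code; the function name, temperature, and top-k convention are assumptions.

```python
import numpy as np

def soft_topk_infonce(student_a, student_v, teacher_a, teacher_v, k=2, tau=0.1):
    """Illustrative affinity-weighted soft top-k InfoNCE, sketched from the
    abstract (not the authors' implementation). Rows of each array are
    L2-normalized embeddings for one batch."""
    # Teacher mines soft positives: the k most similar cross-modal pairs per anchor.
    t_sim = teacher_a @ teacher_v.T                # (B, B) teacher similarities
    topk = np.argsort(-t_sim, axis=1)[:, :k]       # indices of k soft positives
    rows = np.arange(t_sim.shape[0])[:, None]
    # Softmax over the selected affinities gives multi-positive target weights.
    w = np.exp(t_sim[rows, topk] / tau)
    targets = np.zeros_like(t_sim)
    targets[rows, topk] = w / w.sum(axis=1, keepdims=True)
    # Student: temperature-scaled log-softmax over cross-modal similarities.
    s_sim = student_a @ student_v.T / tau
    log_p = s_sim - np.log(np.exp(s_sim).sum(axis=1, keepdims=True))
    # Cross-entropy against the teacher's soft multi-positive targets.
    return float(-(targets * log_p).sum(axis=1).mean())

rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16)); a /= np.linalg.norm(a, axis=1, keepdims=True)
v = rng.normal(size=(8, 16)); v /= np.linalg.norm(v, axis=1, keepdims=True)
loss = soft_topk_infonce(a, v, a, v)
```

Unlike standard InfoNCE, which treats only the diagonal pair as positive, the soft top-k targets let several semantically similar instances contribute, which is how the multi-positive relational structure is preserved.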
Primary: KDDI Research, Inc.
All Institutions: KDDI Research, Inc.
The main contribution of this paper is the introduction of HSC-MAE, a novel hierarchical framework for unsupervised audio-visual representation learning that effectively addresses the challenges of weakly paired data through a dual-path teacher-student architecture. This work represents a significant step forward in the field, providing a robust methodology that enhances the alignment of audio and visual modalities while demonstrating strong empirical results.
The proposed HSC-MAE framework introduces a dual-path teacher-student architecture that innovatively integrates three levels of semantic correlation—global, local, and sample-level. This hierarchical approach is a significant advancement in unsupervised audio-visual representation learning, as it effectively addresses the challenges posed by weakly paired data and spurious co-occurrences. The use of DCCA for global-level alignment and the introduction of teacher-mined soft top-k affinities for local-level correlation are particularly noteworthy, as they enhance the robustness of the learned representations. The methodology is well-structured and demonstrates a clear understanding of the complexities involved in multimodal learning.
The experiments conducted on the AVE and VEGAS datasets provide strong empirical validation of the proposed method. The reported substantial improvements in mean Average Precision (mAP) over existing unsupervised baselines indicate that HSC-MAE is effective in producing high-quality audio-visual embeddings. However, the paper could benefit from a more detailed comparison with state-of-the-art methods and additional qualitative analyses to further substantiate the claims made regarding the quality of the learned representations.
The paper lacks detailed implementation specifics, such as hyperparameter settings, training protocols, and data preprocessing steps, which are crucial for reproducibility. Including a supplementary material section or a dedicated reproducibility appendix would enhance the paper's value and allow other researchers to replicate the results more easily.
One limitation of the study is the reliance on weakly paired data, which may not fully capture the complexity of real-world audio-visual relationships. Additionally, while the proposed method shows promise, it would be beneficial to explore its performance across a wider range of datasets and tasks to assess its generalizability. The paper also does not address potential computational overheads associated with the dual-path architecture, which may limit its applicability in resource-constrained environments.
The HSC-MAE framework has the potential to significantly advance the field of unsupervised learning in audio-visual contexts, with applications in areas such as multimedia content analysis, automated video tagging, and improved human-computer interaction systems. By enhancing the quality of multimodal embeddings, this work could facilitate more sophisticated applications in AI-driven technologies, including virtual reality and augmented reality systems.
User-defined keyword spotting (KWS) without domain-specific pre-labeled training data is fundamental to building adaptable and personalized voice interfaces. However, such systems still face significant challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to a high false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we propose MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90% while significantly reducing FAR to 0.007% on the AMI dataset. Beyond its strong performance, MALEFA demonstrates high computational efficiency and can readily support real-time deployment on resource-constrained devices.
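The two metrics quoted in the abstract, accuracy and false alarm rate, can be computed from per-trial keyword decisions as below. This is a generic illustration of the standard definitions, not the paper's evaluation code; the function name and label convention (1 = keyword present) are assumptions.

```python
def kws_metrics(decisions, labels):
    """Accuracy and false alarm rate (FAR) from binary keyword decisions.
    Generic metric definitions, not MALEFA's evaluation code.
    decisions/labels: 1 = keyword (predicted/present), 0 = non-keyword."""
    assert len(decisions) == len(labels) and labels, "non-empty, aligned inputs"
    correct = sum(d == y for d, y in zip(decisions, labels))
    # FAR is computed over negative trials only: fraction wrongly flagged.
    neg_decisions = [d for d, y in zip(decisions, labels) if y == 0]
    false_alarms = sum(neg_decisions)
    accuracy = correct / len(labels)
    far = false_alarms / len(neg_decisions) if neg_decisions else 0.0
    return accuracy, far

# One false alarm among four negative trials, four of five trials correct.
acc, far = kws_metrics([1, 0, 1, 0, 0], [1, 0, 0, 0, 0])
```

Normalizing false alarms by the negative trials (rather than all trials) is what makes a figure like 0.007% meaningful independently of how many keyword-bearing utterances the test set contains.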
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MALEFA, a lightweight zero-shot keyword spotting framework that effectively reduces false alarms while maintaining high accuracy through innovative multi-granularity contrastive learning and a tailored loss function. This work significantly advances the state of the art in keyword spotting, particularly in resource-constrained environments, and addresses critical challenges in distinguishing similar acoustic keywords.
The proposed MALEFA framework integrates multi-granularity contrastive learning with a novel false alarm-aware loss, which is a significant advancement in the field of zero-shot keyword spotting (ZSKWS). The methodology effectively combines utterance-level and phoneme-level learning objectives, which allows for improved alignment and accuracy in distinguishing acoustically similar keywords. The use of cross-attention mechanisms enhances the model's ability to align audio and text representations, thereby addressing a critical challenge in KWS systems. The design is lightweight, making it suitable for real-time deployment on resource-constrained devices, which is a notable practical consideration.
The experiments conducted on four public benchmark datasets demonstrate the effectiveness of MALEFA, achieving high accuracy (90%) and a remarkably low false alarm rate (0.007%) on the AMI dataset. The ablation studies provide strong evidence for the contributions of each component of the model, confirming that the integration of the proposed loss functions and learning objectives is essential for achieving state-of-the-art performance. The comparisons with existing models highlight MALEFA's robustness and efficiency, making it a competitive solution in the field.
The paper provides sufficient implementation details, including the architecture, training criteria, and experimental setup, which enhances reproducibility. However, the lack of specific citations for some methodologies and datasets may hinder complete reproducibility for external researchers. The use of a GitHub repository for the code is a positive aspect, allowing others to access and verify the implementation.
One limitation of the study is the reliance on specific datasets for evaluation, which may not fully represent the diversity of real-world scenarios in keyword spotting. Additionally, while the model shows promise in reducing false alarms, further exploration of its performance across different languages and accents would be beneficial. The paper also does not address potential biases in the training data, which could affect the model's generalization capabilities.
The MALEFA framework has significant implications for the development of adaptable and personalized voice interfaces, particularly in applications where user-defined keywords are essential. Its lightweight nature makes it suitable for deployment on various devices, including smartphones and smart home assistants, potentially enhancing user experience in everyday interactions. The approach could also pave the way for further research in zero-shot learning and its applications in other domains.