Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
Primary: CUHK MMLab
All Institutions: CUHK MMLab
The main contribution of this paper is the development of AURA, a novel framework that enables continuous video stream processing for real-time question answering and proactive interaction. This work significantly advances the field of VideoLLMs by addressing key limitations of existing systems and providing a robust platform for future research and applications.
The AURA framework presents a comprehensive end-to-end approach for real-time video understanding and interaction. It effectively integrates context management and data construction, which are crucial for maintaining continuity in long-horizon interactions. The methodology is well-structured, addressing the limitations of existing VideoLLMs by providing a unified model that supports both real-time question answering and proactive responses. The incorporation of ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems at a reasonable frame rate demonstrates a practical application of the proposed methods.
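The always-on loop with bounded context described above can be sketched in a few lines. This is a hypothetical illustration, not AURA's implementation: the buffer size, the `process_frame` stand-in, and the every-8th-frame trigger are all assumptions standing in for the model's actual context management and proactive-response policy.

```python
from collections import deque

# Hypothetical sketch of an always-on streaming loop with a bounded
# visual-context buffer. MAX_CONTEXT_FRAMES and process_frame are
# illustrative names, not AURA's API.
MAX_CONTEXT_FRAMES = 16  # assumed rolling-window size

def process_frame(context, frame):
    """Stand-in for the VideoLLM step: decide whether to respond."""
    # A real system would run the model here; we just flag every 8th frame.
    return f"proactive response at frame {frame}" if frame % 8 == 0 else None

def run_stream(frames):
    context = deque(maxlen=MAX_CONTEXT_FRAMES)  # oldest frames evicted automatically
    responses = []
    for frame in frames:
        context.append(frame)
        reply = process_frame(context, frame)
        if reply is not None:
            responses.append(reply)
    return responses, len(context)
```

The `deque(maxlen=...)` eviction is the simplest possible context-management policy; a real streaming VideoLLM would compress or summarize old frames rather than drop them.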
The experiments conducted show that AURA achieves state-of-the-art performance on relevant streaming benchmarks, which is a significant accomplishment. The evaluation metrics used to assess performance should ideally include both subjective and objective measures to provide a comprehensive view of the model's capabilities. However, the paper could benefit from a more detailed breakdown of the datasets used and their characteristics, as well as comparisons with other contemporary systems.
The paper mentions the release of the AURA model and a real-time inference framework, which is a positive step towards reproducibility. However, further details regarding the training process, hyperparameters, and the specific configurations used in experiments would enhance reproducibility efforts. Clear documentation and access to code would be essential for other researchers to replicate the findings.
One limitation is the reliance on specific hardware (80G accelerators) for achieving the reported performance, which may not be accessible to all researchers. Additionally, while the system is designed for real-time interaction, the practical implications of latency and response times in diverse real-world scenarios are not fully explored. The paper could also discuss potential biases in the data or limitations in the model's understanding of complex interactions.
AURA has significant potential applications in various fields, including education, healthcare, and entertainment, where real-time video interaction is valuable. By enabling continuous observation and interaction, it could enhance user experiences in virtual environments and assistive technologies. The release of the model and framework could foster further research and development in real-time video understanding systems.
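Placeholder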
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.
Primary: Northwestern Polytechnical University
All Institutions: Nanjing University, Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The paper presents Speaker-Reasoner, an innovative Speech LLM that effectively addresses the challenges of timestamped speaker-attributed ASR through agentic multi-turn reasoning and a speaker-aware cache. This work significantly advances the state of the art in multi-speaker audio understanding, demonstrating substantial improvements over existing models and offering valuable insights for future research in the field.
The methodology presented in the paper is innovative, leveraging an end-to-end Speech LLM architecture that integrates multi-turn temporal reasoning with a speaker-aware context cache. The iterative global-to-local processing approach is a significant departure from traditional single-pass models, addressing the challenges of overlapping speech and rapid turn-taking effectively. The three-stage progressive training strategy is well-conceived, allowing the model to learn complex interactions and maintain speaker consistency across long-form audio. However, the paper could benefit from a more detailed explanation of the training process and the specific mechanisms used for temporal reasoning.
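The iterative global-to-local processing can be made concrete with a schematic loop: a first global pass predicts segment boundaries, then each segment is analyzed locally while a speaker cache persists identities across turns. Every function here is an illustrative stand-in, not the paper's implementation; a real system would query the Speech LLM at each turn, and the speaker-aware cache would hold embeddings rather than toy labels.

```python
# Schematic sketch of the global-to-local, multi-turn reasoning loop.
# The 4-second granularity and 2-speaker alternation are assumptions.

def predict_boundaries(audio):
    """Turn 1 (global pass): predict temporal segment boundaries (stub)."""
    step = 4.0  # assumed segment granularity in seconds
    bounds, t = [], 0.0
    while t < audio["duration"]:
        bounds.append((t, min(t + step, audio["duration"])))
        t += step
    return bounds

def transcribe(audio):
    """Later turns (local passes): analyze each segment, reusing cached speakers."""
    speaker_cache, results = {}, []
    for idx, (start, end) in enumerate(predict_boundaries(audio)):
        spk = speaker_cache.setdefault(idx % 2, f"spk{idx % 2}")  # toy turn-taking
        results.append({"start": start, "end": end, "speaker": spk, "text": "(stub)"})
    return results
```

The point of the structure is that boundary prediction and segment analysis are separate turns over the same audio, which is what lets the model revisit overlapping or ambiguous regions.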
The experiments are robust, utilizing two well-defined datasets (AliMeeting and AISHELL-4) that reflect real-world challenges in multi-speaker scenarios. The reported results show consistent improvements over strong baselines, particularly in metrics relevant to speaker attribution and transcription accuracy. The use of multiple evaluation metrics (DER, CER, cpCER) provides a comprehensive view of the model's performance. However, the paper lacks a thorough comparison with other state-of-the-art models beyond the immediate baselines, which would strengthen the claims of superiority.
The paper provides sufficient details regarding the model architecture, training procedures, and datasets, which are crucial for reproducibility. The use of established frameworks (e.g., MS-Swift, Megatron-LM) and the clear description of the training stages contribute positively to reproducibility. However, the absence of publicly available code or a demo limits the ease of replication by other researchers.
One limitation of the proposed model is its reliance on the quality of the training data, which may not generalize well to all multi-speaker environments. Additionally, while the speaker-aware cache is a novel approach, it may introduce complexity in managing speaker identities over long recordings. The performance on long-form audio without manual segmentation could also be a concern, as it may not perform as well in highly dynamic environments.
The implications of this research are significant, particularly for applications in meeting transcription, intelligent assistants, and any domain requiring accurate speaker attribution in multi-speaker contexts. The advancements in handling overlapping speech and rapid turn-taking could enhance the usability of speech recognition systems in real-world scenarios, leading to improved accessibility and communication tools.
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
Primary: Ben Gurion University, Be'er Sheva, Israel
All Institutions: Ben Gurion University, University of Haifa
The paper presents a novel split-and-conquer framework for detecting partial deepfake speech, significantly advancing the field of audio deepfake detection through improved localization and classification methodologies. The comprehensive evaluation of the proposed method demonstrates its potential to enhance security in voice-based systems while addressing the challenges posed by partial manipulations in speech.
The proposed split-and-conquer framework effectively decomposes the complex task of partial deepfake speech detection into two distinct stages: boundary detection and segment-level classification. This separation allows for a more focused learning objective, enhancing the model's ability to localize manipulated regions accurately. The use of a dedicated boundary detector to identify transition points is a significant methodological innovation, as it reduces the ambiguity and noise typically associated with joint localization and classification tasks. The introduction of a reflection-based multi-length training strategy is also noteworthy, as it generates diverse feature-space representations, improving robustness and performance across various temporal resolutions.
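The reflection-based multi-length idea — mapping one variable-duration segment to several fixed input lengths by mirroring the signal — can be sketched as follows. The target lengths and the whole-signal reflection (which repeats the boundary sample) are assumptions for illustration; the paper's exact scheme may differ.

```python
import numpy as np

# Sketch of reflection-based resizing of a variable-length segment to a
# fixed input length, as in the multi-length training strategy.

def reflect_to_length(segment, target_len):
    """Extend (or crop) a 1-D segment to target_len by repeated reflection."""
    seg = np.asarray(segment, dtype=np.float64)
    out, forward = seg, False  # next chunk to append is the reversed signal
    while out.size < target_len:
        out = np.concatenate([out, seg if forward else seg[::-1]])
        forward = not forward
    return out[:target_len]

# One variable-duration segment mapped to several fixed lengths:
seg = np.arange(5, dtype=np.float64)          # stand-in for an audio segment
views = [reflect_to_length(seg, n) for n in (4, 8, 12)]
```

Reflection avoids the hard discontinuity that zero-padding introduces at the segment edge, which is presumably why it yields more natural feature-space representations.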
The experiments conducted on the PartialSpoof and Half-Truth datasets demonstrate state-of-the-art performance, showcasing the effectiveness of the proposed method. The results indicate substantial improvements in both detection accuracy and localization capabilities, particularly at stricter evaluation criteria. The comprehensive evaluation across multiple configurations, feature extractors, and augmentation strategies provides a robust assessment of the method's performance, highlighting its generalization capabilities and robustness to boundary estimation errors.
The paper provides detailed descriptions of the experimental setup, including model architectures, training procedures, and evaluation metrics, which enhances reproducibility. The availability of a project repository on GitHub further supports reproducibility efforts, allowing other researchers to replicate the experiments and build upon the proposed framework.
Despite the strengths of the proposed method, there are notable limitations. The reliance on boundary prediction can introduce errors that propagate through the classification stage, particularly in challenging transition regions. Additionally, the assumption that manipulated content can be approximated by piecewise-uniform segments may not fully capture more gradual or subtle manipulations, which could limit the method's applicability in real-world scenarios.
The implications of this research are significant, particularly in the context of security-critical systems that rely on voice-based authentication and speaker verification. The ability to detect partial deepfake speech can enhance the integrity of communication systems and mitigate risks associated with audio deepfakes. Furthermore, the methodological advancements presented in this work may inspire further research in audio forensics and anti-spoofing technologies.
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
Primary: Shanghai Jiao Tong University
All Institutions: EPIC Lab, Shanghai Jiao Tong University, Xidian University, HKUST (GZ)
The main contribution of this paper is the introduction of AudioKV, a novel framework for efficient KV cache management in audio-language models, which significantly enhances performance while reducing memory usage. This work represents a meaningful advancement in the field of audio processing, demonstrating innovative methodologies that address critical challenges in deploying large-scale models effectively.
The proposed methodology, AudioKV, innovatively addresses the inefficiencies of Key-Value (KV) cache management in Large Audio-Language Models (LALMs) by introducing a dual mechanism: audio-aware KV cache allocation and Spectral Score Smoothing (SSS). The former identifies and prioritizes audio-critical attention heads based on their relevance to acoustic modeling, while the latter employs a frequency-domain approach to stabilize importance score estimation. This dual approach is particularly effective in the audio domain, where temporal continuity is crucial, showcasing a thoughtful adaptation of existing techniques to a new modality.
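The frequency-domain smoothing idea behind SSS — low-pass filtering a sequence of per-token importance scores to suppress high-frequency noise and recover the global trend — can be sketched with a plain FFT. The cutoff ratio here is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

# Minimal sketch of FFT-based low-pass smoothing of per-token importance
# scores, in the spirit of Spectral Score Smoothing (SSS).

def spectral_smooth(scores, keep_ratio=0.1):
    """Zero out high-frequency bins of the score sequence and invert."""
    spectrum = np.fft.rfft(np.asarray(scores, dtype=np.float64))
    cutoff = max(1, int(len(spectrum) * keep_ratio))  # low-frequency bins kept
    spectrum[cutoff:] = 0.0
    return np.fft.irfft(spectrum, n=len(scores))  # smoothed global trend

# Usage: a slow trend corrupted by high-frequency noise is largely recovered.
t = np.linspace(0.0, 1.0, 256)
trend = np.sin(2 * np.pi * t)
noisy = trend + 0.3 * np.sin(2 * np.pi * 60 * t)
smoothed = spectral_smooth(noisy, keep_ratio=0.05)
```

Selecting KV-cache entries by `smoothed` rather than `noisy` scores would avoid evicting tokens whose raw attention score dipped for one step, which is the balanced-token-selection behavior the paper attributes to SSS.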
The experiments conducted across various benchmarks, including Automatic Speech Recognition (ASR) and Speech Translation (ST), demonstrate that AudioKV significantly outperforms existing methods, particularly under aggressive compression scenarios. The results indicate not only improved accuracy but also enhanced robustness against performance degradation, which is critical for practical applications. The use of diverse datasets strengthens the validity of the findings, although the paper could benefit from more extensive comparisons with a broader range of state-of-the-art methods.
The paper mentions that the code will be released upon acceptance, which is a positive aspect for reproducibility. However, the lack of a demo URL or a project repository at this stage limits immediate access to the implementation details. The methodology is described in sufficient detail to allow for replication, but actual code availability will be crucial for broader adoption and validation of the results.
One limitation is the potential for overfitting to specific datasets, as the performance improvements are primarily demonstrated on selected benchmarks. Additionally, while the method shows promise in maintaining accuracy at high compression ratios, the paper does not thoroughly explore the trade-offs involved in different compression strategies or the impact on latency and real-time processing capabilities.
The implications of this work extend to various applications in speech processing and multimodal AI systems, where efficient inference is paramount. By improving the efficiency of LALMs, this research could facilitate the deployment of advanced audio processing systems in resource-constrained environments, such as mobile devices or real-time applications.
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as noise or distortion within the background, dynamically adjusting its attention over time. Inspired by the recent success of attention models, this study introduces a dual-path attention module in the bottleneck layer of a concurrent speech enhancement network. Our study proposes an attention-based dual-path RNN (DAT-RNN), which, when combined with the modified complex-valued frequency transformation network (CFTNet), forms the DAT-CFTNet. This attention mechanism allows for precise differentiation between speech and noise in time-frequency (T-F) regions of spectrograms, optimizing both local and global context information processing in the CFTNet. Our experiments suggest that the DAT-CFTNet leads to consistently improved performance over the existing models, including CFTNet and DCCRN, in terms of speech intelligibility and quality. Moreover, the proposed model exhibits superior performance in enhancing speech intelligibility for cochlear implant (CI) recipients, who are known to have severely limited T-F resolution. Hearing-restoration gains (e.g., >10%) observed in CI listener studies in noisy settings show that the proposed solution can suppress non-stationary noise while avoiding the musical artifacts often seen in traditional speech enhancement methods. The implementation of the proposed model will be publicly available.
Primary: Chittagong University of Engineering and Technology
All Institutions: Chittagong University of Engineering and Technology
The main contribution of this research is the introduction of the DAT-CFTNet, which effectively enhances speech intelligibility for cochlear implant users through an innovative dual-path attention mechanism. This work represents a significant step forward in speech enhancement technologies, particularly in challenging acoustic environments.
The proposed methodology introduces a novel dual-path attention mechanism integrated into a complex-valued frequency transformation network (CFTNet), which is a significant advancement in the field of speech enhancement, particularly for cochlear implant users. The combination of intra-chunk and inter-chunk RNNs with attention modules allows for enhanced modeling of speech and noise dynamics in time-frequency representations. The detailed architecture and the rationale behind the design choices are well articulated, showcasing a thoughtful approach to addressing the limitations of existing models.
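The dual-path pattern of intra-chunk and inter-chunk processing reduces to a reshape trick: a [T, F] feature map is split into chunks, one operation runs along the intra-chunk (local) axis and another along the inter-chunk (global) axis. The sketch below shows only the shape manipulation; the mean-centering steps are placeholders where the model's intra- and inter-chunk RNNs with attention would run.

```python
import numpy as np

# Sketch of dual-path chunking over a [T, F] time-frequency representation.
# The centering ops are stand-ins for the intra-/inter-chunk RNN+attention
# blocks; shapes are the point of this illustration.

def dual_path_pass(x, chunk_len):
    T, F = x.shape
    pad = (-T) % chunk_len
    x = np.pad(x, ((0, pad), (0, 0)))          # pad so T divides evenly
    chunks = x.reshape(-1, chunk_len, F)       # [n_chunks, chunk_len, F]
    intra = chunks - chunks.mean(axis=1, keepdims=True)  # local-context op
    inter = intra - intra.mean(axis=0, keepdims=True)    # global-context op
    return inter.reshape(-1, F)[:T]            # back to [T, F]

feats = np.random.default_rng(0).normal(size=(10, 3))  # toy [T, F] features
out = dual_path_pass(feats, chunk_len=4)
```

Alternating the processing axis is what lets a dual-path model cover both short-range speech dynamics and long-range noise structure at modest sequence lengths.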
The experiments are robust, employing a comprehensive dataset that includes various noise conditions and SNR levels. The evaluation metrics used (STOI, PESQ, SISDR) are appropriate for assessing speech intelligibility and quality. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and a discussion on the statistical significance of the results.
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of a specific dataset and the architecture of the model, there are no code repositories or links to a demo that would allow other researchers to replicate the findings. Providing access to the model and training scripts would greatly enhance reproducibility.
One limitation is the reliance on objective metrics without a thorough subjective evaluation involving human listeners. While objective scores are important, subjective assessments are crucial for applications in speech enhancement, especially for cochlear implant users. Additionally, the model's complexity may limit its applicability in real-time scenarios, which is a critical factor for practical implementations.
The proposed DAT-CFTNet has the potential to significantly improve the quality of life for cochlear implant recipients by enhancing speech intelligibility in noisy environments. This advancement could lead to better communication and social interactions for individuals with hearing impairments. The public availability of the model also encourages further research and development in the field.
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
Primary: Georgia Institute of Technology
All Institutions: Georgia Institute of Technology, University of Amsterdam
This paper presents a novel automatic framework for detecting speaker drift in synthesized speech, bridging geometric signal analysis with LLM-based perceptual reasoning. The comprehensive methodology, combined with strong experimental validation, positions this work as a significant contribution to the field of audio and speech synthesis, addressing a critical challenge in TTS systems.
The proposed methodology introduces a novel framework for detecting speaker drift in synthesized speech by formulating it as a binary classification task. The use of cosine similarity to assess speaker identity consistency is theoretically grounded, and the integration of large language models (LLMs) for perceptual reasoning is innovative. The construction of a synthetic benchmark dataset with human-validated annotations further strengthens the methodology, allowing for systematic evaluation of the proposed approach. However, the reliance on synthetic data may limit the generalizability of the findings.
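The geometric half of the pipeline — cosine similarity between unit-normalized embeddings of consecutive overlapping segments — is simple to sketch. The embeddings are assumed to come from any speaker encoder, and the threshold is illustrative, not the paper's value; the LLM reasoning stage that consumes these similarities is not reproduced here.

```python
import numpy as np

# Minimal sketch of cosine-based drift scoring over segment embeddings.

def drift_score(embeddings):
    """Return the minimum cosine similarity between consecutive segments."""
    e = np.asarray(embeddings, dtype=np.float64)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)  # project onto unit sphere
    sims = np.sum(e[:-1] * e[1:], axis=1)             # consecutive cosine sims
    return float(sims.min())

def has_drift(embeddings, threshold=0.7):
    """Flag an utterance whose identity drifts below the assumed threshold."""
    return drift_score(embeddings) < threshold
```

Using the minimum rather than the mean similarity makes the detector sensitive to a drift that occurs in only one part of the utterance, which matches the paper's framing of drift as a within-utterance phenomenon.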
The experimental setup is robust, utilizing a well-defined dataset and comparing the proposed method against fixed-threshold and PCA-based baselines. The results demonstrate a significant improvement in performance metrics (F1 score) when using the LLM-driven approach, indicating the effectiveness of the proposed method. The ablation studies provide valuable insights into the impact of different design choices on performance, reinforcing the validity of the findings.
While the paper provides a detailed description of the methodology and experimental setup, the absence of a publicly available code repository or dataset limits reproducibility. Future work should include making the dataset and code accessible to facilitate further research in this area.
One notable limitation is the reliance on synthetic data for training and evaluation, which may not fully capture the complexities of real-world speaker drift scenarios. Additionally, the framework's performance may vary with different TTS models, and further validation on diverse datasets is needed to establish its robustness.
The detection of speaker drift has significant implications for improving the quality and coherence of synthesized speech in various applications, including virtual assistants and interactive dialogue systems. By addressing this underexplored issue, the work contributes to enhancing user experience in TTS systems, paving the way for more reliable and natural-sounding synthetic speech.
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University, Fudan University, Wheatland Culture and Media Ltd.
This paper presents a significant advancement in controllable singing style conversion through innovative methodologies that address key challenges in the field. The combination of a boundary-aware semantic bottleneck, explicit technique control, and high-frequency band completion strategies demonstrates a comprehensive approach to improving the quality and fidelity of singing voice conversion systems.
The proposed methodology introduces a boundary-aware semantic bottleneck that effectively mitigates style leakage in singing voice conversion, which is a significant challenge in the field. The explicit frame-level technique matrix enhances control over dynamic styles, while the high-frequency band completion strategy addresses data scarcity issues. The integration of these components demonstrates a thoughtful approach to improving the quality and fidelity of converted singing voices, making the methodology both innovative and practical.
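The core of the boundary-aware bottleneck — pooling frame-level encoder features over phoneme spans so that only span-level content survives — can be sketched directly. Span boundaries are assumed given (e.g., from forced alignment); the function and variable names are illustrative, not the paper's.

```python
import numpy as np

# Sketch of span pooling over frame-level encoder features: frames inside
# each phoneme span are mean-pooled into one vector, discarding frame-level
# detail that could carry residual source style.

def pool_spans(frames, spans):
    """frames: [T, D] features; spans: list of (start, end) frame indices."""
    frames = np.asarray(frames, dtype=np.float64)
    return np.stack([frames[s:e].mean(axis=0) for s, e in spans])

feats = np.arange(12, dtype=np.float64).reshape(6, 2)  # 6 frames, 2 dims
spans = [(0, 2), (2, 5), (5, 6)]                       # three phoneme spans
pooled = pool_spans(feats, spans)                      # shape (3, 2)
```

Collapsing each span to one vector removes the within-phoneme trajectory, which is one plausible mechanism by which the bottleneck suppresses source-style leakage while keeping linguistic content.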
The experimental evaluation is robust, utilizing subjective metrics such as Mean Opinion Score (MOS) to assess naturalness and similarity, which are critical for audio applications. The results indicate that the proposed system outperforms other submissions in naturalness while maintaining competitive performance in speaker similarity and technique control. The ablation studies further validate the effectiveness of the proposed methods, providing a clear understanding of their contributions.
The paper includes sufficient implementation details and provides a GitHub repository for code access, which enhances reproducibility. The use of standard datasets and well-defined training protocols also supports the replicability of the results.
One limitation is the reliance on the official SVCC2025 dataset, which may not generalize well to other datasets or real-world applications. Additionally, while the system achieves high naturalness, there is a noted gap in identity similarity compared to top-performing systems that utilized larger external datasets.
The advancements in controllable singing style conversion have significant implications for music production, voice synthesis, and entertainment industries. The ability to manipulate singing styles with high fidelity can enhance creative expression and provide new tools for artists and producers.
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks have been recognized as potential security risks for biometric systems. However, most research on morph attacks has focused on biometric modalities that operate within the image domain, such as the face, fingerprints, and iris. In this work, we introduce Time-domain Voice Identity Morphing (TD-VIM), a novel approach for voice-based biometric morphing. This method enables the blending of voice characteristics from two distinct identities at the signal level, creating morphed samples that present a high vulnerability for speaker verification systems. Leveraging the Multilingual Audio-Visual Smartphone database, our study created four distinct morphed signals based on morphing factors and evaluated their effectiveness using a comprehensive vulnerability analysis. To assess the security impact of TD-VIM, we benchmarked our approach using the Generalized Morphing Attack Potential (G-MAP) metric, measuring attack success across two deep-learning-based Speaker Verification Systems (SVS) and one commercial system, Verispeak. Our findings indicate that the morphed voice samples achieved a high attack success rate, with G-MAP values reaching 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios, at a false match rate of 0.1%.
Primary: IIT Kharagpur
All Institutions: IIT Kharagpur
This work introduces a novel morphing technique for voice biometrics that significantly enhances the potential for attacks on speaker verification systems. The comprehensive evaluation of the TD-VIM method across various devices and languages demonstrates its effectiveness and raises critical security concerns in the field of biometric authentication.
The proposed Time-Domain Voice Identity Morphing (TD-VIM) method innovatively performs morphing at the signal level, circumventing the limitations of previous feature-based approaches. By selecting portions of the voice signals and averaging them, the method achieves language and backbone independence, enhancing its applicability across diverse speaker verification systems. The methodology is well structured, with clear steps outlined for signal selection, preprocessing, and morphing, although the paper would benefit from more detailed mathematical formulations and justifications for the choices made in these steps.
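The signal-level blend the review describes can be sketched as a weighted average of two time-aligned waveforms. `morph_voices` and the `alpha` morphing factor are illustrative names, not the paper's notation, and the real method additionally selects and preprocesses signal portions before averaging.

```python
import numpy as np

def morph_voices(sig_a: np.ndarray, sig_b: np.ndarray, alpha: float) -> np.ndarray:
    """Blend two time-aligned, equal-length waveforms at the sample level.

    alpha = 0.0 returns sig_a unchanged; alpha = 1.0 returns sig_b;
    intermediate values produce a morph carrying traits of both identities.
    """
    if sig_a.shape != sig_b.shape:
        raise ValueError("signals must be time-aligned to the same length")
    return (1.0 - alpha) * sig_a + alpha * sig_b

# Four morphing factors, mirroring the four morphed signals in the study.
factors = [0.2, 0.4, 0.6, 0.8]
a = np.sin(np.linspace(0, 2 * np.pi, 16000))   # stand-in for speaker A
b = np.cos(np.linspace(0, 2 * np.pi, 16000))   # stand-in for speaker B
morphs = [morph_voices(a, b, f) for f in factors]
```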
The experiments are comprehensive, utilizing a robust dataset (MAVS) and multiple speaker verification systems (SVS) to evaluate the effectiveness of the TD-VIM approach. The use of the Generalized Morphing Attack Potential (G-MAP) metric provides a solid framework for quantifying the vulnerability of SVS to morphing attacks. Results indicate high attack success rates across different devices and languages, demonstrating the method's effectiveness. However, the paper could improve by including more comparative analyses with existing methods to highlight its advantages.
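As a rough illustration of what a morphing-attack-potential style metric measures (the actual G-MAP definition generalizes over attack variants and thresholds), the toy function below counts a morph as a success only if the verifier accepts it as both contributing identities at the chosen operating threshold; all names are hypothetical.

```python
def map_success_rate(scores_a, scores_b, threshold):
    """Fraction of morphed samples accepted as BOTH contributing identities.

    scores_a[i], scores_b[i]: verifier similarity of morph i against the
    enrolled templates of identity A and identity B; `threshold` is the
    verifier's decision threshold at the chosen false match rate.
    """
    hits = sum(1 for sa, sb in zip(scores_a, scores_b)
               if sa >= threshold and sb >= threshold)
    return hits / len(scores_a)

# Third morph fools the verifier only for identity B, so it does not count.
rate = map_success_rate([0.9, 0.8, 0.4], [0.85, 0.7, 0.9], threshold=0.6)
```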
The authors provide access to the source code and morphed samples upon request, which is a positive aspect for reproducibility. However, the paper lacks detailed instructions on how to replicate the experiments fully, such as specific configurations and parameter settings used during the experiments.
One limitation is the reliance on a specific dataset (MAVS), which may not generalize to all voice biometric systems. Additionally, the paper does not address potential ethical concerns related to the misuse of morphing techniques in biometric systems. The impact of different environmental factors on the morphing effectiveness is also not explored, which could affect real-world applications.
The findings of this research have significant implications for the security of voice biometric systems, particularly in sensitive applications like banking and finance. By highlighting vulnerabilities, the work encourages the development of more robust verification systems and raises awareness about the potential for morphing attacks. The proposed method could lead to advancements in biometric security measures, prompting further research into countermeasures against such vulnerabilities.
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the severe error accumulation inherent to autoregressive models, leading to poor music quality and structural integrity. In this paper, we propose the Anchored Cyclic Generation (ACG) paradigm, which relies on anchor features from already generated music to guide subsequent generation during the autoregressive process, effectively mitigating error accumulation. Based on the ACG paradigm, we further propose the Hierarchical Anchored Cyclic Generation (Hi-ACG) framework, which employs a systematic global-to-local generation strategy and is highly compatible with our specifically designed piano token, an efficient musical representation. Experimental results demonstrate that, compared to traditional autoregressive models, the ACG paradigm reduces the cosine distance between predicted feature vectors and ground-truth semantic vectors by an average of 34.7%. In long-sequence symbolic music generation tasks, the Hi-ACG framework significantly outperforms existing mainstream methods in both subjective and objective evaluations. Furthermore, the framework exhibits excellent task generalization, achieving superior performance on related tasks such as music completion.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to long-sequence symbolic music generation through the Anchored Cyclic Generation paradigm, demonstrating significant improvements in quality and structural integrity. The methodology is innovative and well-supported by experimental results, marking a meaningful contribution to the field of machine learning in music generation.
The paper introduces the Anchored Cyclic Generation (ACG) paradigm, which effectively addresses the error accumulation problem in autoregressive models for long-sequence symbolic music generation. The methodology is well-structured, employing a hierarchical approach through the Hi-ACG framework that combines global and local generation strategies. The use of a novel piano token representation enhances efficiency and interpretability. The proposed methods are theoretically sound, supported by mathematical analysis, and demonstrate a clear innovation in the field of music generation.
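The review does not specify the exact anchoring mechanism, but the overall shape of an anchored cyclic loop can be sketched as follows: after each committed segment, an anchor feature summarizing the output so far is recomputed and fed back as conditioning for the next segment. Everything here (`step_fn`, `anchor_fn`, the toy integer tokens) is an assumption for illustration, not the paper's actual model.

```python
def anchored_generate(step_fn, anchor_fn, seed, n_segments, seg_len):
    """Anchored cyclic generation sketch: each segment is produced
    autoregressively, then an anchor feature summarizing the music so far
    re-conditions the next segment, limiting error accumulation.

    step_fn(prev_token, anchor) -> next token
    anchor_fn(tokens)           -> anchor feature for tokens generated so far
    """
    tokens = list(seed)
    for _ in range(n_segments):
        anchor = anchor_fn(tokens)          # summarize committed output
        for _ in range(seg_len):
            tokens.append(step_fn(tokens[-1], anchor))
    return tokens

# Toy instantiation: tokens are ints in a 7-symbol vocabulary, the anchor is
# their running mean, and each step folds the anchor back into the sequence.
out = anchored_generate(
    step_fn=lambda prev, anchor: (prev + round(anchor)) % 7,
    anchor_fn=lambda toks: sum(toks) / len(toks),
    seed=[1, 2], n_segments=3, seg_len=4,
)
```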
The experimental evaluation is robust, utilizing both objective and subjective metrics to assess the performance of the proposed models against established baselines. The datasets used (MuseScore and POP909) are appropriate for the task, and the results indicate significant improvements in generation quality, as evidenced by a 34.7% reduction in cosine distance between predicted and ground-truth features. The comprehensive evaluation strategy enhances the credibility of the findings.
The paper provides sufficient details regarding the experimental setup, including model architecture, training procedures, and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing these resources to facilitate validation of results.
The paper acknowledges limitations in fine-grained control during generation and the potential loss of subtle timing nuances in the piano token representation. Additionally, the focus on piano music may restrict the applicability of the framework to other musical contexts. Future research should address these limitations by integrating more expressive tokens and extending the framework to multi-track music generation.
The proposed ACG paradigm has the potential to significantly advance the field of symbolic music generation, offering new avenues for creating high-quality, structurally coherent music. Its principles could be adapted to other long-sequence generation tasks beyond music, such as text generation and structured content synthesis, thereby broadening its impact across various domains.
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
Primary: National Taiwan University
All Institutions: National Taiwan University, NVIDIA Taiwan
The main contribution of this paper is the introduction of a joint fullband-subband modeling framework for high-resolution SingFake detection, which significantly enhances detection performance by leveraging the unique characteristics of singing voice audio. The methodology is innovative and addresses a pressing need in the field of audio forensics, making it a valuable addition to the literature.
The paper introduces a novel joint fullband-subband modeling framework, Sing-HiResNet, which effectively captures both global and localized spectral features for high-resolution SingFake detection. The methodology is well-structured, employing a two-phase approach that integrates fullband and subband models, and explores various fusion strategies to enhance detection performance. The use of high-resolution audio (44.1 kHz) is a significant advancement over conventional methods, and the systematic evaluation of subband contributions adds depth to the methodology. However, the paper could benefit from clearer explanations of the fusion strategies and their implications.
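One plausible reading of the fullband-subband design is a late-fusion scheme: split the spectrogram into frequency bands, score each band with its own expert, and combine the band scores with a fullband score. The sketch below uses stand-in experts and uniform fusion weights; none of these names come from the paper.

```python
import numpy as np

def subband_fusion_score(spec, n_bands, expert_fns, fullband_fn, weights):
    """Late-fusion sketch: split a (freq, time) spectrogram into n_bands
    equal frequency bands, score each with its own expert, and combine
    the per-band scores with a fullband score via a weighted sum."""
    bands = np.array_split(spec, n_bands, axis=0)
    scores = [fullband_fn(spec)] + [f(b) for f, b in zip(expert_fns, bands)]
    return float(np.dot(weights, scores))

spec = np.random.default_rng(0).random((128, 50))   # stand-in spectrogram
score = subband_fusion_score(
    spec, n_bands=4,
    expert_fns=[np.mean] * 4,          # stand-in subband experts
    fullband_fn=np.mean,               # stand-in fullband model
    weights=np.full(5, 0.2),           # uniform fusion weights
)
```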
The experiments are robust, utilizing the WildSVDD dataset to benchmark the proposed method against existing state-of-the-art systems. The results demonstrate a significant performance improvement over traditional 16 kHz models, achieving a state-of-the-art EER of 1.58%. The comparative analysis of different fusion strategies provides valuable insights into the effectiveness of the proposed approach. However, the paper lacks detailed statistical analysis of the results, which would strengthen the findings.
The paper provides a comprehensive description of the experimental setup, including dataset preparation, model architecture, and training procedures. However, it lacks a public code repository or demo URL, which would enhance reproducibility. The absence of shared resources limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a single dataset (WildSVDD), which may not fully capture the diversity of real-world singing voice deepfakes. Additionally, while the paper discusses various fusion strategies, it does not explore the computational efficiency of these methods, which could be a concern for real-time applications. The authors could also provide more insights into the potential impact of noise and other artifacts in the audio data.
The research addresses a critical issue in the realm of audio synthesis and deepfake detection, with implications for copyright protection, content authenticity, and the broader field of audio forensics. The findings could inform future developments in anti-spoofing technologies and contribute to the establishment of standards for audio quality evaluation in deepfake detection.
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle with generalization in unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the \textbf{Binaural Difference Attention with Action Transition Prediction (BDATP)} framework, which jointly optimizes perception and policy. Specifically, the \textbf{Binaural Difference Attention (BDA)} module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the \textbf{Action Transition Prediction (ATP)} task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's superior generalization capability and its robustness across diverse navigation architectures.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The paper presents a novel framework for enhancing generalization in Audio-Visual Navigation through innovative attention mechanisms and action prediction strategies. The technical contributions are significant, addressing key challenges in the field and demonstrating strong empirical results, though improvements in reproducibility and application scope could further enhance its impact.
The proposed BDATP framework introduces two innovative components: the Binaural Difference Attention (BDA) module, which enhances spatial audio perception by focusing on interaural differences, and the Action Transition Prediction (ATP) task, which regularizes policy learning to improve generalization across unseen environments. This dual approach effectively addresses the limitations of existing AVN methods, particularly their tendency to overfit to specific training conditions. The methodology is well-structured, with clear explanations of how each component contributes to the overall framework.
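A minimal sketch of the interaural-difference idea, assuming the BDA module weights time frames by left/right asymmetry. The actual module is a learned attention mechanism; this toy pooling only conveys the intuition that direction cues, rather than semantic content, drive the weighting.

```python
import numpy as np

def binaural_difference_pooling(left, right):
    """Toy sketch: weight time frames by interaural level difference (ILD),
    so frames with strong left/right asymmetry (direction cues) dominate.

    left, right: (freq, time) magnitude spectrograms of the two ears.
    Returns a (freq,) feature pooled over time by the ILD-driven weights.
    """
    ild = np.abs(left - right).mean(axis=0)        # (time,) asymmetry score
    attn = np.exp(ild) / np.exp(ild).sum()         # softmax over frames
    fused = (left + right) / 2.0                   # identity-agnostic mix
    return fused @ attn                            # weighted temporal pooling

rng = np.random.default_rng(1)
L, R = rng.random((64, 40)), rng.random((64, 40))
feat = binaural_difference_pooling(L, R)
```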
The experiments are comprehensive, utilizing two well-known datasets (Replica and Matterport3D) to evaluate the effectiveness of BDATP. The authors provide a thorough comparison against several state-of-the-art baselines, demonstrating significant performance improvements in both heard and unheard sound categories. The metrics used (Success Rate, Success weighted by Path Length, and Success weighted by Number of Actions) are appropriate for the task and provide a clear picture of the framework's capabilities.
The paper lacks explicit details on the implementation, such as hyperparameters, training procedures, and code availability, which could hinder reproducibility. While the methodology is described in detail, providing access to the code and models would greatly enhance the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world environments. Additionally, while the proposed methods show strong performance in zero-shot settings, the paper does not address how the framework would perform in dynamic environments with moving sound sources or in multi-agent scenarios.
The BDATP framework has the potential to significantly advance the field of audio-visual navigation, particularly in applications involving robotics and autonomous systems. Its focus on generalization could lead to more robust navigation systems in real-world scenarios, enhancing the capabilities of embodied agents in complex environments.
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations -- GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper$\rightarrow$GPT-4o$\rightarrow$TTS) -- across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5\%); Gemini Live 3.1 achieves the fastest latency (4.25~s) but the lowest turn-take rate (78.0\%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12~s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
Primary: unknown
All Institutions: unknown
The paper introduces Full-Duplex-Bench-v3, a benchmark for evaluating real-time voice agents on multi-step tool execution using natural human speech. This work significantly contributes to the field by addressing the challenges of disfluency handling and tool use in voice interactions, paving the way for more effective and responsive AI systems.
The methodology is robust, introducing a novel benchmark (FDB-v3) that evaluates spoken language models under realistic conditions, utilizing real human audio annotated for disfluencies. The design incorporates multi-step tool use across various domains, which is a significant advancement over previous benchmarks that relied on synthetic data or single-step tasks. The systematic approach to scenario formulation and audio collection enhances the validity of the evaluation.
The experiments are comprehensive, evaluating six different model configurations across multiple dimensions such as accuracy, latency, and turn-taking dynamics. The results are well-presented, showing clear performance differences among models and highlighting specific strengths and weaknesses, particularly in handling disfluencies and multi-step reasoning. The use of deterministic mock APIs for evaluation is a strong point, ensuring that the results are not confounded by external factors.
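The headline accuracy and turn-taking numbers can be reproduced from interaction logs with straightforward counting; the helper names below are hypothetical, not part of the benchmark's tooling.

```python
def pass_at_1(results):
    """Fraction of scenarios solved on the first attempt.
    results: list of booleans, one per scenario."""
    return sum(results) / len(results)

def turn_take_rate(turns):
    """Fraction of user turn-ends after which the agent actually responded
    (missed turns count against the rate, as in the low-turn-take case)."""
    return sum(1 for responded in turns if responded) / len(turns)

p1 = pass_at_1([True, False, True, True, False])    # 3 of 5 solved first try
ttr = turn_take_rate([True] * 39 + [False] * 11)    # 39 of 50 turns taken
```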
The paper provides sufficient detail regarding the experimental setup, including the models evaluated and the evaluation metrics used. However, the lack of specific implementation details or code availability limits reproducibility. The benchmark is open and reproducible, which is a positive aspect, but without access to the models, full replication of results may be challenging.
The study acknowledges limitations, such as the fixed server region for cloud-based evaluations and the lack of robustness testing against real-world network anomalies. Additionally, the dataset is relatively small (100 recordings), which may affect generalizability. The focus on specific disfluency categories may also overlook other potential challenges in real-world interactions.
This work has significant implications for the development of real-time voice agents, particularly in enhancing their ability to handle natural speech disfluencies and multi-step tasks. The findings suggest directions for future research, emphasizing the need for models that can balance speed and accuracy in dynamic conversational contexts. The benchmark itself could facilitate further advancements in the field by providing a standardized evaluation framework.
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, they are limited to non-speech audio and lack the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of OmniSonic, a novel framework for generating comprehensive auditory scenes from video and text inputs, addressing previous limitations in audio generation models. This work significantly advances the field of audio synthesis by integrating multiple modalities and establishing a new benchmark for future research.
The proposed OmniSonic framework introduces a flow-matching-based diffusion model that effectively integrates video and text to generate comprehensive auditory scenes. The TriAttn-DiT architecture is a notable innovation, allowing simultaneous processing of on-screen environmental sounds, off-screen sounds, and speech conditions. The use of a Mixture-of-Experts (MoE) gating mechanism is a sophisticated approach that enhances the model's adaptability during audio generation. This methodology is well-structured and addresses the limitations of previous models, particularly in generating human speech alongside environmental sounds.
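The MoE gating over the three cross-attention streams can be sketched as a softmax gate producing convex mixing weights. The function below is a toy stand-in: in the real model the gate logits would be predicted from the inputs rather than passed in, and the streams would come from the three cross-attention branches.

```python
import numpy as np

def moe_gate_fusion(onscreen, offscreen, speech, gate_logits):
    """Sketch of MoE-style gating over three condition streams: a softmax
    over per-stream logits yields adaptive mixing weights, and the fused
    representation is the convex combination of the three streams.

    Each stream: (dim,) feature vector from one cross-attention branch."""
    w = np.exp(gate_logits - gate_logits.max())   # numerically stable softmax
    w = w / w.sum()
    fused = w[0] * onscreen + w[1] * offscreen + w[2] * speech
    return fused, w

rng = np.random.default_rng(2)
streams = [rng.random(8) for _ in range(3)]
fused, w = moe_gate_fusion(*streams, gate_logits=np.array([2.0, 0.5, 0.5]))
```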
The authors present extensive experiments that demonstrate the superiority of OmniSonic over existing state-of-the-art methods. The creation of the UniHAGen-Bench benchmark, which includes over a thousand samples across diverse scenarios, is a significant contribution that facilitates fair evaluation and comparison in the field. The combination of objective metrics and human evaluations provides a robust assessment of the model's performance, although specific metrics used for evaluation could be elaborated further for clarity.
The paper provides a project page with a URL, but lacks detailed implementation specifics in the text that would enhance reproducibility. While the methodology is sound, the absence of code or detailed experimental setups may hinder other researchers from replicating the results.
One limitation is the lack of detailed discussion on the computational resources required for training the OmniSonic model, which could be a barrier for some researchers. Additionally, while the model excels in generating audio from video and text, its performance in more nuanced or complex auditory environments remains to be fully explored.
The ability to generate holistic audio from multimodal inputs has significant implications for various applications, including film and video production, virtual reality, and assistive technologies for the hearing impaired. The advancements in audio generation could lead to more immersive experiences in entertainment and education, making this research highly relevant to both academic and industry stakeholders.
Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
Primary: CUHK MMLab
All Institutions: CUHK MMLab
The main contribution of this paper is the development of AURA, a novel framework that enables continuous video stream processing for real-time question answering and proactive interaction. This work significantly advances the field of VideoLLMs by addressing key limitations of existing systems and providing a robust platform for future research and applications.
The AURA framework presents a comprehensive end-to-end approach for real-time video understanding and interaction. It effectively integrates context management and data construction, which are crucial for maintaining continuity in long-horizon interactions. The methodology is well-structured, addressing the limitations of existing VideoLLMs by providing a unified model that supports both real-time question answering and proactive responses. The incorporation of ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems at a reasonable frame rate demonstrates a practical application of the proposed methods.
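A bounded sliding-window buffer is one simple way to realize the context management such a streaming system needs: keep only the most recent frame features so the model's context stays fixed over long-horizon streams. The class below is a toy sketch under that assumption, not AURA's actual mechanism.

```python
from collections import deque

class StreamingContext:
    """Toy sliding-window context manager for an always-on video assistant:
    retains at most `max_frames` of recent frame features, so the context
    fed to the model stays bounded however long the stream runs."""

    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)  # old frames evicted for free

    def push(self, frame_feat):
        self.frames.append(frame_feat)

    def context(self):
        return list(self.frames)

ctx = StreamingContext(max_frames=4)
for t in range(10):                 # 10 incoming frames, e.g. 5 s at 2 FPS
    ctx.push(f"frame-{t}")
window = ctx.context()              # only the 4 most recent frames survive
```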
The experiments conducted show that AURA achieves state-of-the-art performance on relevant streaming benchmarks, which is a significant accomplishment. The evaluation metrics used to assess performance should ideally include both subjective and objective measures to provide a comprehensive view of the model's capabilities. However, the paper could benefit from a more detailed breakdown of the datasets used and their characteristics, as well as comparisons with other contemporary systems.
The paper mentions the release of the AURA model and a real-time inference framework, which is a positive step towards reproducibility. However, further details regarding the training process, hyperparameters, and the specific configurations used in experiments would enhance reproducibility efforts. Clear documentation and access to code would be essential for other researchers to replicate the findings.
One limitation is the reliance on specific hardware (80G accelerators) for achieving the reported performance, which may not be accessible to all researchers. Additionally, while the system is designed for real-time interaction, the practical implications of latency and response times in diverse real-world scenarios are not fully explored. The paper could also discuss potential biases in the data or limitations in the model's understanding of complex interactions.
AURA has significant potential applications in various fields, including education, healthcare, and entertainment, where real-time video interaction is valuable. By enabling continuous observation and interaction, it could enhance user experiences in virtual environments and assistive technologies. The release of the model and framework could foster further research and development in real-time video understanding systems.
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset aligned with reliable and fine-grained natural language annotations. To tackle this, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized across six complementary dimensions, including sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
Primary: Southeast University
All Institutions: Southeast University, Shenzhen Loop Area Institute, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Technical University of Munich, Imperial College London
The paper presents AffectSpeech, a large-scale emotional speech dataset with fine-grained textual descriptions, addressing the limitations of traditional emotion representation methods. The innovative methodology and comprehensive evaluation underscore its potential to advance research in speech emotion recognition and synthesis, making it a valuable resource for the community.
The paper introduces a novel human-LLM collaborative annotation pipeline that enhances the quality and richness of emotional speech data. By integrating algorithmic pre-labeling, multi-LLM description generation, and human verification, the authors effectively address the challenges of annotation scalability and reliability. The dataset's multi-dimensional annotations across sentiment polarity, emotional intensity, prosodic attributes, and semantic content are well-structured, enabling comprehensive modeling of emotional speech. The methodology is innovative and well-articulated, contributing significantly to the field of speech emotion recognition and synthesis.
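The three-stage pipeline can be sketched as a compose-and-filter function: algorithmic pre-labels, several LLM-drafted descriptions, and a human verification gate that keeps only accepted drafts. All stage names here are illustrative stand-ins, not the paper's implementation.

```python
def annotate(utterance, pre_labeler, describers, verifier):
    """Human-LLM collaborative annotation sketch:
    1) algorithmic pre-labeling proposes coarse labels,
    2) several LLMs each draft a free-text emotion description,
    3) a human-in-the-loop verifier accepts or rejects each draft.
    Returns the pre-labels plus only the verified descriptions."""
    labels = pre_labeler(utterance)
    drafts = [describe(utterance, labels) for describe in describers]
    kept = [c for c in drafts if verifier(utterance, c)]
    return {"labels": labels, "captions": kept}

# Toy stand-ins for the three stages.
record = annotate(
    "utt-001",
    pre_labeler=lambda u: {"polarity": "negative", "intensity": "high"},
    describers=[lambda u, l: "tense, clipped delivery",
                lambda u, l: "flat monotone voice"],
    verifier=lambda u, c: "tense" in c,   # human keeps only matching drafts
)
```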
The experimental results demonstrate the effectiveness of the AffectSpeech dataset in improving the performance of speech emotion captioning and synthesis models. The authors provide thorough evaluations using both objective metrics (e.g., emotion accuracy, prosody accuracy) and subjective assessments (e.g., human preference tests). The results consistently show that models trained on AffectSpeech outperform those trained on existing datasets, validating the dataset's utility. The comprehensive evaluation across multiple models and tasks strengthens the paper's claims about the dataset's impact.
The paper provides detailed descriptions of the dataset construction, annotation process, and experimental setup, which facilitates reproducibility. However, the actual implementation details, such as specific model architectures and training configurations, could be more explicitly outlined to enhance reproducibility further. The availability of the dataset and demo on GitHub is a positive aspect for researchers looking to replicate the study.
While the dataset is extensive and well-annotated, potential limitations include the reliance on human annotators, which may introduce variability in the quality of annotations. Additionally, the dataset is currently limited to English, which may restrict its applicability in multilingual contexts. Future work should consider expanding the dataset to include diverse languages and dialects.
The AffectSpeech dataset has significant implications for various applications, including empathetic conversational agents, affect-aware human-computer interaction systems, and emotional speech synthesis in entertainment and education. By providing a more nuanced representation of emotional speech, it can enhance user experiences in interactive systems and contribute to advancements in affective computing.
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences are common. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation - from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
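The affinity-weighted soft top-k InfoNCE objective described in the abstract can be sketched as follows. This is a minimal NumPy illustration under our own reading of the description, not the authors' implementation: the function name and the softmax weighting of the teacher affinities are our assumptions.

```python
import numpy as np

def soft_topk_infonce(student_sim, teacher_aff, k=2, tau=0.1):
    """Affinity-weighted soft top-k InfoNCE over a batch (illustrative sketch).

    student_sim: (N, N) student similarities between instances.
    teacher_aff: (N, N) teacher-mined affinities used to pick soft positives.
    """
    n = student_sim.shape[0]
    total = 0.0
    for i in range(n):
        aff = teacher_aff[i].astype(float).copy()
        aff[i] = -np.inf                      # never treat the anchor as its own positive
        topk = np.argsort(aff)[-k:]           # the k most teacher-similar instances
        w = np.exp(aff[topk])
        w /= w.sum()                          # normalized soft-positive weights
        logits = student_sim[i] / tau
        log_prob = logits - np.log(np.exp(logits).sum())
        total += -(w * log_prob[topk]).sum()  # weighted cross-entropy over soft positives
    return total / n
```

The key departure from standard InfoNCE is that each anchor has several weighted positives (mined by the teacher) rather than a single hard one, which is what preserves the multi-positive relational structure the paper emphasizes.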
Primary: KDDI Research, Inc.
All Institutions: KDDI Research, Inc.
The main contribution of this paper is the introduction of HSC-MAE, a novel hierarchical framework for unsupervised audio-visual representation learning that effectively addresses the challenges of weakly paired data through a dual-path teacher-student architecture. This work represents a significant step forward in the field, providing a robust methodology that enhances the alignment of audio and visual modalities while demonstrating strong empirical results.
The proposed HSC-MAE framework introduces a dual-path teacher-student architecture that innovatively integrates three levels of semantic correlation: global, local, and sample-level. This hierarchical approach is a significant advancement in unsupervised audio-visual representation learning, as it effectively addresses the challenges posed by weakly paired data and spurious co-occurrences. The use of DCCA for global-level alignment and the introduction of teacher-mined soft top-k affinities for local-level correlation are particularly noteworthy, as they enhance the robustness of the learned representations. The methodology is well-structured and demonstrates a clear understanding of the complexities involved in multimodal learning.
The experiments conducted on the AVE and VEGAS datasets provide strong empirical validation of the proposed method. The reported substantial improvements in mean Average Precision (mAP) over existing unsupervised baselines indicate that HSC-MAE is effective in producing high-quality audio-visual embeddings. However, the paper could benefit from a more detailed comparison with state-of-the-art methods and additional qualitative analyses to further substantiate the claims made regarding the quality of the learned representations.
The paper lacks detailed implementation specifics, such as hyperparameter settings, training protocols, and data preprocessing steps, which are crucial for reproducibility. Including a supplementary material section or a dedicated reproducibility appendix would enhance the paper's value and allow other researchers to replicate the results more easily.
One limitation of the study is the reliance on weakly paired data, which may not fully capture the complexity of real-world audio-visual relationships. Additionally, while the proposed method shows promise, it would be beneficial to explore its performance across a wider range of datasets and tasks to assess its generalizability. The paper also does not address potential computational overheads associated with the dual-path architecture, which may limit its applicability in resource-constrained environments.
The HSC-MAE framework has the potential to significantly advance the field of unsupervised learning in audio-visual contexts, with applications in areas such as multimedia content analysis, automated video tagging, and improved human-computer interaction systems. By enhancing the quality of multimodal embeddings, this work could facilitate more sophisticated applications in AI-driven technologies, including virtual reality and augmented reality systems.
User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems still face serious challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to a high false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we put forward MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90%, significantly reducing FAR to 0.007% on the AMI dataset. Beyond its strong performance, MALEFA demonstrates high computational efficiency and can readily support real-time deployment on resource-constrained devices.
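The cross-attention used to align audio and text representations is standard scaled dot-product attention; a minimal sketch is below. The framing (phoneme-level queries attending over audio frames) is our assumption about how MALEFA applies it, not a description of the authors' exact architecture.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (sketch): e.g. phoneme-level text
    queries attending over audio frame features to produce alignments."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (num_queries, num_frames)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over frames
    return weights @ values                         # attended audio summary per query
```

Each query's output is a convex combination of the frame features, so the phoneme-level branch can pool evidence from exactly the frames that match each phoneme.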
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MALEFA, a lightweight zero-shot keyword spotting framework that effectively reduces false alarms while maintaining high accuracy through innovative multi-granularity contrastive learning and a tailored loss function. This work significantly advances the state of the art in keyword spotting, particularly in resource-constrained environments, and addresses critical challenges in distinguishing similar acoustic keywords.
The proposed MALEFA framework integrates multi-granularity contrastive learning with a novel false alarm-aware loss, which is a significant advancement in the field of zero-shot keyword spotting (ZSKWS). The methodology effectively combines utterance-level and phoneme-level learning objectives, which allows for improved alignment and accuracy in distinguishing acoustically similar keywords. The use of cross-attention mechanisms enhances the model's ability to align audio and text representations, thereby addressing a critical challenge in KWS systems. The design is lightweight, making it suitable for real-time deployment on resource-constrained devices, which is a notable practical consideration.
The experiments conducted on four public benchmark datasets demonstrate the effectiveness of MALEFA, achieving high accuracy (90%) and a remarkably low false alarm rate (0.007%) on the AMI dataset. The ablation studies provide strong evidence for the contributions of each component of the model, confirming that the integration of the proposed loss functions and learning objectives is essential for achieving state-of-the-art performance. The comparisons with existing models highlight MALEFA's robustness and efficiency, making it a competitive solution in the field.
The paper provides sufficient implementation details, including the architecture, training criteria, and experimental setup, which enhances reproducibility. However, the lack of specific citations for some methodologies and datasets may hinder complete reproducibility for external researchers. The use of a GitHub repository for the code is a positive aspect, allowing others to access and verify the implementation.
One limitation of the study is the reliance on specific datasets for evaluation, which may not fully represent the diversity of real-world scenarios in keyword spotting. Additionally, while the model shows promise in reducing false alarms, further exploration of its performance across different languages and accents would be beneficial. The paper also does not address potential biases in the training data, which could affect the model's generalization capabilities.
The MALEFA framework has significant implications for the development of adaptable and personalized voice interfaces, particularly in applications where user-defined keywords are essential. Its lightweight nature makes it suitable for deployment on various devices, including smartphones and smart home assistants, potentially enhancing user experience in everyday interactions. The approach could also pave the way for further research in zero-shot learning and its applications in other domains.
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.
Primary: Northwestern Polytechnical University
All Institutions: Nanjing University, Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The paper presents Speaker-Reasoner, an innovative Speech LLM that effectively addresses the challenges of timestamped speaker-attributed ASR through agentic multi-turn reasoning and a speaker-aware cache. This work significantly advances the state of the art in multi-speaker audio understanding, demonstrating substantial improvements over existing models and offering valuable insights for future research in the field.
The methodology presented in the paper is innovative, leveraging an end-to-end Speech LLM architecture that integrates multi-turn temporal reasoning with a speaker-aware context cache. The iterative global-to-local processing approach is a significant departure from traditional single-pass models, addressing the challenges of overlapping speech and rapid turn-taking effectively. The three-stage progressive training strategy is well-conceived, allowing the model to learn complex interactions and maintain speaker consistency across long-form audio. However, the paper could benefit from a more detailed explanation of the training process and the specific mechanisms used for temporal reasoning.
The experiments are robust, utilizing two well-defined datasets (AliMeeting and AISHELL-4) that reflect real-world challenges in multi-speaker scenarios. The reported results show consistent improvements over strong baselines, particularly in metrics relevant to speaker attribution and transcription accuracy. The use of multiple evaluation metrics (DER, CER, cpCER) provides a comprehensive view of the model's performance. However, the paper lacks a thorough comparison with other state-of-the-art models beyond the immediate baselines, which would strengthen the claims of superiority.
The paper provides sufficient details regarding the model architecture, training procedures, and datasets, which are crucial for reproducibility. The use of established frameworks (e.g., MS-Swift, Megatron-LM) and the clear description of the training stages contribute positively to reproducibility. However, the absence of publicly available code or a demo limits the ease of replication by other researchers.
One limitation of the proposed model is its reliance on the quality of the training data, which may not generalize well to all multi-speaker environments. Additionally, while the speaker-aware cache is a novel approach, it may introduce complexity in managing speaker identities over long recordings. The performance on long-form audio without manual segmentation could also be a concern, as it may not perform as well in highly dynamic environments.
The implications of this research are significant, particularly for applications in meeting transcription, intelligent assistants, and any domain requiring accurate speaker attribution in multi-speaker contexts. The advancements in handling overlapping speech and rapid turn-taking could enhance the usability of speech recognition systems in real-world scenarios, leading to improved accessibility and communication tools.
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
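The reflection-based multi-length strategy described above converts a variable-duration segment into several fixed input lengths. One plausible realization is to repeatedly append boundary reflections until each target length is reached; the helper names and the exact padding scheme below are our assumptions, offered as an illustrative sketch rather than the paper's implementation.

```python
import numpy as np

def reflect_to_length(segment, target_len):
    """Extend a variable-length segment to a fixed target length by
    repeatedly appending boundary reflections (illustrative sketch)."""
    out = np.asarray(segment, dtype=float)
    flipped = out[::-1]
    reverse = True
    while out.size < target_len:
        out = np.concatenate([out, flipped if reverse else flipped[::-1]])
        reverse = not reverse
    return out[:target_len]

def multi_length_views(segment, lengths=(8, 16, 32)):
    """One training view of the segment per fixed input length, giving the
    classifier diverse feature-space representations of the same content."""
    return [reflect_to_length(segment, n) for n in lengths]
```

Reflection keeps the padded region acoustically consistent with the segment itself, which matters here because each segment is assumed to contain uniformly bona fide or uniformly fake content.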
Primary: Ben Gurion University, Be'er Sheva, Israel
All Institutions: Ben Gurion University, University of Haifa
The paper presents a novel split-and-conquer framework for detecting partial deepfake speech, significantly advancing the field of audio deepfake detection through improved localization and classification methodologies. The comprehensive evaluation of the proposed method demonstrates its potential to enhance security in voice-based systems while addressing the challenges posed by partial manipulations in speech.
The proposed split-and-conquer framework effectively decomposes the complex task of partial deepfake speech detection into two distinct stages: boundary detection and segment-level classification. This separation allows for a more focused learning objective, enhancing the model's ability to localize manipulated regions accurately. The use of a dedicated boundary detector to identify transition points is a significant methodological innovation, as it reduces the ambiguity and noise typically associated with joint localization and classification tasks. The introduction of a reflection-based multi-length training strategy is also noteworthy, as it generates diverse feature-space representations, improving robustness and performance across various temporal resolutions.
The experiments conducted on the PartialSpoof and Half-Truth datasets demonstrate state-of-the-art performance, showcasing the effectiveness of the proposed method. The results indicate substantial improvements in both detection accuracy and localization capabilities, particularly at stricter evaluation criteria. The comprehensive evaluation across multiple configurations, feature extractors, and augmentation strategies provides a robust assessment of the method's performance, highlighting its generalization capabilities and robustness to boundary estimation errors.
The paper provides detailed descriptions of the experimental setup, including model architectures, training procedures, and evaluation metrics, which enhances reproducibility. The availability of a project repository on GitHub further supports reproducibility efforts, allowing other researchers to replicate the experiments and build upon the proposed framework.
Despite the strengths of the proposed method, there are notable limitations. The reliance on boundary prediction can introduce errors that propagate through the classification stage, particularly in challenging transition regions. Additionally, the assumption that manipulated content can be approximated by piecewise-uniform segments may not fully capture more gradual or subtle manipulations, which could limit the method's applicability in real-world scenarios.
The implications of this research are significant, particularly in the context of security-critical systems that rely on voice-based authentication and speaker verification. The ability to detect partial deepfake speech can enhance the integrity of communication systems and mitigate risks associated with audio deepfakes. Furthermore, the methodological advancements presented in this work may inspire further research in audio forensics and anti-spoofing technologies.
Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.
Primary: Martha Stewart Enterprises
All Institutions: Martha Stewart Enterprises, Allied Widgets Research
The main contribution of this paper is the introduction of DynFOA, a novel framework that synthesizes first-order ambisonics from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. This work significantly advances the state of spatial audio generation, addressing critical challenges in modeling complex acoustic environments.
The methodology presented in DynFOA is robust and innovative, integrating conditional diffusion modeling with 3D scene reconstruction to generate first-order ambisonics (FOA) from 360-degree videos. The approach effectively combines sound source localization, depth estimation, semantic segmentation, and material property extraction, which are critical for accurately modeling complex acoustic environments. The use of 3D Gaussian Splatting (3DGS) for scene reconstruction is a notable strength, as it allows for a detailed representation of the environment that informs the audio generation process. The conditional diffusion generator is well-structured, leveraging multimodal features for improved audio synthesis, which is a significant advancement over previous methods that lacked physical grounding.
The experimental evaluation is thorough, with the introduction of the M2G-360 dataset specifically designed to test the model under challenging acoustic conditions. The paper presents a comprehensive set of experiments that demonstrate the superiority of DynFOA over existing methods in terms of spatial accuracy, acoustic fidelity, and user perception metrics. The results are compelling, showing significant improvements in performance metrics such as Direction of Arrival (DOA) estimation and Signal-to-Noise Ratio (SNR), which are critical for validating the model's effectiveness in real-world scenarios.
The paper provides detailed implementation specifics, including the architecture of the model, training protocols, and the datasets used. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. The reliance on a distributed computing cluster for training may also pose challenges for researchers with limited resources.
One limitation of the study is the reliance on a fixed set of HRTFs for binaural rendering, which may not account for individual differences in hearing or head-related transfer functions. Additionally, while the M2G-360 dataset is a significant contribution, it may still not encompass all possible acoustic environments, particularly outdoor settings or highly variable conditions. The model's performance in such scenarios remains to be evaluated.
The implications of this research are substantial, particularly for the fields of virtual reality and immersive media. By enabling the generation of high-fidelity spatial audio that accurately reflects complex acoustic environments, DynFOA has the potential to enhance user experiences in gaming, film, and virtual environments. The methodology could also inspire future research in audio synthesis and multimodal learning, paving the way for more advanced audio-visual integration techniques.
Personalized or target speech extraction (TSE) typically requires a clean enrollment utterance -- hard to obtain in real-world crowded environments. We remove the need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings in standard clustering metrics. Conditioning these embeddings into multiple extraction back-ends consistently improves objective quality and intelligibility, and generalizes to real DNS-Challenge recordings.
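Permutation-invariant teacher supervision means the predicted embeddings have no fixed speaker order: every assignment between predicted and teacher embeddings is scored and the best one is trained. A minimal sketch of that matching step follows; the cosine-distance loss and function name are our assumptions, not the paper's exact objective.

```python
import numpy as np
from itertools import permutations

def pit_embedding_loss(pred, teacher):
    """Permutation-invariant cosine loss between K predicted and K teacher
    speaker embeddings (sketch): score every assignment, keep the best."""
    k = pred.shape[0]

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    best = np.inf
    for perm in permutations(range(k)):
        # mean cosine distance under this speaker assignment
        d = sum(1.0 - cos(pred[i], teacher[j]) for i, j in zip(range(k), perm))
        best = min(best, d / k)
    return best
```

Brute-force permutation search is fine for the small K considered here (up to three speakers); larger K would call for Hungarian matching instead.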
Primary: & Science University
All Institutions: & Science University, University of Michigan
This paper introduces a novel embedding-first approach to target speech extraction that eliminates the need for enrollment utterances, significantly enhancing the practicality of TSE systems in real-world environments. The methodology is innovative and well-executed, with promising experimental results that demonstrate its potential impact on the field of audio processing.
The paper presents a novel approach to target speech extraction (TSE) by eliminating the need for enrollment utterances, which is a significant limitation in practical applications. The authors propose a multi-speaker embedding encoder that directly maps noisy mixtures to a set of candidate speaker embeddings. This method utilizes permutation-invariant teacher supervision to ensure that the embeddings align with a single-speaker embedding space, thus maintaining structural integrity in the presence of noise and overlapping speech. The methodology is well-structured, leveraging existing frameworks like WavLM while innovating on the embedding extraction process. The use of a teacher-student model for training the embeddings is particularly noteworthy, as it enhances the robustness of the embeddings against noise.
The experimental setup is thorough, utilizing both synthetic datasets (LibriMix) and real-world recordings (DNS Challenge) to evaluate the proposed method. The authors provide a comprehensive set of metrics for assessing the quality of the embeddings and the performance of the TSE systems, including clustering accuracy and standard speech enhancement metrics (SI-SDR, PESQ, STOI). The results demonstrate that the proposed embeddings significantly improve TSE performance compared to traditional methods, indicating the effectiveness of the approach. However, the paper could benefit from more detailed comparisons with a broader range of existing methods to contextualize its contributions further.
The paper outlines the architecture and training procedures in sufficient detail, allowing for reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the results fully. Including a link to a GitHub repository or similar would enhance reproducibility and facilitate further research in this area.
One limitation of the study is the focus on a maximum of three speakers, which may not generalize well to environments with a higher number of overlapping speakers. Additionally, while the paper discusses the robustness of the embeddings, it does not extensively address potential failure cases, such as when speakers have similar voice characteristics or when the background noise is particularly challenging.
The proposed method has significant implications for real-world applications in personal audio devices, such as hearing aids and smart speakers, where the ability to isolate a target speaker in noisy environments is crucial. By removing the need for enrollment, the approach enhances usability and accessibility, making it easier for users to interact with technology in everyday situations. The research could also inspire further innovations in multi-speaker systems and applications in areas such as teleconferencing and assistive technologies.
Symbolic music generation has made significant progress, yet achieving fine-grained and flexible control over composer style remains challenging. Existing training-based methods for composer style conditioning depend on large labeled datasets. Besides, these methods typically support only single-composer generation at a time, limiting their applicability to more creative or blended scenarios. In this work, we propose Composer Vector, an inference-time steering method that operates directly in the model's latent space to control composer style without retraining. Through experiments on multiple symbolic music generation models, we show that Composer Vector effectively guides generations toward target composer styles, enabling smooth and interpretable control through a continuous steering coefficient. It also enables seamless fusion of multiple styles within a unified latent space framework. Overall, our work demonstrates that simple latent space steering provides a practical and general mechanism for controllable symbolic music generation, enabling more flexible and interactive creative workflows. Code and Demo are available here: https://github.com/JiangXunyi/Composer-Vector and https://jiangxunyi.github.io/composervector.github.io/
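Latent-space steering of the kind Composer Vector describes is commonly built from a mean activation difference and applied at inference with a scalar coefficient; multiple vectors can be blended for style fusion. The sketch below illustrates that general mechanism under our assumptions (function names and the mean-difference construction are ours, not necessarily the paper's exact recipe).

```python
import numpy as np

def composer_vector(target_acts, reference_acts):
    """Steering direction as a mean activation difference (sketch): average
    hidden states from target-composer samples minus a reference set."""
    return target_acts.mean(axis=0) - reference_acts.mean(axis=0)

def steer(hidden, vectors, coeffs):
    """Inference-time steering: add a weighted sum of composer vectors to a
    hidden state; the coefficients give continuous, blendable style control."""
    out = np.asarray(hidden, dtype=float).copy()
    for v, a in zip(vectors, coeffs):
        out += a * v
    return out
```

Because steering only edits activations at inference, the base model stays frozen, which is what makes the approach training-free and applicable across different symbolic music generators.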
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Composer Vector, a novel method for controlling composer style in symbolic music generation through latent-space steering. This work represents a significant advancement in the field, providing a practical and interpretable mechanism for generating music that blends stylistic traits from multiple composers, thereby enhancing creative possibilities in music generation.
The methodology presented in the paper is innovative, focusing on a latent-space steering approach that allows fine-grained control over composer styles in symbolic music generation. The authors construct a Composer Vector by analyzing the hidden representations of a transformer-based model, enabling continuous modulation of stylistic features without the need for retraining. This approach is significant because it addresses the limitations of existing methods, which require large labeled datasets and are constrained to single-composer generation. The method is well-structured, with clear definitions and a logical flow from hypothesis to implementation.
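The general recipe behind activation-steering methods like this can be sketched as follows. Note that this is a minimal illustration, not the paper's exact construction: the function names, the mean-difference formula, and the choice of layer are assumptions made here for clarity; the paper derives its Composer Vector from the model's actual hidden representations.

```python
import numpy as np

def composer_vector(target_acts, baseline_acts):
    """One common steering-vector recipe: mean hidden activation over a
    target composer's pieces minus the mean over a mixed baseline corpus.
    (Illustrative; the paper's exact construction may differ.)"""
    return target_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def steer(hidden, vector, alpha):
    """Add the scaled composer vector to a layer's hidden states at
    inference time; alpha is the continuous steering coefficient."""
    return hidden + alpha * vector

# Toy 4-dimensional hidden states, stand-ins for real transformer activations.
rng = np.random.default_rng(0)
target = rng.normal(loc=1.0, size=(8, 4))     # activations on composer-A pieces
baseline = rng.normal(loc=0.0, size=(32, 4))  # activations on a mixed corpus
v = composer_vector(target, baseline)

h = rng.normal(size=(3, 4))                   # hidden states during generation
h_steered = steer(h, v, alpha=1.5)

# Multi-style fusion falls out of the same space: a weighted
# combination of per-composer vectors.
other = rng.normal(loc=-1.0, size=(8, 4))     # activations on composer-B pieces
v_fused = 0.6 * v + 0.4 * composer_vector(other, baseline)
```

Because the intervention is a simple vector addition, alpha = 0 recovers the unsteered model exactly, which is what makes the coefficient a smooth, interpretable control knob.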
The experiments conducted are comprehensive, evaluating the effectiveness of the Composer Vector across multiple symbolic music generation models (NotaGen and ChatMusician). The authors utilize both similarity-based and classification-based metrics to assess the performance of their method, demonstrating significant improvements in style control and the ability to perform multi-style fusion. The results are quantitatively supported by clear metrics and visualizations, which strengthen the validity of their claims.
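A similarity-based style metric of the kind described above is typically a cosine similarity between an embedding of the generated piece and a reference embedding for the target composer. The sketch below assumes a centroid-of-references formulation; the paper's actual feature extractor and metric definition are not reproduced here.

```python
import numpy as np

def style_similarity(piece_emb, composer_centroid):
    """Cosine similarity between a generated piece's embedding and the
    centroid of a composer's reference embeddings. A plausible form of a
    similarity-based style metric; names and setup are illustrative."""
    num = float(np.dot(piece_emb, composer_centroid))
    den = float(np.linalg.norm(piece_emb) * np.linalg.norm(composer_centroid))
    return num / den

# Identical directions score 1.0; orthogonal directions score 0.0.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(style_similarity(a, a))  # → 1.0
print(style_similarity(a, b))  # → 0.0
```

Under this kind of metric, effective steering shows up as generations whose embeddings move closer to the target composer's centroid as the steering coefficient increases, complementing the classification-based evaluation.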
The paper provides sufficient details regarding the implementation of the Composer Vector and the experimental setup, including the datasets used and the evaluation metrics. The inclusion of code and demo links enhances reproducibility, allowing other researchers to replicate the experiments and validate the findings.
One limitation of the study is the reliance on specific symbolic music generation models, which may not generalize to all types of music or other generative frameworks. Additionally, while the method allows for style fusion, the paper does not extensively explore the qualitative aspects of the generated music, which could provide deeper insights into the effectiveness of the Composer Vector in practice.
The proposed method has the potential to significantly impact the field of music generation by enabling more flexible and interactive creative workflows. It opens avenues for artists and composers to explore hybrid styles and enhances the capabilities of music generation systems in educational and entertainment contexts. The implications of this work could extend to applications in music therapy, automated composition, and interactive music systems.