Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.
Primary: CUHK MMLab
All Institutions: CUHK MMLab
The main contribution of this paper is the development of AURA, a novel framework that enables continuous video stream processing for real-time question answering and proactive interaction. This work significantly advances the field of VideoLLMs by addressing key limitations of existing systems and providing a robust platform for future research and applications.
The AURA framework presents a comprehensive end-to-end approach for real-time video understanding and interaction. It effectively integrates context management and data construction, which are crucial for maintaining continuity in long-horizon interactions. The methodology is well-structured, addressing the limitations of existing VideoLLMs by providing a unified model that supports both real-time question answering and proactive responses. The incorporation of ASR (Automatic Speech Recognition) and TTS (Text-to-Speech) systems at a reasonable frame rate demonstrates a practical application of the proposed methods.
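The context-management idea described above can be illustrated with a minimal sketch: an always-on model must bound its context, so one simple policy is to keep only the most recent frame tokens within a fixed token budget. This is an assumption for illustration, not AURA's actual mechanism; the class name, budget, and tokens-per-frame figure are all hypothetical.

```python
from collections import deque

class StreamingContext:
    """Fixed-budget frame context for an always-on streaming model.

    A minimal sketch (not AURA's actual mechanism): retain the most
    recent frame tokens within a token budget, evicting oldest frames.
    """
    def __init__(self, max_tokens=2048, tokens_per_frame=64):
        self.max_frames = max_tokens // tokens_per_frame
        # deque with maxlen silently drops the oldest frame on overflow
        self.frames = deque(maxlen=self.max_frames)

    def push(self, frame_tokens):
        """Add the tokens of one newly ingested video frame."""
        self.frames.append(frame_tokens)

    def context(self):
        """Flattened token sequence fed to the LLM at each step."""
        return [tok for frame in self.frames for tok in frame]
```

Under this policy, long-horizon streaming stays within a constant memory footprint while the model always sees the freshest frames.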
The experiments conducted show that AURA achieves state-of-the-art performance on relevant streaming benchmarks, which is a significant accomplishment. The evaluation metrics used to assess performance should ideally include both subjective and objective measures to provide a comprehensive view of the model's capabilities. However, the paper could benefit from a more detailed breakdown of the datasets used and their characteristics, as well as comparisons with other contemporary systems.
The paper mentions the release of the AURA model and a real-time inference framework, which is a positive step towards reproducibility. However, further details regarding the training process, hyperparameters, and the specific configurations used in experiments would enhance reproducibility efforts. Clear documentation and access to code would be essential for other researchers to replicate the findings.
One limitation is the reliance on specific hardware (80G accelerators) for achieving the reported performance, which may not be accessible to all researchers. Additionally, while the system is designed for real-time interaction, the practical implications of latency and response times in diverse real-world scenarios are not fully explored. The paper could also discuss potential biases in the data or limitations in the model's understanding of complex interactions.
AURA has significant potential applications in various fields, including education, healthcare, and entertainment, where real-time video interaction is valuable. By enabling continuous observation and interaction, it could enhance user experiences in virtual environments and assistive technologies. The release of the model and framework could foster further research and development in real-time video understanding systems.
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an interactive mobile system for real-time soundscape mediation that selectively attenuates bothersome sounds while preserving desired audio. Sona is built on a target-conditioned neural pipeline that supports simultaneous attenuation of multiple overlapping sound sources, overcoming the single-target limitation of prior systems. It runs in real time on-device and supports user-extensible sound classes through in-situ audio examples, without retraining. Sona is informed by a formative study with 68 noise-sensitive individuals. Through technical benchmarking and an in-situ study with 10 participants, we show that Sona achieves low-latency, multi-target attenuation suitable for live listening, and enables meaningful reductions in bothersome sounds while maintaining awareness of surroundings. These results point toward a new class of personal AI systems that support comfort and social participation by mediating real-world acoustic environments.
Primary: University of Michigan
All Institutions: University of Michigan, University of California, Irvine
The main contribution of this paper is the development of Sona, an interactive mobile system that enables real-time, multi-target sound attenuation for individuals with noise sensitivity. This work represents a meaningful advancement in audio processing and accessibility technology, with the potential to significantly improve the daily experiences of users in noisy environments.
The methodology employed in Sona is innovative, utilizing a target-conditioned neural pipeline that allows for real-time attenuation of multiple overlapping sound sources. This is a significant advancement over existing systems that typically focus on single-target noise cancellation. The incorporation of user-extensible sound classes through in-situ examples without the need for retraining is a notable feature that enhances user personalization and adaptability. The formative study involving 68 noise-sensitive individuals provides a solid foundation for understanding user needs and preferences, which is crucial for the design of the system.
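The multi-target attenuation idea can be sketched in a few lines: given soft masks that attribute mixture energy to each bothersome source, each target is scaled down independently while the residual (desired) audio passes through. This is a hedged illustration of target-conditioned masking in general, not Sona's actual pipeline; the function name and mask convention are assumptions.

```python
import numpy as np

def attenuate_targets(mix_spec, target_masks, gains):
    """Apply per-target attenuation to a mixture magnitude spectrogram.

    mix_spec: (freq, time) magnitude spectrogram of the live mixture
    target_masks: list of (freq, time) soft masks, one per target source
    gains: attenuation factors in [0, 1] (0 = fully suppress, 1 = keep)
    """
    out = mix_spec.copy()
    for mask, gain in zip(target_masks, gains):
        # Remove only the energy attributed to this target; desired
        # audio (where the mask is near zero) is left untouched.
        out = out - mask * out * (1.0 - gain)
    return out
```

Because each target contributes its own mask and gain, several overlapping sources can be attenuated simultaneously, which is the single-target limitation the review notes prior systems suffer from.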
The experimental evaluation is robust, featuring both technical benchmarking and an in-situ study with 10 participants. The results demonstrate low-latency performance and effective sound attenuation while preserving desired audio, which is critical for maintaining situational awareness. The use of subjective measures to assess user comfort and soundscape mediation effectiveness adds credibility to the findings. However, the small sample size in the in-situ study may limit the generalizability of the results.
The paper does not provide explicit details regarding the implementation or access to the code, which raises concerns about reproducibility. While the methodology is described, without a publicly available implementation or detailed algorithmic descriptions, it may be challenging for other researchers to replicate the results or build upon this work.
One limitation is the small participant size in the in-situ study, which may not adequately represent the broader population of noise-sensitive individuals. Additionally, while the system allows for user-defined sound classes, the effectiveness of the system in highly dynamic or complex sound environments remains to be fully evaluated. There may also be challenges in the real-world application of the technology, such as varying user preferences and environmental conditions.
The potential applications of Sona are significant, particularly for individuals with noise sensitivity, including those with neurodivergent conditions. By enabling users to manage their auditory environments, Sona could enhance comfort and social participation, leading to improved quality of life. The implications extend beyond personal use, as the technology could be adapted for various settings, including workplaces, educational environments, and public spaces.
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate these problems, we propose \textbf{FastTurn}, a unified framework for low-latency, robust turn detection. To reduce latency without sacrificing accuracy, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
Primary: QualiaLabs
All Institutions: QualiaLabs
FastTurn presents a unified framework for low-latency and robust turn detection in full-duplex dialogue systems. The technical contributions, particularly in integrating acoustic and semantic cues, represent a meaningful advancement in the field of audio processing and dialogue systems, with potential applications in various real-time communication scenarios.
The methodology presented in FastTurn is innovative, combining streaming CTC decoding with acoustic features to enhance turn detection in full-duplex dialogue systems. The architecture is well-structured, comprising three main components that progressively integrate semantic and acoustic cues. The use of a four-stage training pipeline is commendable, as it stabilizes the optimization process and aligns speech and text modalities effectively. However, the reliance on CTC for initial transcription raises concerns about potential error propagation in noisy environments.
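The fusion of partial CTC evidence with acoustic cues can be sketched as a simple score combination: the fraction of recent frames dominated by the CTC blank token serves as a cheap semantic proxy for "the user has stopped producing content", blended with an acoustic endpoint cue. This is a hedged toy illustration, not FastTurn's architecture; the weighting, window, and threshold are all assumptions.

```python
import numpy as np

def turn_decision(ctc_posteriors, acoustic_score, blank_id=0,
                  semantic_weight=0.6, threshold=0.7):
    """Fuse partial CTC evidence with an acoustic cue into a turn score.

    ctc_posteriors: (frames, vocab) per-frame token posteriors so far
    acoustic_score: scalar in [0, 1], e.g. an energy/pause endpoint cue
    Returns (score, should_take_turn).
    """
    # Fraction of the most recent frames whose argmax is the blank token:
    # a streaming-friendly proxy for semantic completion.
    recent = ctc_posteriors[-10:]
    blank_frac = float(np.mean(np.argmax(recent, axis=1) == blank_id))
    score = semantic_weight * blank_frac + (1 - semantic_weight) * acoustic_score
    return score, score >= threshold
```

Because the score is computed from partial observations at every frame, a decision can fire early, before a full utterance transcript exists, which is the latency advantage the review highlights.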
The experiments are thorough, utilizing a diverse set of datasets and a comprehensive evaluation framework. The introduction of a new test set with realistic human dialogue scenarios is a significant contribution, allowing for better assessment of the model's performance in practical applications. The results demonstrate that FastTurn outperforms existing baselines in terms of accuracy and latency, underscoring its effectiveness. However, the paper could benefit from additional comparisons with more recent models in the field to contextualize its performance.
The paper provides sufficient details regarding the model architecture, training strategy, and evaluation metrics, which aids in reproducibility. However, the absence of publicly available code or a demo could hinder independent verification of results. Clear instructions for reproducing the experiments would enhance the paper's impact.
One limitation is the potential sensitivity of the model to CTC errors, especially in overlapping speech scenarios. Additionally, while the model shows robustness in various conditions, the performance on English datasets did not meet expectations, indicating a need for further optimization. The paper also does not address the computational resources required for training and inference, which could be a barrier for broader adoption.
The FastTurn framework has significant implications for real-time spoken dialogue systems, particularly in applications requiring low-latency interaction, such as virtual assistants and customer service bots. By improving turn detection, it can enhance user experience and facilitate more natural conversations. The release of the new dataset also opens avenues for future research in dialogue systems, potentially leading to advancements in multimodal interaction technologies.
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
Primary: National Taiwan University
All Institutions: National Taiwan University, NVIDIA Taiwan
The main contribution of this paper is the introduction of a joint fullband-subband modeling framework for high-resolution SingFake detection, which significantly enhances detection performance by leveraging the unique characteristics of singing voice audio. The methodology is innovative and addresses a pressing need in the field of audio forensics, making it a valuable addition to the literature.
The paper introduces a novel joint fullband-subband modeling framework, Sing-HiResNet, which effectively captures both global and localized spectral features for high-resolution SingFake detection. The methodology is well-structured, employing a two-phase approach that integrates fullband and subband models, and explores various fusion strategies to enhance detection performance. The use of high-resolution audio (44.1 kHz) is a significant advancement over conventional methods, and the systematic evaluation of subband contributions adds depth to the methodology. However, the paper could benefit from clearer explanations of the fusion strategies and their implications.
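The fullband-subband idea can be sketched concretely: split the high-resolution spectrogram into frequency bands so that band-specific detectors can focus on artifacts concentrated in particular regions, then fuse their scores with the fullband score. This is a hedged illustration of the general pattern, not Sing-HiResNet itself; the band count, equal-width split, and score-level fusion are assumptions (the paper explores several fusion strategies).

```python
import numpy as np

def subband_split(spec, n_bands=4):
    """Split a (freq, time) spectrogram into equal-width frequency bands,
    each handled by its own band-specific expert."""
    return np.array_split(spec, n_bands, axis=0)

def fuse_scores(fullband_score, subband_scores, weights=None):
    """Late (score-level) fusion of fullband and subband detector outputs.
    Defaults to a uniform average when no weights are learned."""
    scores = [fullband_score] + list(subband_scores)
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    return float(np.dot(weights, scores))
```

The key property is that high-frequency bands, discarded entirely by 16 kHz pipelines, contribute their own scores, which is where the complementary cues the review mentions come from.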
The experiments are robust, utilizing the WildSVDD dataset to benchmark the proposed method against existing state-of-the-art systems. The results demonstrate a significant performance improvement over traditional 16 kHz models, achieving a state-of-the-art EER of 1.58%. The comparative analysis of different fusion strategies provides valuable insights into the effectiveness of the proposed approach. However, the paper lacks detailed statistical analysis of the results, which would strengthen the findings.
The paper provides a comprehensive description of the experimental setup, including dataset preparation, model architecture, and training procedures. However, it lacks a public code repository or demo URL, which would enhance reproducibility. The absence of shared resources limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a single dataset (WildSVDD), which may not fully capture the diversity of real-world singing voice deepfakes. Additionally, while the paper discusses various fusion strategies, it does not explore the computational efficiency of these methods, which could be a concern for real-time applications. The authors could also provide more insights into the potential impact of noise and other artifacts in the audio data.
The research addresses a critical issue in the realm of audio synthesis and deepfake detection, with implications for copyright protection, content authenticity, and the broader field of audio forensics. The findings could inform future developments in anti-spoofing technologies and contribute to the establishment of standards for audio quality evaluation in deepfake detection.
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations -- GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper$\rightarrow$GPT-4o$\rightarrow$TTS) -- across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5\%); Gemini Live 3.1 achieves the fastest latency (4.25~s) but the lowest turn-take rate (78.0\%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12~s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
Primary: unknown
All Institutions: unknown
The paper introduces Full-Duplex-Bench-v3, a benchmark for evaluating real-time voice agents on multi-step tool execution using natural human speech. This work significantly contributes to the field by addressing the challenges of disfluency handling and tool use in voice interactions, paving the way for more effective and responsive AI systems.
The methodology is robust, introducing a novel benchmark (FDB-v3) that evaluates spoken language models under realistic conditions, utilizing real human audio annotated for disfluencies. The design incorporates multi-step tool use across various domains, which is a significant advancement over previous benchmarks that relied on synthetic data or single-step tasks. The systematic approach to scenario formulation and audio collection enhances the validity of the evaluation.
The experiments are comprehensive, evaluating six different model configurations across multiple dimensions such as accuracy, latency, and turn-taking dynamics. The results are well-presented, showing clear performance differences among models and highlighting specific strengths and weaknesses, particularly in handling disfluencies and multi-step reasoning. The use of deterministic mock APIs for evaluation is a strong point, ensuring that the results are not confounded by external factors.
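The deterministic mock API design praised above can be sketched in a few lines: each tool is a lookup table keyed by its arguments, so the same call always returns the same result and evaluation cannot be confounded by live-service variability. This is an illustrative sketch, not FDB-v3's actual harness; the factory name and table format are assumptions.

```python
def make_mock_api(table):
    """Build a deterministic mock tool: identical arguments always
    yield identical results; unexpected calls fail loudly."""
    def call(**kwargs):
        # Canonicalize keyword arguments so argument order is irrelevant.
        key = tuple(sorted(kwargs.items()))
        if key not in table:
            raise KeyError(f"unexpected call: {kwargs}")
        return table[key]
    return call
```

Chained multi-step scenarios can then be scored exactly: the evaluator checks that the agent issued the expected sequence of calls and surfaced the returned values correctly.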
The paper provides sufficient detail regarding the experimental setup, including the models evaluated and the evaluation metrics used. However, the lack of specific implementation details or code availability limits reproducibility. The benchmark is open and reproducible, which is a positive aspect, but without access to the models, full replication of results may be challenging.
The study acknowledges limitations, such as the fixed server region for cloud-based evaluations and the lack of robustness testing against real-world network anomalies. Additionally, the dataset is relatively small (100 recordings), which may affect generalizability. The focus on specific disfluency categories may also overlook other potential challenges in real-world interactions.
This work has significant implications for the development of real-time voice agents, particularly in enhancing their ability to handle natural speech disfluencies and multi-step tasks. The findings suggest directions for future research, emphasizing the need for models that can balance speed and accuracy in dynamic conversational contexts. The benchmark itself could facilitate further advancements in the field by providing a standardized evaluation framework.
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech audio, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of OmniSonic, a novel framework for generating comprehensive auditory scenes from video and text inputs, addressing previous limitations in audio generation models. This work significantly advances the field of audio synthesis by integrating multiple modalities and establishing a new benchmark for future research.
The proposed OmniSonic framework introduces a flow-matching-based diffusion model that effectively integrates video and text to generate comprehensive auditory scenes. The TriAttn-DiT architecture is a notable innovation, allowing simultaneous processing of on-screen environmental sounds, off-screen sounds, and speech conditions. The use of a Mixture-of-Experts (MoE) gating mechanism is a sophisticated approach that enhances the model's adaptability during audio generation. This methodology is well-structured and addresses the limitations of previous models, particularly in generating human speech alongside environmental sounds.
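The MoE gating over the three condition streams can be sketched as a softmax-weighted blend of the three cross-attention outputs, with gate logits computed from the current hidden state. This is a hedged toy illustration of the general gating pattern, not TriAttn-DiT's actual parameterization; the gate computation and all names are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_gate(hidden, attn_onscreen, attn_offscreen, attn_speech, gate_w):
    """Adaptively blend three cross-attention outputs with learned gates.

    hidden: (d,) current token state, used to compute the gate logits
    attn_*: (d,) outputs of the three condition-specific cross-attentions
    gate_w: (3, d) gating weight matrix (hypothetical parameterization)
    """
    gates = softmax(gate_w @ hidden)            # one weight per expert stream
    experts = np.stack([attn_onscreen, attn_offscreen, attn_speech])
    return gates @ experts                      # convex combination, shape (d,)
```

Because the gates depend on the hidden state, the blend can shift per token, emphasizing the speech condition during dialogue and the environmental conditions elsewhere.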
The authors present extensive experiments that demonstrate the superiority of OmniSonic over existing state-of-the-art methods. The creation of the UniHAGen-Bench benchmark, which includes over a thousand samples across diverse scenarios, is a significant contribution that facilitates fair evaluation and comparison in the field. The combination of objective metrics and human evaluations provides a robust assessment of the model's performance, although specific metrics used for evaluation could be elaborated further for clarity.
The paper provides a project page with a URL, but lacks detailed implementation specifics in the text that would enhance reproducibility. While the methodology is sound, the absence of code or detailed experimental setups may hinder other researchers from replicating the results.
One limitation is the lack of detailed discussion on the computational resources required for training the OmniSonic model, which could be a barrier for some researchers. Additionally, while the model excels in generating audio from video and text, its performance in more nuanced or complex auditory environments remains to be fully explored.
The ability to generate holistic audio from multimodal inputs has significant implications for various applications, including film and video production, virtual reality, and assistive technologies for the hearing impaired. The advancements in audio generation could lead to more immersive experiences in entertainment and education, making this research highly relevant to both academic and industry stakeholders.
Emotion is essential in spoken communication, yet most existing frameworks in speech emotion modeling rely on predefined categories or low-dimensional continuous attributes, which offer limited expressive capacity. Recent advances in speech emotion captioning and synthesis have shown that textual descriptions provide a more flexible and interpretable alternative for representing affective characteristics in speech. However, progress in this direction is hindered by the lack of an emotional speech dataset aligned with reliable and fine-grained natural language annotations. To tackle this, we introduce AffectSpeech, a large-scale corpus of human-recorded speech enriched with structured descriptions for fine-grained emotion analysis and generation. Each utterance is characterized across six complementary dimensions, including sentiment polarity, open-vocabulary emotion captions, intensity level, prosodic attributes, prominent segments, and semantic content, enabling multi-granular modeling of vocal expression. To balance annotation quality and scalability, we adopt a human-LLM collaborative annotation pipeline that integrates algorithmic pre-labeling, multi-LLM description generation, and human-in-the-loop verification. Furthermore, these annotations are reformulated into diverse descriptive styles to enhance linguistic diversity and reduce stylistic bias in downstream modeling. Experimental results on speech emotion captioning and synthesis demonstrate that models trained on AffectSpeech consistently achieve superior performance across multiple evaluation settings.
Primary: Southeast University
All Institutions: Southeast University, Shenzhen Loop Area Institute, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Technical University of Munich, Imperial College London
The paper presents AffectSpeech, a large-scale emotional speech dataset with fine-grained textual descriptions, addressing the limitations of traditional emotion representation methods. The innovative methodology and comprehensive evaluation underscore its potential to advance research in speech emotion recognition and synthesis, making it a valuable resource for the community.
The paper introduces a novel human-LLM collaborative annotation pipeline that enhances the quality and richness of emotional speech data. By integrating algorithmic pre-labeling, multi-LLM description generation, and human verification, the authors effectively address the challenges of annotation scalability and reliability. The dataset's multi-dimensional annotations across sentiment polarity, emotional intensity, prosodic attributes, and semantic content are well-structured, enabling comprehensive modeling of emotional speech. The methodology is innovative and well-articulated, contributing significantly to the field of speech emotion recognition and synthesis.
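The six annotation dimensions described above can be pictured as one structured record per utterance. The sketch below is purely hypothetical: the field names, value types, and intensity scale are assumptions for illustration, not the corpus's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AffectSpeechRecord:
    """Hypothetical layout for one annotated utterance; field names
    and scales are illustrative, not the released corpus format."""
    utterance_id: str
    sentiment_polarity: str             # e.g. "positive" / "negative" / "neutral"
    emotion_caption: str                # open-vocabulary natural-language description
    intensity_level: int                # assumed scale, e.g. 1 (mild) .. 5 (intense)
    prosodic_attributes: List[str]      # e.g. ["slow tempo", "falling intonation"]
    prominent_segments: List[Tuple[float, float]]  # (start_sec, end_sec) emphasis spans
    semantic_content: str               # transcript of the spoken words

rec = AffectSpeechRecord(
    utterance_id="utt_0001",
    sentiment_polarity="negative",
    emotion_caption="a weary, resigned sadness with a hint of frustration",
    intensity_level=3,
    prosodic_attributes=["slow tempo", "falling intonation"],
    prominent_segments=[(1.2, 2.0)],
    semantic_content="I suppose it doesn't matter anymore.",
)
```

A record like this makes the multi-granular annotations directly consumable by both captioning models (read the fields) and synthesis models (condition on them).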
The experimental results demonstrate the effectiveness of the AffectSpeech dataset in improving the performance of speech emotion captioning and synthesis models. The authors provide thorough evaluations using both objective metrics (e.g., emotion accuracy, prosody accuracy) and subjective assessments (e.g., human preference tests). The results consistently show that models trained on AffectSpeech outperform those trained on existing datasets, validating the dataset's utility. The comprehensive evaluation across multiple models and tasks strengthens the paper's claims about the dataset's impact.
The paper provides detailed descriptions of the dataset construction, annotation process, and experimental setup, which facilitates reproducibility. However, the actual implementation details, such as specific model architectures and training configurations, could be more explicitly outlined to enhance reproducibility further. The availability of the dataset and demo on GitHub is a positive aspect for researchers looking to replicate the study.
While the dataset is extensive and well-annotated, potential limitations include the reliance on human annotators, which may introduce variability in the quality of annotations. Additionally, the dataset is currently limited to English, which may restrict its applicability in multilingual contexts. Future work should consider expanding the dataset to include diverse languages and dialects.
The AffectSpeech dataset has significant implications for various applications, including empathetic conversational agents, affect-aware human-computer interaction systems, and emotional speech synthesis in entertainment and education. By providing a more nuanced representation of emotional speech, it can enhance user experiences in interactive systems and contribute to advancements in affective computing.
Learning aligned multimodal embeddings from weakly paired, label-free corpora is challenging: pipelines often provide only pre-extracted features, clips contain multiple events, and spurious co-occurrences abound. We propose HSC-MAE (Hierarchical Semantic Correlation-Aware Masked Autoencoder), a dual-path teacher-student framework that enforces semantic consistency across three complementary levels of representation, from coarse to fine: (i) global-level canonical-geometry correlation via DCCA, which aligns audio and visual embeddings within a shared modality-invariant subspace; (ii) local-level neighborhood-semantics correlation via teacher-mined soft top-k affinities, which preserves multi-positive relational structure among semantically similar instances; and (iii) sample-level conditional-sufficiency correlation via masked autoencoding, which ensures individual embeddings retain discriminative semantic content under partial observation. Concretely, a student MAE path is trained with masked feature reconstruction and affinity-weighted soft top-k InfoNCE; an EMA teacher operating on unmasked inputs via the CCA path supplies stable canonical geometry and soft positives. Learnable multi-task weights reconcile competing objectives, and an optional distillation loss transfers teacher geometry into the student. Experiments on AVE and VEGAS demonstrate substantial mAP improvements over strong unsupervised baselines, validating that HSC-MAE yields robust and well-structured audio-visual representations.
Primary: KDDI Research, Inc.
All Institutions: KDDI Research, Inc.
The main contribution of this paper is the introduction of HSC-MAE, a novel hierarchical framework for unsupervised audio-visual representation learning that effectively addresses the challenges of weakly paired data through a dual-path teacher-student architecture. This work represents a significant step forward in the field, providing a robust methodology that enhances the alignment of audio and visual modalities while demonstrating strong empirical results.
The proposed HSC-MAE framework introduces a dual-path teacher-student architecture that innovatively integrates three levels of semantic correlation—global, local, and sample-level. This hierarchical approach is a significant advancement in unsupervised audio-visual representation learning, as it effectively addresses the challenges posed by weakly paired data and spurious co-occurrences. The use of DCCA for global-level alignment and the introduction of teacher-mined soft top-k affinities for local-level correlation are particularly noteworthy, as they enhance the robustness of the learned representations. The methodology is well-structured and demonstrates a clear understanding of the complexities involved in multimodal learning.
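The local-level objective, as described above, trains the student against teacher-mined soft top-k affinities while an EMA teacher drifts slowly toward the student. The NumPy sketch below illustrates both ideas; the weighting scheme, temperature, and function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ema_update(teacher_params, student_params, momentum=0.999):
    # EMA teacher: parameters drift slowly toward the student's
    return [momentum * t + (1 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

def soft_topk_infonce(student_emb, teacher_emb, k=4, tau=0.07):
    """Affinity-weighted soft top-k InfoNCE (illustrative sketch)."""
    # normalise, then compare student anchors against teacher embeddings
    s = student_emb / np.linalg.norm(student_emb, axis=-1, keepdims=True)
    t = teacher_emb / np.linalg.norm(teacher_emb, axis=-1, keepdims=True)
    logits = s @ t.T / tau
    aff = t @ t.T                              # teacher-mined affinities
    idx = np.argsort(aff, axis=-1)[:, -k:]     # each row's top-k neighbours
    w = np.zeros_like(aff)                     # soft multi-positive weights
    np.put_along_axis(w, idx, softmax(np.take_along_axis(aff, idx, axis=-1)), -1)
    # cross-entropy against the soft target distribution
    return float(-(w * log_softmax(logits)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
loss = soft_topk_infonce(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)))
```

Because the targets are distributions over several teacher neighbours rather than a single positive index, semantically similar instances are pulled together without hard labels.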
The experiments conducted on the AVE and VEGAS datasets provide strong empirical validation of the proposed method. The reported substantial improvements in mean Average Precision (mAP) over existing unsupervised baselines indicate that HSC-MAE is effective in producing high-quality audio-visual embeddings. However, the paper could benefit from a more detailed comparison with state-of-the-art methods and additional qualitative analyses to further substantiate the claims made regarding the quality of the learned representations.
The paper lacks detailed implementation specifics, such as hyperparameter settings, training protocols, and data preprocessing steps, which are crucial for reproducibility. Including a supplementary material section or a dedicated reproducibility appendix would enhance the paper's value and allow other researchers to replicate the results more easily.
One limitation of the study is the reliance on weakly paired data, which may not fully capture the complexity of real-world audio-visual relationships. Additionally, while the proposed method shows promise, it would be beneficial to explore its performance across a wider range of datasets and tasks to assess its generalizability. The paper also does not address potential computational overheads associated with the dual-path architecture, which may limit its applicability in resource-constrained environments.
The HSC-MAE framework has the potential to significantly advance the field of unsupervised learning in audio-visual contexts, with applications in areas such as multimedia content analysis, automated video tagging, and improved human-computer interaction systems. By enhancing the quality of multimodal embeddings, this work could facilitate more sophisticated applications in AI-driven technologies, including virtual reality and augmented reality systems.
User-defined keyword spotting (KWS) without resorting to domain-specific pre-labeled training data is of fundamental importance in building adaptable and personalized voice interfaces. However, such systems still face substantial challenges, including constrained computational resources and limited annotated training data. Existing methods also struggle to distinguish acoustically similar keywords, often leading to an elevated false alarm rate (FAR) in real-world deployments. To mitigate these limitations, we propose MALEFA, a novel lightweight zero-shot KWS framework that jointly learns utterance- and phoneme-level alignments via cross-attention and a multi-granularity contrastive learning objective. Evaluations on four public benchmark datasets show that MALEFA achieves a high accuracy of 90%, significantly reducing FAR to 0.007% on the AMI dataset. Beyond its strong performance, MALEFA demonstrates high computational efficiency and can readily support real-time deployment on resource-constrained devices.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MALEFA, a lightweight zero-shot keyword spotting framework that effectively reduces false alarms while maintaining high accuracy through innovative multi-granularity contrastive learning and a tailored loss function. This work significantly advances the state of the art in keyword spotting, particularly in resource-constrained environments, and addresses critical challenges in distinguishing similar acoustic keywords.
The proposed MALEFA framework integrates multi-granularity contrastive learning with a novel false alarm-aware loss, which is a significant advancement in the field of zero-shot keyword spotting (ZSKWS). The methodology effectively combines utterance-level and phoneme-level learning objectives, which allows for improved alignment and accuracy in distinguishing acoustically similar keywords. The use of cross-attention mechanisms enhances the model's ability to align audio and text representations, thereby addressing a critical challenge in KWS systems. The design is lightweight, making it suitable for real-time deployment on resource-constrained devices, which is a notable practical consideration.
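The cross-attention alignment highlighted above can be sketched generically: text-side phoneme queries attend over audio-frame keys and values to produce phoneme-aligned audio representations. This is standard scaled dot-product attention, not the paper's actual module; the shapes and dimensions below are illustrative.

```python
import numpy as np

def cross_attention(query, key, value):
    """Scaled dot-product cross-attention (generic sketch): each query
    row produces a convex combination of the value rows."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # rows sum to 1
    return attn @ value, attn

rng = np.random.default_rng(0)
phonemes = rng.normal(size=(5, 32))   # hypothetical phoneme embeddings (queries)
frames = rng.normal(size=(40, 32))    # hypothetical audio-frame features
aligned, attn = cross_attention(phonemes, frames, frames)
```

Each of the 5 phoneme queries yields one 32-dim audio summary, and the attention map `attn` gives a soft phoneme-to-frame alignment that a contrastive objective can then operate on.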
The experiments conducted on four public benchmark datasets demonstrate the effectiveness of MALEFA, achieving high accuracy (90%) and a remarkably low false alarm rate (0.007%) on the AMI dataset. The ablation studies provide strong evidence for the contributions of each component of the model, confirming that the integration of the proposed loss functions and learning objectives is essential for achieving state-of-the-art performance. The comparisons with existing models highlight MALEFA's robustness and efficiency, making it a competitive solution in the field.
The paper provides sufficient implementation details, including the architecture, training criteria, and experimental setup, which enhances reproducibility. However, the lack of specific citations for some methodologies and datasets may hinder complete reproducibility for external researchers. The use of a GitHub repository for the code is a positive aspect, allowing others to access and verify the implementation.
One limitation of the study is the reliance on specific datasets for evaluation, which may not fully represent the diversity of real-world scenarios in keyword spotting. Additionally, while the model shows promise in reducing false alarms, further exploration of its performance across different languages and accents would be beneficial. The paper also does not address potential biases in the training data, which could affect the model's generalization capabilities.
The MALEFA framework has significant implications for the development of adaptable and personalized voice interfaces, particularly in applications where user-defined keywords are essential. Its lightweight nature makes it suitable for deployment on various devices, including smartphones and smart home assistants, potentially enhancing user experience in everyday interactions. The approach could also pave the way for further research in zero-shot learning and its applications in other domains.
Transcribing and understanding multi-speaker conversations requires speech recognition, speaker attribution, and timestamp localization. While speech LLMs excel at single-speaker tasks, multi-speaker scenarios remain challenging due to overlapping speech, backchannels, rapid turn-taking, and context window constraints. We propose Speaker-Reasoner, an end-to-end Speech LLM with agentic multi-turn temporal reasoning. Instead of single-pass inference, the model iteratively analyzes global audio structure, autonomously predicts temporal boundaries, and performs fine-grained segment analysis, jointly modeling speaker identity, gender, timestamps, and transcription. A speaker-aware cache further extends processing to audio exceeding the training context window. Trained with a three-stage progressive strategy, Speaker-Reasoner achieves consistent improvements over strong baselines on AliMeeting and AISHELL-4 datasets, particularly in handling overlapping speech and complex turn-taking.
Primary: Northwestern Polytechnical University
All Institutions: Nanjing University, Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The paper presents Speaker-Reasoner, an innovative Speech LLM that effectively addresses the challenges of timestamped speaker-attributed ASR through agentic multi-turn reasoning and a speaker-aware cache. This work significantly advances the state of the art in multi-speaker audio understanding, demonstrating substantial improvements over existing models and offering valuable insights for future research in the field.
The methodology presented in the paper is innovative, leveraging an end-to-end Speech LLM architecture that integrates multi-turn temporal reasoning with a speaker-aware context cache. The iterative global-to-local processing approach is a significant departure from traditional single-pass models, addressing the challenges of overlapping speech and rapid turn-taking effectively. The three-stage progressive training strategy is well-conceived, allowing the model to learn complex interactions and maintain speaker consistency across long-form audio. However, the paper could benefit from a more detailed explanation of the training process and the specific mechanisms used for temporal reasoning.
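The iterative global-to-local loop described above can be sketched as plain control flow: a first "turn" proposes temporal boundaries for the whole recording, and subsequent turns analyze each resulting segment. The function arguments below are stand-ins for model calls, not the paper's API.

```python
def analyze_meeting(audio, predict_boundaries, analyze_segment):
    """Global-to-local multi-turn loop (illustrative sketch)."""
    # Turn 1: global pass over the full audio proposes boundary timestamps
    boundaries = predict_boundaries(audio)
    segments = list(zip([0.0] + boundaries, boundaries + [audio["duration"]]))
    # Subsequent turns: fine-grained per-segment analysis, during which a
    # speaker-aware cache would carry identities across segments
    return [analyze_segment(audio, start, end) for start, end in segments]

audio = {"duration": 10.0}
out = analyze_meeting(
    audio,
    predict_boundaries=lambda a: [3.0, 7.5],          # stand-in model call
    analyze_segment=lambda a, s, e: {"start": s, "end": e,
                                     "speaker": "spk?", "text": "..."},
)
```

Splitting inference this way keeps each model call's context short, which is what lets the speaker-aware cache extend processing beyond the training context window.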
The experiments are robust, utilizing two well-defined datasets (AliMeeting and AISHELL-4) that reflect real-world challenges in multi-speaker scenarios. The reported results show consistent improvements over strong baselines, particularly in metrics relevant to speaker attribution and transcription accuracy. The use of multiple evaluation metrics (DER, CER, cpCER) provides a comprehensive view of the model's performance. However, the paper lacks a thorough comparison with other state-of-the-art models beyond the immediate baselines, which would strengthen the claims of superiority.
The paper provides sufficient details regarding the model architecture, training procedures, and datasets, which are crucial for reproducibility. The use of established frameworks (e.g., MS-Swift, Megatron-LM) and the clear description of the training stages contribute positively to reproducibility. However, the absence of publicly available code or a demo limits the ease of replication by other researchers.
One limitation of the proposed model is its reliance on the quality of the training data, which may not generalize well to all multi-speaker environments. Additionally, while the speaker-aware cache is a novel approach, it may introduce complexity in managing speaker identities over long recordings. The performance on long-form audio without manual segmentation could also be a concern, as it may not perform as well in highly dynamic environments.
The implications of this research are significant, particularly for applications in meeting transcription, intelligent assistants, and any domain requiring accurate speaker attribution in multi-speaker contexts. The advancements in handling overlapping speech and rapid turn-taking could enhance the usability of speech recognition systems in real-world scenarios, leading to improved accessibility and communication tools.
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
Primary: Ben Gurion University, Be'er Sheva, Israel
All Institutions: Ben Gurion University, University of Haifa
The paper presents a novel split-and-conquer framework for detecting partial deepfake speech, significantly advancing the field of audio deepfake detection through improved localization and classification methodologies. The comprehensive evaluation of the proposed method demonstrates its potential to enhance security in voice-based systems while addressing the challenges posed by partial manipulations in speech.
The proposed split-and-conquer framework effectively decomposes the complex task of partial deepfake speech detection into two distinct stages: boundary detection and segment-level classification. This separation allows for a more focused learning objective, enhancing the model's ability to localize manipulated regions accurately. The use of a dedicated boundary detector to identify transition points is a significant methodological innovation, as it reduces the ambiguity and noise typically associated with joint localization and classification tasks. The introduction of a reflection-based multi-length training strategy is also noteworthy, as it generates diverse feature-space representations, improving robustness and performance across various temporal resolutions.
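The reflection-based multi-length strategy mentioned above converts a variable-duration segment into fixed input lengths. One plausible reading, sketched below, is to tile the segment by alternating it with its time-reversed copy until the target length is reached; this is an assumption for illustration, not the paper's exact code.

```python
import numpy as np

def reflect_to_length(segment, target_len):
    """Tile a segment to a fixed length by repeated reflection
    (illustrative reconstruction of a multi-length strategy)."""
    out = segment
    flipped = segment[::-1]
    use_flipped = True
    while len(out) < target_len:
        # alternate reversed/forward copies so the boundary stays continuous
        out = np.concatenate([out, flipped if use_flipped else segment])
        use_flipped = not use_flipped
    return out[:target_len]

seg = np.arange(5, dtype=float)      # a short 5-sample "segment"
fixed = reflect_to_length(seg, 12)   # [0,1,2,3,4,4,3,2,1,0,0,1]
```

Unlike zero-padding, reflection avoids introducing silence artifacts at the seam, and running the same segment through several target lengths yields the diverse feature-space views the training strategy relies on.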
The experiments conducted on the PartialSpoof and Half-Truth datasets demonstrate state-of-the-art performance, showcasing the effectiveness of the proposed method. The results indicate substantial improvements in both detection accuracy and localization capabilities, particularly at stricter evaluation criteria. The comprehensive evaluation across multiple configurations, feature extractors, and augmentation strategies provides a robust assessment of the method's performance, highlighting its generalization capabilities and robustness to boundary estimation errors.
The paper provides detailed descriptions of the experimental setup, including model architectures, training procedures, and evaluation metrics, which enhances reproducibility. The availability of a project repository on GitHub further supports reproducibility efforts, allowing other researchers to replicate the experiments and build upon the proposed framework.
Despite the strengths of the proposed method, there are notable limitations. The reliance on boundary prediction can introduce errors that propagate through the classification stage, particularly in challenging transition regions. Additionally, the assumption that manipulated content can be approximated by piecewise-uniform segments may not fully capture more gradual or subtle manipulations, which could limit the method's applicability in real-world scenarios.
The implications of this research are significant, particularly in the context of security-critical systems that rely on voice-based authentication and speaker verification. The ability to detect partial deepfake speech can enhance the integrity of communication systems and mitigate risks associated with audio deepfakes. Furthermore, the methodological advancements presented in this work may inspire further research in audio forensics and anti-spoofing technologies.
Spatial audio is crucial for immersive 360-degree video experiences, yet most 360-degree videos lack it due to the difficulty of capturing spatial audio during recording. Automatically generating spatial audio such as first-order ambisonics (FOA) from video therefore remains an important but challenging problem. In complex scenes, sound perception depends not only on sound source locations but also on scene geometry, materials, and dynamic interactions with the environment. However, existing approaches only rely on visual cues and fail to model dynamic sources and acoustic effects such as occlusion, reflections, and reverberation. To address these challenges, we propose DynFOA, a generative framework that synthesizes FOA from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. DynFOA analyzes the input video to detect and localize dynamic sound sources, estimate depth and semantics, and reconstruct scene geometry and materials using 3D Gaussian Splatting (3DGS). The reconstructed scene representation provides physically grounded features that capture acoustic interactions between sources, environment, and listener viewpoint. Conditioned on these features, a diffusion model generates spatial audio consistent with the scene dynamics and acoustic context. We introduce M2G-360, a dataset of 600 real-world clips divided into MoveSources, Multi-Source, and Geometry subsets for evaluating robustness under diverse conditions. Experiments show that DynFOA consistently outperforms existing methods in spatial accuracy, acoustic fidelity, distribution matching, and perceived immersive experience.
Primary: Martha Stewart Enterprises
All Institutions: Martha Stewart Enterprises, Allied Widgets Research
The main contribution of this paper is the introduction of DynFOA, a novel framework that synthesizes first-order ambisonics from 360-degree videos by integrating dynamic scene reconstruction with conditional diffusion modeling. This work significantly advances the state of spatial audio generation, addressing critical challenges in modeling complex acoustic environments.
The methodology presented in DynFOA is robust and innovative, integrating conditional diffusion modeling with 3D scene reconstruction to generate first-order ambisonics (FOA) from 360-degree videos. The approach effectively combines sound source localization, depth estimation, semantic segmentation, and material property extraction, which are critical for accurately modeling complex acoustic environments. The use of 3D Gaussian Splatting (3DGS) for scene reconstruction is a notable strength, as it allows for a detailed representation of the environment that informs the audio generation process. The conditional diffusion generator is well-structured, leveraging multimodal features for improved audio synthesis, which is a significant advancement over previous methods that lacked physical grounding.
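For background on the output format, first-order ambisonics encodes a source direction with fixed per-channel gains. The sketch below uses the standard FOA panning equations (ACN channel order, SN3D normalization) for a single point source; DynFOA's generator is a learned diffusion model, not this closed form.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics for one source
    direction (ACN order [W, Y, Z, X], SN3D normalisation)."""
    w = 1.0                                       # omnidirectional channel
    y = np.sin(azimuth) * np.cos(elevation)       # left-right
    z = np.sin(elevation)                         # up-down
    x = np.cos(azimuth) * np.cos(elevation)       # front-back
    return np.stack([mono * w, mono * y, mono * z, mono * x])

# 440 Hz tone at 16 kHz, placed hard left (azimuth 90 degrees, elevation 0)
sig = np.sin(2 * np.pi * 440 * np.arange(100) / 16000)
foa = encode_foa(sig, azimuth=np.pi / 2, elevation=0.0)
```

A purely left-positioned source puts all directional energy into the Y channel, which is why metrics like DOA estimation can recover source direction from the relative channel gains.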
The experimental evaluation is thorough, with the introduction of the M2G-360 dataset specifically designed to test the model under challenging acoustic conditions. The paper presents a comprehensive set of experiments that demonstrate the superiority of DynFOA over existing methods in terms of spatial accuracy, acoustic fidelity, and user perception metrics. The results are compelling, showing significant improvements in performance metrics such as Direction of Arrival (DOA) estimation and Signal-to-Noise Ratio (SNR), which are critical for validating the model's effectiveness in real-world scenarios.
The paper provides detailed implementation specifics, including the architecture of the model, training protocols, and the datasets used. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. The reliance on a distributed computing cluster for training may also pose challenges for researchers with limited resources.
One limitation of the study is the reliance on a fixed set of HRTFs for binaural rendering, which may not account for individual differences in hearing or head-related transfer functions. Additionally, while the M2G-360 dataset is a significant contribution, it may still not encompass all possible acoustic environments, particularly outdoor settings or highly variable conditions. The model's performance in such scenarios remains to be evaluated.
The implications of this research are substantial, particularly for the fields of virtual reality and immersive media. By enabling the generation of high-fidelity spatial audio that accurately reflects complex acoustic environments, DynFOA has the potential to enhance user experiences in gaming, film, and virtual environments. The methodology could also inspire future research in audio synthesis and multimodal learning, paving the way for more advanced audio-visual integration techniques.
Personalized or target speech extraction (TSE) typically requires a clean enrollment utterance, which is hard to obtain in real-world crowded environments. We remove the need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings on standard clustering metrics. Conditioning these embeddings into multiple extraction back-ends consistently improves objective quality and intelligibility, and generalizes to real DNS-Challenge recordings.
Primary: & Science University
All Institutions: & Science University, University of Michigan
This paper introduces a novel embedding-first approach to target speech extraction that eliminates the need for enrollment utterances, significantly enhancing the practicality of TSE systems in real-world environments. The methodology is innovative and well-executed, with promising experimental results that demonstrate its potential impact on the field of audio processing.
The paper presents a novel approach to target speech extraction (TSE) by eliminating the need for enrollment utterances, which is a significant limitation in practical applications. The authors propose a multi-speaker embedding encoder that directly maps noisy mixtures to a set of candidate speaker embeddings. This method utilizes permutation-invariant teacher supervision to ensure that the embeddings align with a single-speaker embedding space, thus maintaining structural integrity in the presence of noise and overlapping speech. The methodology is well-structured, leveraging existing frameworks like WavLM while innovating on the embedding extraction process. The use of a teacher-student model for training the embeddings is particularly noteworthy, as it enhances the robustness of the embeddings against noise.
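Permutation-invariant teacher supervision means the loss is taken under the best matching between predicted candidate embeddings and teacher targets, since the model cannot know in which order speakers appear. A minimal sketch follows; the squared-distance criterion and brute-force matching are assumptions for illustration (the paper's distance may differ), but brute force is cheap for the 2-3 speakers considered.

```python
import itertools
import numpy as np

def pit_embedding_loss(pred, target):
    """Permutation-invariant loss between predicted candidate embeddings
    and teacher speaker embeddings (illustrative sketch)."""
    n = len(target)
    best = np.inf
    for perm in itertools.permutations(range(len(pred)), n):
        # mean squared distance under this candidate-to-target assignment
        d = np.mean([np.sum((pred[p] - target[i]) ** 2)
                     for i, p in enumerate(perm)])
        best = min(best, d)
    return float(best)

teacher = np.eye(3)               # three orthogonal "speaker" targets
pred = teacher[[2, 0, 1]] + 0.01  # same embeddings, shuffled and perturbed
loss = pit_embedding_loss(pred, teacher)
```

Because the minimum is taken over assignments, the shuffled predictions above still incur only the small perturbation cost, not a label-order penalty. For larger speaker counts, Hungarian matching (e.g. `scipy.optimize.linear_sum_assignment`) replaces the brute-force loop.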
The experimental setup is thorough, utilizing both synthetic datasets (LibriMix) and real-world recordings (DNS Challenge) to evaluate the proposed method. The authors provide a comprehensive set of metrics for assessing the quality of the embeddings and the performance of the TSE systems, including clustering accuracy and standard speech enhancement metrics (SI-SDR, PESQ, STOI). The results demonstrate that the proposed embeddings significantly improve TSE performance compared to traditional methods, indicating the effectiveness of the approach. However, the paper could benefit from more detailed comparisons with a broader range of existing methods to contextualize its contributions further.
The paper outlines the architecture and training procedures in sufficient detail, allowing for reproducibility. However, the lack of publicly available code or datasets limits the ability of other researchers to replicate the results fully. Including a link to a GitHub repository or similar would enhance reproducibility and facilitate further research in this area.
One limitation of the study is the focus on a maximum of three speakers, which may not generalize well to environments with a higher number of overlapping speakers. Additionally, while the paper discusses the robustness of the embeddings, it does not extensively address potential failure cases, such as when speakers have similar voice characteristics or when the background noise is particularly challenging.
The proposed method has significant implications for real-world applications in personal audio devices, such as hearing aids and smart speakers, where the ability to isolate a target speaker in noisy environments is crucial. By removing the need for enrollment, the approach enhances usability and accessibility, making it easier for users to interact with technology in everyday situations. The research could also inspire further innovations in multi-speaker systems and applications in areas such as teleconferencing and assistive technologies. This paper introduces a novel embedding-first approach to target speech extraction that eliminates the need for enrollment utterances, significantly enhancing the practicality of TSE systems in real-world environments. The methodology is innovative and well-executed, with promising experimental results that demonstrate its potential impact on the field of audio processing.
Symbolic music generation has made significant progress, yet achieving fine-grained and flexible control over composer style remains challenging. Existing training-based methods for composer style conditioning depend on large labeled datasets. Moreover, these methods typically support only single-composer generation at a time, limiting their applicability to more creative or blended scenarios. In this work, we propose Composer Vector, an inference-time steering method that operates directly in the model's latent space to control composer style without retraining. Through experiments on multiple symbolic music generation models, we show that Composer Vector effectively guides generations toward target composer styles, enabling smooth and interpretable control through a continuous steering coefficient. It also enables seamless fusion of multiple styles within a unified latent space framework. Overall, our work demonstrates that simple latent space steering provides a practical and general mechanism for controllable symbolic music generation, enabling more flexible and interactive creative workflows. Code and demo are available at https://github.com/JiangXunyi/Composer-Vector and https://jiangxunyi.github.io/composervector.github.io/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Composer Vector, a novel method for controlling composer style in symbolic music generation through latent-space steering. This work represents a significant advancement in the field, providing a practical and interpretable mechanism for generating music that blends stylistic traits from multiple composers, thereby enhancing creative possibilities in music generation.
The methodology presented in the paper is innovative, focusing on a latent-space steering approach that allows for fine-grained control over composer styles in symbolic music generation. The authors construct a Composer Vector by analyzing the hidden representations of a transformer-based model, allowing for continuous modulation of stylistic features without the need for retraining. This approach is significant as it addresses the limitations of existing methods that require large labeled datasets and are constrained to single-composer generation. The method is well-structured, with clear definitions and a logical flow from hypothesis to implementation.
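The core latent-steering operation described here is simple enough to sketch: estimate a direction as the difference of mean hidden activations between a target composer's pieces and a baseline, then add it with a continuous coefficient at inference. This is a generic activation-steering sketch under assumed shapes, not the authors' implementation:

```python
import numpy as np

def composer_vector(target_acts, baseline_acts):
    """Steering direction: difference of mean hidden activations between
    a target composer's pieces and a neutral baseline corpus."""
    return target_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def steer(hidden, vector, coeff):
    """Shift a decoding-time hidden state along the steering direction;
    `coeff` gives continuous, interpretable control over style strength."""
    return hidden + coeff * vector

# Toy data: the "target style" differs from baseline only along dim 0.
rng = np.random.default_rng(0)
baseline = rng.standard_normal((32, 8))
target = baseline + np.array([2.0] + [0.0] * 7)
vec = composer_vector(target, baseline)
steered = steer(np.zeros(8), vec, coeff=0.5)
```

Multi-style fusion in this picture is just a weighted sum of several such vectors before the `steer` call.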
The experiments conducted are comprehensive, evaluating the effectiveness of the Composer Vector across multiple symbolic music generation models (NotaGen and ChatMusician). The authors utilize both similarity-based and classification-based metrics to assess the performance of their method, demonstrating significant improvements in style control and the ability to perform multi-style fusion. The results are quantitatively supported by clear metrics and visualizations, which strengthen the validity of their claims.
The paper provides sufficient details regarding the implementation of the Composer Vector and the experimental setup, including the datasets used and the evaluation metrics. The inclusion of code and demo links enhances reproducibility, allowing other researchers to replicate the experiments and validate the findings.
One limitation of the study is the reliance on specific symbolic music generation models, which may not generalize to all types of music or other generative frameworks. Additionally, while the method allows for style fusion, the paper does not extensively explore the qualitative aspects of the generated music, which could provide deeper insights into the effectiveness of the Composer Vector in practice.
The proposed method has the potential to significantly impact the field of music generation by enabling more flexible and interactive creative workflows. It opens avenues for artists and composers to explore hybrid styles and enhances the capabilities of music generation systems in educational and entertainment contexts. The implications of this work could extend to applications in music therapy, automated composition, and interactive music systems.
Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To address these problems, we propose \textbf{FastTurn}, a unified framework for low-latency and robust turn detection. To reduce latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
Primary: QualiaLabs
All Institutions: QualiaLabs
FastTurn presents a unified framework for low-latency and robust turn detection in full-duplex dialogue systems. The technical contributions, particularly in integrating acoustic and semantic cues, represent a meaningful advancement in the field of audio processing and dialogue systems, with potential applications in various real-time communication scenarios.
The methodology presented in FastTurn is innovative, combining streaming CTC decoding with acoustic features to enhance turn detection in full-duplex dialogue systems. The architecture is well-structured, comprising three main components that progressively integrate semantic and acoustic cues. The use of a four-stage training pipeline is commendable, as it stabilizes the optimization process and aligns speech and text modalities effectively. However, the reliance on CTC for initial transcription raises concerns about potential error propagation in noisy environments.
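At its simplest, the streaming CTC component reduces to greedy collapse of per-frame argmax labels (merge repeats, drop blanks), which can be applied incrementally to partial audio so the system forms early hypotheses for turn-taking. A minimal sketch (token ids and blank index are illustrative):

```python
def ctc_greedy_collapse(frame_ids, blank=0):
    """Greedy CTC decoding: merge consecutive repeats, drop blanks.
    Running this on a growing prefix of frames yields early partial
    transcripts without waiting for the utterance to end."""
    out, prev = [], None
    for t in frame_ids:
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

# Frames "5 5 0 5 3 3 0" (0 = blank): the repeat merges, the blank
# separates the two 5s, so the collapsed output is [5, 5, 3].
hyp = ctc_greedy_collapse([5, 5, 0, 5, 3, 3, 0])
```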
The experiments are thorough, utilizing a diverse set of datasets and a comprehensive evaluation framework. The introduction of a new test set with realistic human dialogue scenarios is a significant contribution, allowing for better assessment of the model's performance in practical applications. The results demonstrate that FastTurn outperforms existing baselines in terms of accuracy and latency, underscoring its effectiveness. However, the paper could benefit from additional comparisons with more recent models in the field to contextualize its performance.
The paper provides sufficient details regarding the model architecture, training strategy, and evaluation metrics, which aids in reproducibility. However, the absence of publicly available code or a demo could hinder independent verification of results. Clear instructions for reproducing the experiments would enhance the paper's impact.
One limitation is the potential sensitivity of the model to CTC errors, especially in overlapping speech scenarios. Additionally, while the model shows robustness in various conditions, the performance on English datasets did not meet expectations, indicating a need for further optimization. The paper also does not address the computational resources required for training and inference, which could be a barrier for broader adoption.
The FastTurn framework has significant implications for real-time spoken dialogue systems, particularly in applications requiring low-latency interaction, such as virtual assistants and customer service bots. By improving turn detection, it can enhance user experience and facilitate more natural conversations. The release of the new dataset also opens avenues for future research in dialogue systems, potentially leading to advancements in multimodal interaction technologies.
We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the waveform via a neural vocoder, along with a predictive branch that performs spectrogram-domain enhancement, providing complementary cues. Outputs from both branches are fused by a post-processing module, which also performs bandwidth extension to generate the enhanced waveform at 48 kHz, later downsampled to the original sampling rate. This generative-predictive fusion improves robustness and perceptual quality, achieving top performance in the blind-test phase and ranking 1st in the objective evaluation. Audio examples are available at https://xiaobin-rong.github.io/gap-urgenet_demo.
Primary: Nanjing University
All Institutions: Nanjing University
The main contribution of this paper is the introduction of GAP-URGENet, a novel generative-predictive fusion framework for universal speech enhancement that demonstrates state-of-the-art performance in the ICASSP 2026 URGENT Challenge. This work significantly advances the field of speech enhancement by effectively integrating generative and predictive methodologies, providing a comprehensive solution to improve speech quality across diverse conditions.
The methodology presented in GAP-URGENet is innovative, combining generative and predictive models to enhance speech quality effectively. The generative branch focuses on full-stack speech restoration using self-supervised learning, while the predictive branch enhances the spectrogram domain, allowing for complementary improvements. The fusion of outputs from both branches through a post-processing module is a significant contribution, particularly the bandwidth extension to achieve high-quality waveforms. The architecture is well-structured, leveraging existing models like DeWavLM and TF-GridNet, which indicates a thoughtful integration of prior work with novel enhancements.
The experimental setup is robust, utilizing comprehensive datasets from the URGENT Challenge, which enhances the credibility of the results. The paper reports substantial improvements over baseline models, with detailed metrics provided for various objective evaluations (DNSMOS, NISQA, UTMOS, etc.), showcasing the effectiveness of the proposed framework. The results indicate that GAP-URGENet achieves superior performance in both objective and subjective evaluations, validating the proposed approach.
The paper provides sufficient details regarding the architecture, training process, and datasets used, which facilitates reproducibility. However, the absence of a public code repository limits the ease of reproduction for other researchers. Including a link to the code or detailed implementation instructions would enhance reproducibility significantly.
While the paper demonstrates impressive results, it does not address potential limitations such as the computational cost of the model, the need for extensive training data, or the model's performance in real-world applications outside the challenge context. Additionally, the reliance on specific architectures may limit generalizability to other tasks or domains.
The implications of this research extend to various applications in speech enhancement, including telecommunications, assistive technologies for the hearing impaired, and voice recognition systems. By improving speech quality in challenging conditions, the framework can enhance user experience across multiple platforms, making it a valuable contribution to the field of audio processing.
Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet's interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.
Primary: National University of Singapore
All Institutions: National University of Singapore, Shenzhen Loop Area Institute, Nanjing University, Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong
The paper presents PhiNet, a self-interpretable speaker verification network that enhances transparency in decision-making by leveraging phonetic evidence. This contribution is significant as it addresses the critical need for interpretability in automatic speaker verification systems, bridging the gap between ASV and forensic speaker comparison.
The proposed PhiNet framework introduces a novel approach to speaker verification by integrating phonetic interpretability into the decision-making process. The architecture is designed to provide both local and global interpretability, allowing users to understand the contribution of individual phonemes to the verification score. This is achieved through a phonetic trait extractor and a decision layer that weights phonetic contributions based on their distinctiveness. The methodology is well-structured, leveraging existing neural network techniques while innovatively adapting them to enhance interpretability in ASV systems.
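A distinctiveness-weighted decision of the kind described can be sketched as follows; the similarities and weights are invented toy values (the paper's actual weighting is learned), but the structure shows why the score is inspectable per phoneme:

```python
import numpy as np

def phonetic_verification_score(sims, weights):
    """Global verification score as a distinctiveness-weighted sum of
    per-phoneme similarities; returning the per-phoneme contributions
    makes each phoneme's role in the decision inspectable."""
    sims = np.asarray(sims, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalize weights
    contributions = dict(enumerate(w * sims))
    return float(np.dot(w, sims)), contributions

sims = [0.9, 0.2, 0.7]       # per-phoneme similarity between two utterances
weights = [2.0, 1.0, 1.0]    # e.g. one phoneme treated as more distinctive
score, contributions = phonetic_verification_score(sims, weights)
```

Dropping one phoneme and rescoring is exactly the kind of leave-one-phoneme-out probe the evaluation below uses to test how much each phoneme drives the decision.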
The experiments conducted on benchmark datasets such as VoxCeleb, SITW, and LibriSpeech demonstrate that PhiNet achieves competitive performance compared to traditional black-box ASV models. The evaluation metrics, including equal error rate (EER) and minimum detection cost function (minDCF), provide a solid basis for performance comparison. Additionally, the paper includes qualitative assessments of interpretability through visualizations and leave-$i$th-phoneme-out experiments, which substantiate the claims of enhanced interpretability.
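For completeness, the EER metric used in this evaluation is the operating point where the false-accept and false-reject rates cross; a brute-force reference implementation on toy scores:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep thresholds over all observed scores; EER is taken where the
    false-accept rate (impostors at or above threshold) is closest to
    the false-reject rate (targets below threshold)."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, eer = 1.0, None
    for th in thresholds:
        far = np.mean(impostor_scores >= th)
        frr = np.mean(target_scores < th)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

targets = np.array([0.9, 0.8, 0.7, 0.6])
impostors = np.array([0.5, 0.4, 0.65, 0.3])
eer = equal_error_rate(targets, impostors)   # one impostor above 0.6
```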
The authors provide a GitHub repository with the code for PhiNet, which is essential for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameter settings and data preprocessing steps, to facilitate easier reproduction of results by other researchers.
One limitation is the potential for cognitive bias in phoneme weighting, which could affect the model's interpretability and robustness. Additionally, while the framework shows promise, the reliance on phonetic traits may limit its generalizability to diverse speaker populations or languages not represented in the training data. The paper also does not address the computational complexity of the model, which may hinder real-time applications.
The integration of phonetic interpretability into ASV systems has significant implications for high-accountability applications, such as forensic analysis and security. By providing interpretable results, PhiNet can enhance user trust in automated systems and facilitate error tracing in speaker verification tasks. This work could pave the way for more transparent AI systems in sensitive applications, contributing positively to the field of machine learning and audio processing.
Recent ECG--language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical leads or temporal segments may be missing due to electrode detachment, motion artifacts, or signal corruption, causing severe degradation of cross-modal semantic alignment. In this paper, we propose \textbf{SCAR}, a robust ECG--language pretraining framework for \textbf{S}emantic \textbf{C}ompensation via \textbf{A}dversarial \textbf{R}emoval. SCAR improves robustness by explicitly training the model to remain semantically aligned under semantically critical missingness and to recover diagnostic meaning from the remaining visible evidence. Specifically, we introduce a differentiable adversarial masker that removes the most alignment-critical spatio-temporal ECG tokens during training, forcing the ECG encoder to learn representations that remain semantically aligned with clinical text even when primary diagnostic evidence is missing. Under such adversarial corruption, we equip the ECG encoder with a semantically supervised adaptive selector that learns to reweight the remaining visible tokens and compensate with secondary yet diagnostically informative morphological cues. To evaluate robustness beyond classification accuracy, we further introduce the Counterfactual Missingness Resolution Score (CMRS), which quantifies how well features preserve diagnostic semantics under missingness. Experiments on $6$ datasets show that SCAR consistently improves semantic robustness under joint lead and temporal missingness, with particularly clear advantages in harder cases where primary diagnostic evidence is unavailable, while also yielding stronger linear-probing transferability.
Primary: University of Science and Technology Beijing
All Institutions: School of Intelligence Science and Technology, School of Computer and Communication Engineering, University of Science and Technology Beijing
The paper presents SCAR, a robust ECG--language pretraining framework that enhances zero-shot ECG diagnosis by explicitly addressing the challenges posed by missing data through innovative adversarial techniques. The methodology and results contribute meaningfully to the field of machine learning in healthcare, particularly in improving the robustness of diagnostic models under real-world conditions.
The proposed SCAR framework introduces a novel approach to address the challenge of missing ECG data during zero-shot diagnosis by employing adversarial masking to force the model to learn robust representations. The methodology is well-structured, utilizing a differentiable adversarial masker and a semantically supervised adaptive selector, which collectively enhance the model's ability to maintain semantic alignment even under partial observation. The introduction of the Counterfactual Missingness Resolution Score (CMRS) as a metric for evaluating robustness adds significant value to the methodology, allowing for a more nuanced assessment of performance under missingness.
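The adversarial-removal idea can be illustrated with a deliberately simplified, non-differentiable stand-in for the paper's masker: score each ECG token by its alignment with the text embedding and zero out the top-k most alignment-critical tokens, so the encoder is forced onto secondary evidence. All names and the hard top-k selection are illustrative assumptions; the actual module is differentiable and trained end-to-end:

```python
import numpy as np

def adversarial_mask(token_feats, text_emb, k):
    """Remove the k tokens most critical to text alignment (a hard,
    non-differentiable stand-in for the paper's differentiable masker):
    score tokens by cosine similarity to the text embedding and zero
    out the top-k."""
    sims = token_feats @ text_emb / (
        np.linalg.norm(token_feats, axis=1) * np.linalg.norm(text_emb))
    drop = np.argsort(sims)[-k:]             # most alignment-critical tokens
    masked = token_feats.copy()
    masked[drop] = 0.0
    return masked, drop

# Toy setup: token 2 is perfectly aligned with the text embedding, so
# the masker removes exactly that token.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
text = tokens[2] / np.linalg.norm(tokens[2])
masked, dropped = adversarial_mask(tokens, text, k=1)
```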
The experiments are comprehensive, utilizing six datasets to validate the effectiveness of SCAR against existing baselines. The results demonstrate significant improvements in both zero-shot classification performance and robustness under various missingness scenarios, particularly highlighting the advantages of the proposed methods in harder cases where primary diagnostic evidence is absent. The ablation studies effectively illustrate the contributions of each component of the framework, reinforcing the robustness of the findings.
The paper provides sufficient implementation details, including training protocols, dataset descriptions, and evaluation metrics, which support reproducibility. However, the absence of a publicly available code repository or demo limits the ease of reproduction for external researchers.
One limitation is the reliance on specific datasets for training and evaluation, which may affect the generalizability of the results to other ECG datasets or clinical settings. Additionally, while the proposed methods show improvements, the paper does not extensively discuss the computational costs associated with the adversarial masking and adaptive selection processes during training.
The implications of this work are significant for clinical practice, as it addresses a common issue in ECG analysis: missing data due to various artifacts. The ability to maintain diagnostic accuracy under such conditions can enhance the reliability of ECG-based diagnoses in real-world scenarios, potentially leading to better patient outcomes. The framework could also inspire further research in robust multimodal learning across other medical domains.
Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target speech length. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese, T5Gemma-TTS achieves a statistically significant speaker-similarity gain on Japanese over XTTSv2 (0.677 vs. 0.622; non-overlapping 95% confidence intervals) and the highest numerical Korean speaker similarity (0.747) despite Korean not being included in training, although this margin over XTTSv2 (0.741) is not statistically conclusive. It also attains the lowest numerical Japanese character error rate among five baselines (0.126), though this ranking should be interpreted cautiously because of partial confidence-interval overlap with Kokoro. English results on LibriSpeech should be viewed as an upper-bound estimate because LibriHeavy is a superset of LibriSpeech. Using the same checkpoint, disabling PM-RoPE at inference causes near-complete synthesis failure: CER degrades from 0.129 to 0.982 and duration accuracy drops from 79% to 46%. Code and weights are available at https://github.com/Aratako/T5Gemma-TTS.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, Graduate School of Engineering, Third Intelligence, Matsuo Institute, Department of Technology Management for Innovation
The main contribution of this work is the development of T5Gemma-TTS, a novel encoder-decoder model that enhances multilingual zero-shot text-to-speech synthesis through innovative architectural improvements and rigorous experimental validation. This research represents a meaningful advancement in the field of speech synthesis, addressing key challenges and setting a foundation for future exploration in multilingual and cross-lingual applications.
The paper introduces T5Gemma-TTS, an encoder-decoder model that effectively addresses the limitations of autoregressive decoder-only architectures by maintaining persistent text conditioning through cross-attention mechanisms. The integration of Progress-Monitoring Rotary Position Embedding (PM-RoPE) is a significant methodological advancement, allowing for improved duration control during speech synthesis. The model's architecture is well-founded on the T5Gemma pretrained backbone, which enhances its linguistic capabilities without requiring phoneme conversion. The methodology is robust and clearly articulated, demonstrating a thoughtful approach to overcoming existing challenges in zero-shot TTS.
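The PM-RoPE idea can be sketched by driving a standard rotary embedding with normalized progress p = step / total_steps rather than the absolute index, so the decoder sees how far through the target duration it is. This is a sketch of the mechanism as described; the paper's exact parameterization may differ:

```python
import numpy as np

def progress_rope(x, step, total_steps, base=10000.0):
    """Rotary position embedding driven by normalized progress
    p = step / total_steps: each feature pair is rotated by an angle
    proportional to p, injecting a duration-tracking signal."""
    d = x.shape[-1]
    p = step / total_steps                      # normalized progress in [0, 1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # per-pair rotation frequencies
    angles = p * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin        # standard 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

v = np.ones(8)
start = progress_rope(v, step=0, total_steps=100)   # p = 0: identity rotation
mid = progress_rope(v, step=50, total_steps=100)    # p = 0.5: rotated
```

Being a pure rotation, the transform preserves vector norms, so it changes only relative phase information, which is consistent with the reported ablation where disabling it at inference breaks duration tracking rather than degrading it gracefully.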
The experimental evaluation is comprehensive, involving a substantial training dataset of 170,000 hours of multilingual speech. The results indicate statistically significant improvements in speaker similarity and character error rates compared to existing models. The paper provides detailed comparisons against multiple baselines, showcasing the model's effectiveness across different languages, including Japanese, Chinese, and Korean. The use of confidence intervals adds rigor to the statistical claims, although some results should be interpreted cautiously due to overlapping intervals.
The authors have made the model weights and code publicly available, which is a positive step towards reproducibility. However, the paper would benefit from more detailed implementation specifics and hyperparameter settings to facilitate easier replication of results by other researchers.
The paper acknowledges several limitations, including higher word error rates on unseen European languages and a real-time factor that may not meet the demands of real-time applications. Additionally, the authors note that the model's performance on certain metrics may be influenced by the codec's limitations, indicating areas for future improvement.
The potential for misuse of zero-shot voice cloning technology is a significant concern, as highlighted by the authors. They emphasize the need for ethical considerations and safeguards in deploying such technologies, which is crucial given the implications for privacy and security. The authors advocate for responsible use and further research into detection methods for synthetic speech.
Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends the Cross-Data Multilevel Attention (CDMA) framework, initially validated on Italian, to investigate these dimensions using a Chinese Mandarin dataset with Electroencephalography (EEG) recordings. We systematically fuse read speech with spontaneous speech across different emotional valences (positive, neutral, negative) to investigate whether emotional arousal is a more critical factor than valence polarity in enhancing detection performance in speech. Additionally, we establish the first neurophysiological validation for a speech-based depression model by correlating its predictions with neural oscillatory patterns during emotional face processing. Our results demonstrate strong cross-linguistic generalizability of the CDMA framework, achieving state-of-the-art performance (F1-score up to 89.6%) on the Chinese dataset, which is comparable to the previous Italian validation. Critically, emotionally valenced speech (both positive and negative) significantly outperformed neutral speech. This comparable performance between positive and negative tasks supports the emotional arousal hypothesis. Most importantly, EEG analysis revealed significant correlations between the model's speech-derived depression estimates and neural oscillatory patterns (theta and alpha bands), demonstrating alignment with established neural markers of emotional dysregulation in depression. This alignment, combined with the model's cross-linguistic robustness, not only supports that the CDMA framework's approach is a universally applicable and neurobiologically validated strategy but also establishes a novel paradigm for the neurophysiological validation of computational mental health models.
Primary: Zhejiang University
All Institutions: Zhejiang University, Università della Campania “Luigi Vanvitelli”, UKRI, EPSRC, National Natural Science Foundation of China, State Key Laboratory of Brain-Machine Intelligence
This study provides a novel approach to depression detection by integrating speech analysis and neurophysiological validation, demonstrating the critical role of emotional arousal over valence in enhancing detection performance. The methodology and results contribute significantly to the field of computational mental health, offering a framework that is both innovative and applicable across linguistic boundaries.
The paper employs a robust methodology by extending the Cross-Data Multilevel Attention (CDMA) framework to a new linguistic context (Chinese Mandarin) and integrating EEG data for neurophysiological validation. The fusion of read and spontaneous speech across emotional valences is a significant methodological advancement, allowing for a nuanced understanding of emotional arousal in depression detection. The attention mechanisms used are well-justified and effectively enhance the model's performance.
The experiments are comprehensive, utilizing a well-defined dataset (MODMA) and employing rigorous cross-validation techniques. The reported F1-scores (up to 89.6%) demonstrate state-of-the-art performance, and the inclusion of EEG analysis adds a layer of validation that strengthens the findings. The statistical comparisons between different emotional contexts and their impact on detection performance are well-articulated.
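The paper's central validation step, correlating speech-derived depression estimates with EEG band power, can be sketched compactly. The snippet below is a minimal illustration, not the study's analysis pipeline: the sampling rate, participant count, and per-participant data are all hypothetical, and band power is estimated with a plain FFT periodogram rather than whatever spectral estimator the authors used.

```python
import numpy as np

def band_power(eeg, fs, lo, hi):
    """Mean spectral power of a single EEG channel in [lo, hi] Hz via the FFT."""
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(eeg)) ** 2 / len(eeg)
    mask = (freqs >= lo) & (freqs <= hi)
    return psd[mask].mean()

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D arrays."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    return float(np.mean(x * y))

# Hypothetical per-participant data (names and sizes are illustrative only):
rng = np.random.default_rng(0)
fs = 250                                  # EEG sampling rate in Hz (assumed)
scores = rng.uniform(0, 1, size=20)       # speech-derived depression estimates
eeg = rng.standard_normal((20, fs * 10))  # 10 s of EEG per participant

theta = np.array([band_power(x, fs, 4, 8) for x in eeg])   # theta: 4-8 Hz
alpha = np.array([band_power(x, fs, 8, 13) for x in eeg])  # alpha: 8-13 Hz
r_theta = pearson_r(scores, theta)
r_alpha = pearson_r(scores, alpha)
```

With real data, the sign and significance of `r_theta` and `r_alpha` are what would be tested against the neural markers of emotional dysregulation the paper cites.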
The paper provides detailed descriptions of the data acquisition, preprocessing, and model training processes, which supports reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
The study acknowledges limitations such as the modest sample size for EEG recordings and the correlational nature of the findings, which precludes causal inferences. Additionally, the lack of information regarding participants' medication status and comorbidities could influence the results.
The findings have significant implications for clinical practices in mental health, particularly in developing objective diagnostic tools for depression that can be applied across different languages. The neurophysiological validation of speech-based models could pave the way for more interpretable and trustworthy AI systems in mental health assessment.
Audio-Visual Navigation (AVN) requires an embodied agent to navigate toward a sound source by utilizing both vision and binaural audio. A core challenge arises in complex acoustic environments, where binaural cues become intermittently unreliable, particularly when generalizing to previously unheard sound categories. To address this, we propose RAVN (Reliability-Aware Audio-Visual Navigation), a framework that conditions cross-modal fusion on audio-derived reliability cues, dynamically calibrating the integration of audio and visual inputs. RAVN introduces an Acoustic Geometry Reasoner (AGR) that is trained with geometric proxy supervision. Using a heteroscedastic Gaussian NLL objective, AGR learns observation-dependent dispersion as a practical reliability cue, eliminating the need for geometric labels during inference. Additionally, we introduce Reliability-Aware Geometric Modulation (RAGM), which converts the learned cue into a soft gate to modulate visual features, thereby mitigating cross-modal conflicts. We evaluate RAVN on SoundSpaces using both Replica and Matterport3D environments, and the results show consistent improvements in navigation performance, with notable robustness in the challenging unheard sound setting.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The paper presents a significant advancement in audio-visual navigation through the introduction of RAVN, a reliability-aware framework that enhances navigation performance in complex acoustic environments. The methodology is innovative, and the empirical results demonstrate its effectiveness, marking a meaningful contribution to the field of embodied AI and multimodal integration.
The paper introduces a novel framework, RAVN, that effectively integrates reliability-aware geometric fusion for audio-visual navigation. The methodology is well-structured, leveraging an Acoustic Geometry Reasoner (AGR) to derive reliability cues from audio inputs and a Reliability-Aware Geometric Modulation (RAGM) mechanism to adaptively gate visual features based on these cues. This approach is innovative in its use of heteroscedastic Gaussian NLL objectives to model uncertainty, which is a significant advancement over traditional static fusion methods. The design is theoretically sound and aligns well with human-like decision-making processes in ambiguous auditory environments.
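The two ingredients the review highlights, the heteroscedastic Gaussian NLL and the reliability-driven soft gate, can be sketched in a few lines. This is a minimal sketch under assumptions: the gate's exact functional form, feature dimensions, and the sigmoid parameterization are illustrative, not RAVN's published architecture.

```python
import numpy as np

def gaussian_nll(mu, log_var, target):
    """Heteroscedastic Gaussian NLL: the predicted log-variance down-weights the
    squared error on uncertain observations, so the network learns to report
    its own unreliability -- the cue RAVN reads at inference time."""
    return float(np.mean(0.5 * (log_var + (target - mu) ** 2 / np.exp(log_var))))

def reliability_gate(visual_feat, log_var, k=1.0):
    """RAGM-style soft gate (assumed form): high audio uncertainty pushes the
    gate toward 1 so vision dominates; confident audio geometry lowers it."""
    g = 1.0 / (1.0 + np.exp(-k * log_var))  # sigmoid, values in (0, 1)
    return g * visual_feat                  # broadcasts over the feature dim

rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 1))        # AGR's geometric proxy prediction
log_var = rng.standard_normal((4, 1))   # observation-dependent dispersion
loss = gaussian_nll(mu, log_var, rng.standard_normal((4, 1)))
gated = reliability_gate(rng.standard_normal((4, 256)), log_var)
```

The key design point survives even in this toy form: because `log_var` appears in the training loss, no geometric labels or reliability annotations are needed at inference, only the learned dispersion head.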
The experimental setup is robust, utilizing two well-known datasets (Replica and Matterport3D) that provide a comprehensive evaluation of the proposed method's performance. The results demonstrate significant improvements in navigation success rates, particularly in challenging scenarios involving unheard sounds. The quantitative metrics (Success Rate, Success weighted by Path Length, and Success weighted by Number of Actions) are appropriate and effectively illustrate the advantages of the proposed method over existing baselines. Qualitative results further support the claims, showing improved trajectory following and decision-making stability.
The paper provides sufficient detail regarding the experimental setup, including training protocols, dataset descriptions, and evaluation metrics. However, the absence of a publicly available code repository limits full reproducibility. Future work should consider releasing the code and models to facilitate further research and validation of the findings.
One limitation is the reliance on simulated environments, which may not fully capture the complexities of real-world acoustic conditions. Additionally, while the framework shows promise, its performance in extremely noisy or dynamic environments remains untested. The paper also does not address potential computational overhead introduced by the reliability-aware mechanisms, which could affect real-time applications.
The proposed framework has significant implications for the development of more robust embodied agents capable of navigating complex environments. Applications could extend to robotics, autonomous vehicles, and assistive technologies, enhancing their ability to operate in real-world scenarios where audio-visual cues are unreliable. The focus on reliability-aware fusion could lead to advancements in human-robot interaction and improve the safety and efficiency of autonomous systems.
While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary multi-objective score fusion framework that jointly minimizes detection error and system complexity. We explore two encodings optimized by NSGA-II: binary-coded detector selection for score averaging and a real-valued scheme that optimizes detector weights for a weighted sum. Experiments on the ASVspoof 5 dataset with 36 SSL-based detectors show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines. The real-valued variant achieves 2.37% EER (0.0684 minDCF) and identifies configurations that match state-of-the-art performance while significantly reducing system complexity, requiring only half the parameters. Our method also provides a diverse set of trade-off solutions, enabling deployment choices that balance accuracy and computational cost.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Czech Science Foundation, e-INFRA CZ project, Ministry of Education, Youth and Sports of the Czech Republic
The paper presents a novel multi-objective evolutionary framework for fusing deepfake speech detectors, achieving state-of-the-art performance while significantly reducing system complexity. This work is a substantial contribution to the field of audio machine learning, providing a comprehensive approach to tackle the challenges posed by deepfake technologies.
The paper introduces an innovative multi-objective evolutionary framework for fusing deepfake speech detectors using NSGA-II, addressing the critical balance between detection accuracy and system complexity. It explores two encoding strategies—binary-coded detector selection and real-valued weight optimization—demonstrating a systematic approach to ensemble learning that is both effective and efficient. The methodology is well-structured, leveraging evolutionary algorithms to navigate the trade-offs inherent in deepfake detection.
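The two objectives NSGA-II trades off, fused detection error and system complexity, can be written as a single evaluation function. The sketch below uses synthetic scores and a crude threshold-sweep EER; it is an illustration of the encoding described above, not the paper's implementation, and an NSGA-II driver (e.g. from a library such as pymoo) would call `objectives` once per candidate.

```python
import numpy as np

def eer(scores, labels):
    """Approximate equal error rate (label 1 = spoof) via a threshold sweep."""
    best = 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)  # false accepts of bona fide
        frr = np.mean(scores[labels == 1] < t)   # false rejects of spoofs
        best = min(best, max(far, frr))
    return best

def objectives(weights, score_matrix, labels, params):
    """Two objectives for NSGA-II: fused EER and the parameter count of active
    detectors. The real-valued encoding optimizes `weights` directly; the
    binary encoding is the special case weights in {0, 1}."""
    w = np.clip(weights, 0.0, None)
    if w.sum() == 0:
        return 1.0, 0.0
    fused = score_matrix @ (w / w.sum())  # weighted-sum score fusion
    complexity = params[w > 0].sum()      # only selected detectors count
    return eer(fused, labels), complexity

# Toy pool of 6 detectors scoring 200 utterances (synthetic, for illustration).
rng = np.random.default_rng(1)
n_det, n_utt = 6, 200
labels = rng.integers(0, 2, n_utt)
scores = rng.normal(labels, 1.0, size=(n_det, n_utt)).T   # (utt, det)
params = rng.integers(90, 320, n_det) * 1_000_000         # per-detector sizes
err, size = objectives(rng.random(n_det), scores, labels, params)
```

Because both objectives are returned together, the optimizer's output is a Pareto front rather than a single ensemble, which is exactly what enables the accuracy/complexity deployment choices the abstract describes.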
The authors conduct extensive experiments on the ASVspoof 5 dataset, utilizing a diverse pool of 36 SSL-based detectors. The results are robust, showcasing the superiority of the proposed methods over traditional fusion techniques, including simple averaging and logistic regression. The achieved EER of 2.37% indicates a significant performance improvement while reducing system complexity, underscoring the effectiveness of the proposed approach.
The paper provides detailed implementation information, including parameter settings, computational resources, and the use of a GitHub repository for code access. The thoroughness of the experimental setup and the availability of the code enhance the reproducibility of the results, allowing other researchers to validate and build upon this work.
While the proposed method is effective, it is limited by its reliance on score-level fusion, which may overlook deeper interactions that could be exploited through joint fine-tuning of the models. Additionally, the performance is constrained by the quality of the underlying detectors, suggesting that optimizing these base models could further enhance the fusion outcomes.
This research has significant implications for the field of deepfake detection, particularly in enhancing the robustness and efficiency of voice biometric systems. The ability to balance performance and complexity in detector fusion can lead to more practical applications in security and authentication, addressing the growing concerns surrounding deepfake technology.
For people with noise sensitivity, everyday soundscapes can be overwhelming. Existing tools such as active noise cancellation reduce discomfort by suppressing the entire acoustic environment, often at the cost of awareness of surrounding people and events. We present Sona, an interactive mobile system for real-time soundscape mediation that selectively attenuates bothersome sounds while preserving desired audio. Sona is built on a target-conditioned neural pipeline that supports simultaneous attenuation of multiple overlapping sound sources, overcoming the single-target limitation of prior systems. It runs in real time on-device and supports user-extensible sound classes through in-situ audio examples, without retraining. Sona is informed by a formative study with 68 noise-sensitive individuals. Through technical benchmarking and an in-situ study with 10 participants, we show that Sona achieves low-latency, multi-target attenuation suitable for live listening, and enables meaningful reductions in bothersome sounds while maintaining awareness of surroundings. These results point toward a new class of personal AI systems that support comfort and social participation by mediating real-world acoustic environments.
Primary: University of Michigan
All Institutions: University of Michigan, University of California, Irvine
The main contribution of this paper is the development of Sona, an interactive mobile system that enables real-time, multi-target sound attenuation for individuals with noise sensitivity. This work represents a meaningful advancement in audio processing and accessibility technology, with the potential to significantly improve the daily experiences of users in noisy environments.
The methodology employed in Sona is innovative, utilizing a target-conditioned neural pipeline that allows for real-time attenuation of multiple overlapping sound sources. This is a significant advancement over existing systems that typically focus on single-target noise cancellation. The incorporation of user-extensible sound classes through in-situ examples without the need for retraining is a notable feature that enhances user personalization and adaptability. The formative study involving 68 noise-sensitive individuals provides a solid foundation for understanding user needs and preferences, which is crucial for the design of the system.
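The multi-target attenuation idea, removing several bothersome classes at user-chosen levels while passing everything else through, can be illustrated with soft spectral masks. This is a hedged sketch: the mask combination rule, gain semantics, and array shapes below are assumptions for illustration, not Sona's target-conditioned pipeline.

```python
import numpy as np

def attenuate(spec, masks, gains):
    """Apply per-class attenuation to a magnitude spectrogram `spec` (F, T).
    `masks[k]` is a soft mask in [0, 1] for sound class k (e.g. produced by a
    target-conditioned separation network) and `gains[k]` is the user's level
    for that class (0 = remove entirely, 1 = keep unchanged)."""
    suppress = np.zeros_like(spec)
    for mask, gain in zip(masks, gains):
        suppress += (1.0 - gain) * mask        # energy each class removes
    # Unmasked regions keep a factor of 1, so desired audio passes through.
    return spec * np.clip(1.0 - suppress, 0.0, 1.0)

rng = np.random.default_rng(0)
spec = rng.uniform(0, 1, (257, 100))            # mixture magnitude spectrogram
masks = rng.uniform(0, 1, (2, 257, 100))        # two overlapping target classes
out = attenuate(spec, masks, gains=[0.1, 0.5])  # mostly mute one, halve the other
```

Supporting several masks simultaneously is what distinguishes this from single-target suppression, and keeping the residual (unmasked) spectrogram intact is what preserves the situational awareness the in-situ study measured.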
The experimental evaluation is robust, featuring both technical benchmarking and an in-situ study with 10 participants. The results demonstrate low-latency performance and effective sound attenuation while preserving desired audio, which is critical for maintaining situational awareness. The use of subjective measures to assess user comfort and soundscape mediation effectiveness adds credibility to the findings. However, the small sample size in the in-situ study may limit the generalizability of the results.
The paper does not provide explicit details regarding the implementation or access to the code, which raises concerns about reproducibility. While the methodology is described, without a publicly available implementation or detailed algorithmic descriptions, it may be challenging for other researchers to replicate the results or build upon this work.
One limitation is the small participant size in the in-situ study, which may not adequately represent the broader population of noise-sensitive individuals. Additionally, while the system allows for user-defined sound classes, the effectiveness of the system in highly dynamic or complex sound environments remains to be fully evaluated. There may also be challenges in the real-world application of the technology, such as varying user preferences and environmental conditions.
The potential applications of Sona are significant, particularly for individuals with noise sensitivity, including those with neurodivergent conditions. By enabling users to manage their auditory environments, Sona could enhance comfort and social participation, leading to improved quality of life. The implications extend beyond personal use, as the technology could be adapted for various settings, including workplaces, educational environments, and public spaces.
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic sound event detection (SED) dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
Primary: unknown
All Institutions: unknown
FineLAP presents a novel training paradigm that effectively combines heterogeneous supervision for fine-grained audio-language pretraining. The comprehensive methodology and robust experimental validation position it as a significant contribution to the field of audio understanding, with potential applications across diverse domains.
The methodology presented in FineLAP is innovative, addressing the challenge of heterogeneous supervision in audio-language models. The introduction of a dual-stream sigmoid loss and a decoupled audio projector allows for effective learning from both clip- and frame-level annotations. This approach is well-justified, as it leverages the strengths of existing models while introducing novel components that enhance performance across various tasks. The use of cluster-based sampling for negative phrases is particularly noteworthy, as it mitigates the scarcity of frame-level annotations and improves the model's ability to generalize.
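A sigmoid-based contrastive loss is what lets the two supervision streams coexist without a shared softmax over the batch. The sketch below shows a SigLIP-style pairwise sigmoid loss applied to both granularities; the temperature, bias, balancing weight `lam`, and embedding shapes are assumptions for illustration, not FineLAP's reported hyperparameters.

```python
import numpy as np

def sigmoid_contrastive(a, t, temperature=10.0, bias=-10.0):
    """SigLIP-style pairwise sigmoid loss: every (audio, text) pair is an
    independent binary problem (+1 on the diagonal, -1 elsewhere), so no
    batch-wide softmax normalization is required."""
    logits = temperature * a @ t.T + bias        # (B, B) similarity logits
    labels = 2.0 * np.eye(len(a)) - 1.0          # +1 matched, -1 otherwise
    # softplus(-labels * logits) == -log sigmoid(labels * logits)
    return float(np.mean(np.log1p(np.exp(-labels * logits))))

def dual_stream_loss(clip_a, clip_t, frame_a, frame_t, lam=0.5):
    """One stream aligns clip embeddings with captions, the other aligns frame
    embeddings with event phrases; `lam` is an assumed balancing weight."""
    return sigmoid_contrastive(clip_a, clip_t) + lam * sigmoid_contrastive(frame_a, frame_t)

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
loss = dual_stream_loss(unit(rng.standard_normal((8, 64))),   # clip audio
                        unit(rng.standard_normal((8, 64))),   # clip text
                        unit(rng.standard_normal((16, 64))),  # frame audio
                        unit(rng.standard_normal((16, 64))))  # frame phrases
```

Because each pair contributes independently, clip-level and frame-level batches can differ in size and be mixed freely, which is precisely what heterogeneous supervision requires.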
The experiments conducted are extensive and demonstrate the effectiveness of FineLAP across multiple audio understanding tasks, achieving state-of-the-art results. The evaluation includes a variety of benchmarks, and the ablation studies provide clear insights into the contributions of each component of the model. The results are compelling, showing significant improvements over existing methods, particularly in sound event detection and audio-text retrieval.
The paper provides sufficient implementation details, including training parameters and dataset descriptions, which are crucial for reproducibility. The authors also commit to releasing the code and dataset, which enhances the potential for other researchers to replicate and build upon their work.
Despite its strengths, FineLAP has limitations, such as its inability to handle variable-length audio inputs, which restricts its applicability in scenarios requiring long-form audio processing. Additionally, the focus on sound event detection may overlook other temporally grounded tasks, indicating areas for future exploration.
The advancements made in FineLAP have significant implications for audio understanding and multimodal learning, particularly in applications such as automated audio captioning, sound event detection, and audio editing. The model's ability to leverage heterogeneous data could lead to more robust and flexible audio-language systems, potentially benefiting various industries, including entertainment, accessibility, and security.
Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a training-free framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations without any labeled data or architectural modification. We evaluate TRACE on four benchmarks that span two languages using six speech foundation models. In PartialSpoof, TRACE achieves 8.08% EER, competitive with fine-tuned supervised baselines. In LlamaPartialSpoof, the most challenging benchmark featuring LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
Primary: College of Innovation and Technology, University of Michigan-Flint
All Institutions: College of Innovation and Technology, University of Michigan-Flint
The main contribution of this paper is the introduction of TRACE, a training-free framework for detecting partial audio deepfakes by analyzing the dynamics of speech foundation model embeddings. This work represents a significant advancement in audio forensics, offering a novel methodology that challenges traditional supervised detection approaches and opens new avenues for research in the field.
The proposed TRACE framework introduces a novel approach to detecting partial audio deepfakes without the need for training or labeled data. By analyzing the first-order dynamics of frozen speech foundation model representations, the methodology cleverly leverages the inherent properties of genuine speech versus manipulated audio. This is a significant departure from traditional supervised methods, showcasing a fresh perspective on audio forensics. However, the paper could benefit from a more detailed explanation of the embedding trajectory analysis and its computational efficiency.
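The core idea, that splice boundaries show up as abrupt jumps in an otherwise smooth embedding trajectory, is simple enough to sketch end to end. The robust z-score statistic below is an assumption standing in for whatever dispersion measure TRACE actually uses, and the trajectories are synthetic stand-ins for frozen foundation-model embeddings.

```python
import numpy as np

def trace_score(embeddings, eps=1e-8):
    """TRACE-style splice score on an embedding trajectory of shape (T, D):
    genuine speech yields smooth frame-to-frame transitions, so an unusually
    large first-order jump suggests a splice boundary. The median/MAD
    normalization here is an illustrative choice, not the paper's exact one."""
    deltas = np.linalg.norm(np.diff(embeddings, axis=0), axis=1)  # (T-1,)
    med = np.median(deltas)
    mad = np.median(np.abs(deltas - med)) + eps
    z = (deltas - med) / mad     # robust deviation per frame transition
    return z.max()               # utterance-level spoof score

# Smooth synthetic trajectory vs. one with an abrupt "splice" at frame 50.
t = np.linspace(0, 1, 100)[:, None]
smooth = np.concatenate([np.sin(3 * t), np.cos(3 * t)], axis=1)
spliced = smooth.copy()
spliced[50:] += 2.0              # embedding-space discontinuity
score_genuine = trace_score(smooth)
score_spliced = trace_score(spliced)
```

Nothing here is trained or fitted, which is the point: the foundation model stays frozen, and the only computation is a first-difference statistic over its per-frame outputs.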
The experiments are well-structured, evaluating TRACE on four benchmarks across two languages and using six different speech foundation models. The results demonstrate competitive performance against fine-tuned supervised baselines, particularly in challenging scenarios like LlamaPartialSpoof. However, the paper lacks comprehensive details on the datasets used, such as their sizes and the specific characteristics of the audio samples, which would enhance the understanding of the evaluation's robustness.
The paper does not provide sufficient details regarding the implementation of TRACE, such as the specific configurations of the speech foundation models used or the exact procedures for embedding trajectory analysis. This lack of detail may hinder reproducibility, as other researchers may struggle to replicate the results without clear guidelines or code availability.
One limitation is the reliance on the performance of existing speech foundation models, which may vary in quality and robustness. Additionally, while the training-free approach is innovative, it may not generalize well to all forms of audio manipulation beyond the tested benchmarks. The paper also does not address potential adversarial attacks against the proposed detection method.
The implications of TRACE are significant for the field of audio forensics, particularly in combating misinformation and enhancing the integrity of audio content. The training-free nature of the method could facilitate its adoption in real-world applications where rapid detection is critical, such as in media verification and security. However, further exploration of its applicability across diverse audio manipulation techniques is necessary.