Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. By supplying the tools needed for equitable training-data selection and front-end fine-tuning, DeepFense is expected to facilitate real-world deployment.
Primary: German Research Center for Artificial Intelligence (DFKI)
All Institutions: German Research Center for Artificial Intelligence (DFKI), University of Stuttgart, National Institute of Informatics, Technical University of Berlin
The main contribution of this paper is the introduction of DeepFense, a comprehensive, modular, and extensible framework for robust deepfake audio detection that facilitates reproducible research and addresses critical biases in model performance. This work significantly advances the field by providing a standardized toolkit that enhances the ability to benchmark and compare deepfake detection models effectively.
The methodology presented in DeepFense is robust and well-structured, focusing on creating a modular and extensible framework for deepfake audio detection. The use of a configuration-driven design allows for easy experimentation and reproducibility, which is a significant advancement in the field. The evaluation of more than 400 models and the inclusion of over 100 recipes enhance the toolkit's utility for researchers. The modular architecture facilitates the isolation of algorithmic innovations from implementation artifacts, which is critical for accurate benchmarking.
The experimental evaluation is extensive, covering a large-scale comparison of 400 models across 13 datasets, which is a notable strength of the paper. The results provide valuable insights into the impact of front-end feature extractors, back-end architectures, and training datasets on model performance. The findings regarding biases in model performance based on audio quality, speaker gender, and language are particularly important for ensuring equitable AI systems.
The paper emphasizes reproducibility through its open-source nature and the provision of a comprehensive toolkit that allows other researchers to replicate experiments easily. The use of a single YAML file for experiment configuration is a strong point, as it simplifies the process of sharing and reproducing results.
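The review does not show DeepFense's actual configuration schema, but the configuration-driven design it praises can be sketched as follows. All keys, component names, and the registry below are invented for illustration; they are not the toolkit's real API.

```python
# Hypothetical sketch of a configuration-driven recipe: a single declarative
# config names every component, and a registry resolves names to builders.
RECIPE = {
    "frontend": {"name": "wav2vec2", "freeze": False},
    "backend": {"name": "aasist", "hidden_dim": 128},
    "loss": "ocsoftmax",
    "augmentation": ["rawboost", "codec"],
    "train": {"dataset": "asvspoof19", "epochs": 50, "lr": 1e-4},
}

# Toy registry mapping component names to builders (stand-ins for real models).
REGISTRY = {
    "frontend": {"wav2vec2": lambda cfg: f"Wav2Vec2(freeze={cfg['freeze']})"},
    "backend": {"aasist": lambda cfg: f"AASIST(hidden={cfg['hidden_dim']})"},
}

def build(recipe):
    """Resolve each named component from the registry, as a config-driven
    toolkit typically would; sharing the recipe reproduces the experiment."""
    frontend = REGISTRY["frontend"][recipe["frontend"]["name"]](recipe["frontend"])
    backend = REGISTRY["backend"][recipe["backend"]["name"]](recipe["backend"])
    return frontend, backend

print(build(RECIPE))
```

The design benefit the review highlights falls out of this pattern: two labs running `build` on the same recipe instantiate the same pipeline, so comparisons isolate the algorithm rather than the implementation.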
While the paper presents a significant advancement, it acknowledges limitations such as the lack of a multi-dataset training pipeline and the focus solely on detection tasks. These limitations suggest areas for future research, including the need for more comprehensive training strategies that can mitigate biases.
The implications of this work are substantial, particularly in the context of increasing concerns about deepfake technology and its potential misuse. By providing a standardized toolkit for deepfake detection, DeepFense can help improve the robustness of systems used in real-world applications, thereby enhancing security and trust in voice biometric systems.
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100--1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Department of Electrical and Computer Engineering, Carl von Ossietzky Universität Oldenburg, Department of Medical Physics and Acoustics, Zhejiang University, College of Computer Science and Technology
This paper introduces a pioneering ANC system for open-ear smart glasses that operates without error microphones, demonstrating significant noise reduction capabilities in real-world settings. The innovative methodology and thorough experimental evaluation contribute meaningfully to the field of audio processing and wearable technology, paving the way for future advancements in auditory interfaces.
The paper presents a novel approach to active noise cancellation (ANC) specifically designed for open-ear smart glasses, which traditionally face challenges due to their non-occlusive design. The methodology leverages a dual-pipeline architecture that separates the estimation of noise propagation and the generation of anti-noise signals, utilizing a neural network for virtual in-ear sensing. This innovative approach circumvents the need for error microphones, which are typically required in ANC systems, by estimating the sound at the ear from an array of microphones distributed around the glasses frame. The use of a custom 3D-printed prototype and the integration of a low-latency DSP unit for real-time processing further enhance the practicality of the solution.
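For context, the classical adaptive-cancellation idea that this system builds on can be sketched with a single-channel normalized-LMS filter: predict the noise arriving at the ear from a reference microphone and emit the phase-inverted estimate. The paper's actual system replaces the error microphone with neural virtual sensing over an eight-microphone array; everything below (signal model, filter length, step size) is a toy illustration, not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n, taps, mu = 4000, 16, 0.5
ref = rng.standard_normal(n)              # reference microphone signal
path = rng.standard_normal(taps) * 0.3    # unknown acoustic path to the ear
ear = np.convolve(ref, path)[:n]          # noise actually arriving at the ear

w = np.zeros(taps)                        # adaptive estimate of the path
residual = np.zeros(n)                    # sound heard after cancellation
for t in range(taps, n):
    x = ref[t - taps + 1:t + 1][::-1]     # most recent reference samples
    anti = -(w @ x)                       # anti-noise = negated noise estimate
    e = ear[t] + anti                     # residual at the (virtual) ear
    residual[t] = e
    w += mu * e * x / (x @ x + 1e-8)      # NLMS update driven by the residual

before = np.mean(ear[-500:] ** 2)
after = np.mean(residual[-500:] ** 2)
print(10 * np.log10(before / after))      # attenuation in dB at convergence
```

The conceptual gap the paper closes is visible in the `e = ear[t] + anti` line: a real system has no microphone at `ear`, so that residual must itself be estimated, which is what the neural virtual-sensing stage provides.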
The experimental evaluation is robust, encompassing controlled benchtop tests on a mannequin head and real-world user studies across various environments. The authors demonstrate effective noise reduction performance, achieving a mean reduction of 9.6 dB without calibration and 11.2 dB with user-specific calibration. The study involved 11 participants and assessed performance across 8 different environments, showcasing the system's adaptability to diverse acoustic conditions. The use of both objective metrics (e.g., noise reduction levels) and subjective user ratings (e.g., clarity and intrusiveness) strengthens the evaluation.
The paper provides detailed descriptions of the hardware setup, including the specifications of the microphones and DSP units used, as well as the neural network architecture. However, the lack of a publicly accessible demo or project URL limits reproducibility. The authors do mention the use of a calibration procedure, which could be a barrier for replication without access to the same hardware setup.
Key limitations include the system's reduced performance in outdoor environments due to wind noise, which the authors acknowledge as a significant challenge for open-ear designs. Additionally, the reliance on a brief calibration procedure may not be feasible for all users, particularly if the glasses shift during extended wear. The neural network's filter update rate of 200 ms could also hinder responsiveness to rapid changes in the acoustic environment.
The potential applications of this research extend beyond smart glasses to other open-ear wearables, such as augmented and virtual reality headsets. The ability to enhance audio clarity in noisy environments could significantly improve user experience in various contexts, including professional training and everyday use. The findings could also inform future developments in auditory interfaces and personalized hearing assistance technologies.
Rapid advances in singing voice synthesis have increased unauthorized imitation risks, creating an urgent need for better Singing Voice Deepfake (SingFake) Detection, also known as SVDD. Unlike speech, singing contains complex pitch, wide dynamic range, and timbral variations. Conventional 16 kHz-sampled detectors prove inadequate, as they discard vital high-frequency information. This study presents the first systematic analysis of high-resolution (44.1 kHz sampling rate) audio for SVDD. We propose a joint fullband-subband modeling framework: the fullband captures global context, while subband-specific experts isolate fine-grained synthesis artifacts unevenly distributed across the spectrum. Experiments on the WildSVDD dataset demonstrate that high-frequency subbands provide essential complementary cues. Our framework significantly outperforms 16 kHz-sampled models, proving that high-resolution audio and strategic subband integration are critical for robust in-the-wild detection.
Primary: National Taiwan University
All Institutions: National Taiwan University, NVIDIA Taiwan
The main contribution of this paper is the introduction of a joint fullband-subband modeling framework for high-resolution SingFake detection, which significantly enhances detection performance by leveraging the unique characteristics of singing voice audio. The methodology is innovative and addresses a pressing need in the field of audio forensics, making it a valuable addition to the literature.
The paper introduces a novel joint fullband-subband modeling framework, Sing-HiResNet, which effectively captures both global and localized spectral features for high-resolution SingFake detection. The methodology is well-structured, employing a two-phase approach that integrates fullband and subband models, and explores various fusion strategies to enhance detection performance. The use of high-resolution audio (44.1 kHz) is a significant advancement over conventional methods, and the systematic evaluation of subband contributions adds depth to the methodology. However, the paper could benefit from clearer explanations of the fusion strategies and their implications.
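The joint fullband-subband idea can be illustrated with a minimal late-fusion sketch: split the spectrogram into frequency bands, score each with its own expert, and average with a fullband score. The real Sing-HiResNet experts are learned networks and the paper compares several fusion strategies; the band count, toy experts, and mean fusion below are assumptions for illustration only.

```python
import numpy as np

def subband_split(spec, n_bands=4):
    """Split a (frames, freq_bins) magnitude spectrogram into equal subbands."""
    return np.array_split(spec, n_bands, axis=1)

def fuse_scores(spec, fullband_expert, subband_experts):
    """Late fusion: average one global score with one score per subband expert,
    so artifacts concentrated in a single band still influence the decision."""
    bands = subband_split(spec, n_bands=len(subband_experts))
    scores = [fullband_expert(spec)] + [f(b) for f, b in zip(subband_experts, bands)]
    return float(np.mean(scores))

# Toy stand-ins: each "expert" maps its band to a scalar score via mean energy.
spec = np.abs(np.random.default_rng(1).standard_normal((100, 1024)))
experts = [lambda band: float(band.mean())] * 4
score = fuse_scores(spec, lambda s: float(s.mean()), experts)
print(score)
```

The motivation for the split is visible in the structure: a 44.1 kHz input contributes high-frequency bins that a 16 kHz model never sees, and a dedicated expert on those bins can weight them even when fullband statistics look unremarkable.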
The experiments are robust, utilizing the WildSVDD dataset to benchmark the proposed method against existing state-of-the-art systems. The results demonstrate a significant performance improvement over traditional 16 kHz models, achieving a state-of-the-art EER of 1.58%. The comparative analysis of different fusion strategies provides valuable insights into the effectiveness of the proposed approach. However, the paper lacks detailed statistical analysis of the results, which would strengthen the findings.
The paper provides a comprehensive description of the experimental setup, including dataset preparation, model architecture, and training procedures. However, it lacks a public code repository or demo URL, which would enhance reproducibility. The absence of shared resources limits the ability of other researchers to replicate the findings.
One limitation is the reliance on a single dataset (WildSVDD), which may not fully capture the diversity of real-world singing voice deepfakes. Additionally, while the paper discusses various fusion strategies, it does not explore the computational efficiency of these methods, which could be a concern for real-time applications. The authors could also provide more insights into the potential impact of noise and other artifacts in the audio data.
The research addresses a critical issue in the realm of audio synthesis and deepfake detection, with implications for copyright protection, content authenticity, and the broader field of audio forensics. The findings could inform future developments in anti-spoofing technologies and contribute to the establishment of standards for audio quality evaluation in deepfake detection.
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.
Primary: Communication University of China
All Institutions: Communication University of China, Ant Group, Chinese Academy of Sciences, Beijing Institute of Technology, Shanghai Jiao Tong University
The main contribution of this paper is the establishment of the AT-ADD Grand Challenge, which aims to enhance the robustness and generalization of audio deepfake detection systems across various audio types, thereby addressing critical gaps in current methodologies. This initiative is significant for its potential to drive forward research in audio forensics and improve the reliability of detection technologies in real-world applications.
The paper proposes a comprehensive evaluation framework for audio deepfake detection that includes two distinct tracks focusing on robustness in speech detection and generalization across all audio types. The methodology is well-structured, providing detailed descriptions of datasets, evaluation metrics, and baseline models, which are essential for fostering competitive research in the field. The emphasis on real-world applicability and the inclusion of diverse audio types are notable strengths.
The experimental design is robust, with extensive datasets constructed for both tracks, ensuring a fair evaluation of detection methods. The paper outlines the composition of training, development, and evaluation datasets, as well as the metrics used for performance assessment, which enhances the credibility of the challenge. However, no preliminary baseline results are reported, which would have strengthened the evaluation.
The paper emphasizes reproducibility by providing standardized datasets and baseline models, along with clear rules for participation in the challenge. However, the lack of detailed implementation specifics for the proposed models limits the ability for external researchers to replicate results fully.
The paper does not address potential biases in the datasets or the limitations of the proposed methods in handling extreme variations in audio quality or types beyond those specified. Additionally, the challenge's closed setting may restrict innovation by limiting the use of external data.
The proposed AT-ADD challenge has the potential to significantly advance the field of audio deepfake detection by encouraging the development of more robust and generalizable detection systems. This is crucial in an era where synthetic audio poses increasing security and trust challenges. The focus on diverse audio types also opens avenues for research in multimedia forensics and secure communication.
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Hello Group Inc.
The paper presents CapTalk, a novel framework for voice design that significantly enhances dialogue speech generation capabilities. The comprehensive evaluation and innovative methodologies contribute meaningfully to the field of controllable speech synthesis, addressing key challenges in expressive and context-aware voice generation.
The paper introduces CapTalk, a unified caption-conditioned text-audio autoregressive framework that innovatively extends voice design to both single-utterance and dialogue settings. The methodology includes a hierarchical variational conditioning module that effectively balances timbre preservation and contextual adaptation, which is a significant advancement over existing methods that primarily focus on single-utterance generation. The use of CoT control sequences for turn-level expressive control in dialogue is particularly noteworthy, as it allows for dynamic adjustments based on conversational context.
The experiments are comprehensive, demonstrating state-of-the-art performance on benchmarks for single-utterance voice design and improved expression controllability in dialogue settings. The authors employ both automatic and human evaluations, which adds robustness to their findings. The detailed evaluation protocol for dialogue generation is a valuable contribution, addressing gaps in existing benchmarks.
The paper lacks detailed implementation specifics that would enhance reproducibility, such as hyperparameters, training procedures, and data preprocessing steps. While the architecture is described, additional details on the training setup would be beneficial for other researchers looking to replicate or build upon this work.
The reliance on the quality of captions generated by Qwen3-Omni could introduce biases or inaccuracies, affecting the overall performance of the model. Additionally, the training data's focus on conversational speech may limit the model's expressive range compared to acted-style speech, which could be addressed in future work.
The advancements in voice design through CapTalk have the potential to significantly enhance human-computer interaction, making conversational agents more expressive and context-aware. This could lead to more natural and engaging user experiences in applications such as virtual assistants, gaming, and interactive storytelling.
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxa, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.
Primary: Muséum National d'Histoire Naturelle
All Institutions: Muséum National d'Histoire Naturelle, Sebitoli Chimpanzee Project, Uganda Wildlife Authority, Nitidae Association, Centre d'Ecologie et des Sciences de la Conservation, Institut de Systématique, Evolution, Biodiversité
The main contribution of this paper is the development of DeepForestSound (DFS), a multi-species automatic detection model that significantly enhances the capabilities of passive acoustic monitoring in African tropical forests. The innovative use of semi-supervised learning and LoRA-based fine-tuning addresses critical challenges in biodiversity monitoring, particularly for underrepresented taxa, thereby advancing the field of ecoacoustics and conservation technology.
The methodology employed in this study is robust and innovative, leveraging a semi-supervised clustering approach to generate labeled datasets from unannotated acoustic recordings. The use of Low-Rank Adaptation (LoRA) for fine-tuning the Audio Spectrogram Transformer (AST) is particularly noteworthy, as it allows for efficient adaptation to the specific acoustic characteristics of the target taxa in a data-scarce environment. The detailed description of the data collection process, including ethical considerations and the integration of multiple datasets, enhances the credibility of the study. However, the absence of a systematic sensitivity analysis for hyperparameters and the lack of an ablation study to isolate the contributions of different components are notable gaps.
The experimental evaluation is comprehensive, with a clear focus on assessing the model's performance across various taxa, particularly in the context of non-avian species where existing models typically underperform. The results demonstrate that DFS outperforms baseline models, particularly for primates and elephants, which are often neglected in general-purpose ecoacoustic models. The use of Average Precision (AP) and best F1 scores as evaluation metrics is appropriate for the task. However, the evaluation is limited to a single ecosystem, which may affect the generalizability of the findings.
The paper provides sufficient details regarding the training process, data preprocessing, and model architecture, which supports reproducibility. The code and pretrained models are made publicly available, which is a positive aspect for the research community. However, the specific configurations for hyperparameters and augmentation strategies could benefit from clearer documentation to facilitate replication.
The study has several limitations, including the focus on a single geographical area, which may restrict the applicability of the model to other tropical forest ecosystems. Additionally, while the model shows strong performance for the selected taxa, its ability to generalize to other species or soundscapes remains untested. The reliance on manual validation for the semi-supervised pipeline may introduce biases or inconsistencies in the labeled data.
The potential applications of DFS are significant, particularly in conservation efforts aimed at monitoring endangered species in tropical forests. By providing a practical tool for biodiversity assessment, DFS could facilitate more effective conservation strategies and contribute to the understanding of ecosystem dynamics. The study highlights the importance of tailored machine learning approaches in addressing specific ecological challenges, which could inspire further research in similar contexts.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.
Primary: NIO
All Institutions: NIO
The paper presents a principled framework for optimizing LLM-based ASR systems through entropy allocation, significantly enhancing performance while mitigating hallucinations. The comprehensive methodology and robust experimental results position this work as a meaningful advancement in the intersection of speech recognition and language modeling.
The paper introduces a novel perspective on entropy allocation in LLM-based ASR systems, proposing a multi-stage training paradigm that emphasizes capability-boundary awareness. The methodology is well-grounded in theoretical insights, particularly the use of entropy metrics (NSE, PAI, CSAI) to diagnose and optimize the interaction between speech encoders and LLMs. The iterative asynchronous SFT (IA-SFT) stage is a significant innovation that mitigates representation drift and enhances model robustness against hallucinations. The approach is systematic and addresses key challenges in ASR, such as efficiency and accuracy, making it a valuable contribution to the field.
The experiments are comprehensive, covering multiple benchmarks in both Mandarin and English, and demonstrate competitive performance against state-of-the-art models with significantly fewer parameters. The evaluation metrics (CER and WER) are appropriate for the tasks, and the results indicate that the proposed method not only achieves high accuracy but also effectively reduces hallucination rates. The empirical analysis of metric dynamics throughout the training stages provides strong evidence supporting the claims made about the benefits of the proposed methodology.
The paper provides detailed training setups, including data statistics, model architectures, and training configurations. However, the lack of publicly available code or a demo URL limits reproducibility. Future work could benefit from sharing the model and training scripts to allow other researchers to validate and build upon these findings.
While the paper presents a robust framework, it does not address the scalability of the proposed methods to larger datasets or more complex ASR tasks beyond the benchmarks used. Additionally, the reliance on specific metrics for entropy allocation may not capture all nuances of model performance in diverse real-world scenarios.
The findings have significant implications for the deployment of LLM-based ASR systems in industrial applications, particularly in enhancing efficiency and reducing operational costs. The approach could lead to more reliable speech recognition systems that are better suited for real-time applications, thereby improving user experiences across various domains such as customer service, transcription services, and accessibility technologies.
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. While training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is retained in the estimates, due to the inseparability of the background noises and the loss function's symmetry. To address this, we propose ring mixing, a batch strategy of using each source in two mixtures, alongside a new Signal-to-Consistency-Error Ratio (SCER) auxiliary loss penalizing inconsistent estimates of the same source from different mixtures, breaking symmetry and incentivizing denoising. On a WHAM!-based benchmark, our method can reduce residual noise by more than half, effectively learning to denoise from only noisy recordings. This opens the door to training more generalizable systems using in-the-wild data, which we demonstrate via systems trained using naturally-noisy speech from VoxCeleb.
Primary: Human Language Technology Center of Excellence, Johns Hopkins University
All Institutions: Human Language Technology Center of Excellence, Johns Hopkins University, Language Technologies Institute, Carnegie Mellon University
The main contribution of this paper is the introduction of a novel ring mixing strategy combined with a SCER auxiliary loss to improve unsupervised denoising in speech separation systems, significantly enhancing their performance on naturally noisy data. This work represents a meaningful advancement in the field, addressing critical challenges in speech processing and paving the way for more effective applications in real-world scenarios.
The proposed methodology introduces a novel ring mixing strategy and a Signal-to-Consistency-Error Ratio (SCER) auxiliary loss to improve unsupervised denoising in speech separation. This approach effectively addresses the limitations of traditional methods that rely on synthetic mixtures, providing a more robust framework for training on naturally noisy data. The methodology is well-structured, with a clear theoretical foundation that justifies the need for breaking symmetry in loss functions, and it demonstrates a thoughtful consideration of the challenges in separating overlapping speech signals.
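The symmetry-breaking idea can be sketched in a few lines. The ring-mixing arithmetic follows the abstract; the exact SCER formula, the function names, and the `eps` guard are our assumptions, not the paper's:

```python
import numpy as np

def ring_mixtures(sources):
    # Ring mixing: mixture i combines source i with source (i+1) mod N,
    # so every source appears in exactly two mixtures per batch.
    n = len(sources)
    return [sources[i] + sources[(i + 1) % n] for i in range(n)]

def scer_db(est_a, est_b, eps=1e-8):
    # Hypothetical Signal-to-Consistency-Error Ratio: energy of one estimate
    # over the energy of the disagreement between the two estimates of the
    # same source, in dB. Retained mixture noise differs across the two
    # mixtures, so penalizing inconsistency incentivizes denoising.
    err = est_a - est_b
    return 10 * np.log10((np.sum(est_a ** 2) + eps) / (np.sum(err ** 2) + eps))
```

Maximizing SCER (e.g. by adding its negative as an auxiliary term) rewards agreement between the two extractions of the same source, which a noise-retaining solution cannot achieve.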
The experiments are comprehensive, utilizing both WHAM! and VoxCeleb datasets to validate the proposed method. The results show significant improvements in SI-SDR metrics, indicating effective denoising capabilities. The use of various noise levels and the analysis of occupancy metrics provide a thorough evaluation of the method's performance across different conditions. However, the paper lacks detailed comparisons with state-of-the-art methods, which could strengthen the claims of superiority.
The paper outlines the experimental setup, including model architecture and training configurations, but does not provide code or data access, which limits reproducibility. Clearer documentation of hyperparameters and training procedures would enhance reproducibility efforts.
One limitation is the potential degradation in performance on noiseless conditions, suggesting that the method may not generalize well across all scenarios. Additionally, the reliance on specific datasets may limit the applicability of the findings to broader contexts. The lack of a detailed exploration of the effect of varying the SCER weight on performance could also be seen as a gap.
The findings have significant implications for real-world applications, particularly in enhancing speech recognition systems in noisy environments. The ability to train models on naturally noisy data could lead to more robust and generalizable speech separation systems, benefiting various fields such as telecommunications, assistive technologies, and automated transcription services.
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose \textbf{TASU2}, a controllable CTC simulation framework that generates CTC posterior distributions within a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, AISpeech Ltd, Nanjing University
The main contribution of this paper is the introduction of TASU2, a controllable CTC simulation framework that enhances speech LLM post-training through improved alignment and low-resource adaptation. This work presents a significant advancement in the field of speech recognition, particularly for scenarios lacking extensive audio-text pairs, and demonstrates the potential for more effective and efficient training methodologies.
The methodology of TASU2 is well-structured, introducing a novel framework for generating controllable CTC posteriors from text. The use of WER conditioning to simulate CTC posteriors is innovative, addressing the limitations of previous methods like TASU. The architecture leverages a Transformer-based model, which is appropriate for the task, and the training process is clearly defined, including the use of distribution-level supervision. The paper effectively bridges the gap between text-derived supervision and acoustic decoding, enhancing the potential for low-resource adaptation.
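As a toy illustration of WER-conditioned supervision, one can corrupt a transcript so that the expected word error rate matches a target before deriving CTC-style targets from it. The three-way error split, the default seed, and all names below are our assumptions, not the paper's recipe:

```python
import random

def corrupt_to_wer(tokens, vocab, target_wer, rng=None):
    # Draw an error at each position with probability target_wer,
    # split evenly across substitution, deletion, and insertion,
    # so the expected edit rate roughly matches the target WER.
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < target_wer / 3:            # substitution
            out.append(rng.choice(vocab))
        elif r < 2 * target_wer / 3:      # deletion
            continue
        elif r < target_wer:              # insertion (keep token, add one)
            out.append(tok)
            out.append(rng.choice(vocab))
        else:                             # no error
            out.append(tok)
    return out
```

Sweeping `target_wer` from low to high would then give the smoothly varying supervision difficulty the abstract describes.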
The experiments are comprehensive, covering various adaptation scenarios and demonstrating clear improvements over the baseline TASU and other strong methods. The evaluation metrics are appropriate, and the results are presented clearly, showing consistent gains in both in-domain and out-of-domain recognition. However, the paper could benefit from more detailed comparisons with additional state-of-the-art methods to further validate the effectiveness of TASU2.
The paper provides sufficient details regarding the architecture and training process, which should allow for reproducibility. However, the absence of a publicly available code repository or demo limits the practical reproducibility of the results. Including a link to a GitHub repository or similar would enhance the paper's impact.
One limitation is the reliance on a single dataset (LibriSpeech) for training and evaluation, which may affect the generalizability of the results. Additionally, while the WER conditioning is a significant improvement, the method may still struggle with extreme cases of noise or distortion in real-world applications. The paper does not address potential scalability issues when applied to larger datasets or more complex tasks.
The proposed TASU2 framework has significant implications for low-resource speech recognition, particularly in domains where paired audio-text data is scarce. The ability to simulate CTC posteriors with controlled error rates could facilitate advancements in speech technology for various applications, including medical transcription and assistive technologies for individuals with speech impairments. This work could lead to more robust and adaptable speech recognition systems in real-world scenarios.
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2\% to 37.4\% for video-to-audio retrieval and from 27.9\% to 37.1\% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
Primary: Unknown
All Institutions: Unknown
The paper presents a novel Teacher-Guided Dual-Path framework for audio-visual representation learning, significantly improving state-of-the-art performance in zero-shot retrieval tasks. The comprehensive methodology and experimental validation highlight its potential impact on the field, addressing critical challenges in cross-modal alignment and semantic noise reduction.
The proposed TG-DP framework effectively decouples the objectives of masked reconstruction and contrastive learning into separate optimization paths. This dual-path approach allows for tailored visibility patterns that enhance cross-modal alignment while mitigating semantic noise and optimization interference. The introduction of a teacher-student mechanism further enriches the training process by providing structured guidance, which is a noteworthy advancement in the field. The methodology is well-structured and addresses existing challenges in audio-visual representation learning.
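The decoupling can be pictured as two independent masks over the same token grid. The ratios, the seed, and the names are illustrative assumptions; TG-DP's actual masking strategies (including the teacher guidance) are more structured:

```python
import numpy as np

def dual_path_masks(num_tokens, recon_ratio=0.75, align_ratio=0.25, seed=0):
    # Reconstruction branch: aggressive random masking (True = masked).
    # Contrastive branch: its own, lighter visibility pattern, so alignment
    # no longer depends on patches chosen for reconstruction.
    rng = np.random.default_rng(seed)
    recon_mask = rng.random(num_tokens) < recon_ratio
    align_mask = rng.random(num_tokens) < align_ratio
    return recon_mask, align_mask
```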
The experiments are comprehensive, utilizing large-scale datasets such as AudioSet-2M and VGGSound. The results demonstrate significant improvements in zero-shot retrieval performance, achieving state-of-the-art results across various metrics. The ablation studies provide valuable insights into the effectiveness of the proposed components, such as the dual-path structure and teacher-guided masking strategy. However, the paper could benefit from more detailed comparisons with additional baselines to further validate the claims.
The paper provides a clear description of the methodology and experimental setup, including hyperparameters and data preprocessing steps. The availability of code on GitHub enhances reproducibility. However, the lack of detailed information on the training environment and specific configurations may pose challenges for complete replication.
The primary limitation is the unknown primary institution and the lack of citation context, which may hinder the paper's visibility and impact in the academic community. Additionally, the performance improvements, while significant, may still be context-dependent and require further validation across diverse tasks and datasets.
The advancements in audio-visual representation learning have the potential to enhance various applications, including multimedia retrieval, content-based recommendation systems, and interactive AI systems. The proposed framework could lead to more robust models that understand and integrate audio-visual information, paving the way for future research and applications in multimodal AI.
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
Primary: Shanghai Jiao Tong University
All Institutions: EPIC Lab, Shanghai Jiao Tong University, Xidian University, HKUST (GZ)
The main contribution of this paper is the introduction of AudioKV, a novel framework for efficient KV cache management in audio-language models, which significantly enhances performance while reducing memory usage. This work represents a meaningful advancement in the field of audio processing, demonstrating innovative methodologies that address critical challenges in deploying large-scale models effectively.
The proposed methodology, AudioKV, innovatively addresses the inefficiencies of Key-Value (KV) cache management in Large Audio-Language Models (LALMs) by introducing a dual mechanism: audio-aware KV cache allocation and Spectral Score Smoothing (SSS). The former identifies and prioritizes audio-critical attention heads based on their relevance to acoustic modeling, while the latter employs a frequency-domain approach to stabilize importance score estimation. This dual approach is particularly effective in the audio domain, where temporal continuity is crucial, showcasing a thoughtful adaptation of existing techniques to a new modality.
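The SSS idea, low-pass filtering the per-token importance scores in the frequency domain, can be sketched as follows; the cutoff heuristic and the names are our assumptions:

```python
import numpy as np

def spectral_score_smoothing(scores, keep_ratio=0.05):
    # FFT the importance-score sequence, zero the high-frequency bins,
    # and invert: high-frequency noise is suppressed and the smooth
    # global trend is recovered before token selection.
    spec = np.fft.rfft(scores)
    cutoff = max(1, int(len(spec) * keep_ratio))
    spec[cutoff:] = 0.0
    return np.fft.irfft(spec, n=len(scores))
```

Selecting tokens by the smoothed scores rather than the raw ones avoids keeping isolated spikes while dropping entire important regions.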
The experiments conducted across various benchmarks, including Automatic Speech Recognition (ASR) and Speech Translation (ST), demonstrate that AudioKV significantly outperforms existing methods, particularly under aggressive compression scenarios. The results indicate not only improved accuracy but also enhanced robustness against performance degradation, which is critical for practical applications. The use of diverse datasets strengthens the validity of the findings, although the paper could benefit from more extensive comparisons with a broader range of state-of-the-art methods.
The paper mentions that the code will be released upon acceptance, which is a positive aspect for reproducibility. However, the lack of a demo URL or a project repository at this stage limits immediate access to the implementation details. The methodology is described in sufficient detail to allow for replication, but actual code availability will be crucial for broader adoption and validation of the results.
One limitation is the potential for overfitting to specific datasets, as the performance improvements are primarily demonstrated on selected benchmarks. Additionally, while the method shows promise in maintaining accuracy at high compression ratios, the paper does not thoroughly explore the trade-offs involved in different compression strategies or the impact on latency and real-time processing capabilities.
The implications of this work extend to various applications in speech processing and multimodal AI systems, where efficient inference is paramount. By improving the efficiency of LALMs, this research could facilitate the deployment of advanced audio processing systems in resource-constrained environments, such as mobile devices or real-time applications.
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extract the interfering speaker. Furthermore, conventional TSE relies on a static inference pipeline, where performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios. Our code and checkpoints are available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Nanjing University, Huawei Technologies Co., Ltd.
The main contribution of this paper is the introduction of the EvoTSE framework, which significantly enhances target speaker extraction by evolving the enrollment process to adapt to dynamic vocal characteristics, thereby reducing speaker confusion and improving extraction quality. The methodology is innovative, addressing critical challenges in the field, and the experimental results validate its effectiveness across multiple benchmarks.
The proposed EvoTSE framework innovatively transitions from a static to an evolving target speaker extraction (TSE) pipeline, addressing critical issues of speaker confusion and enrollment quality. By continuously updating the enrollment through a reliability-filtered retrieval mechanism, the framework effectively adapts to dynamic vocal characteristics over time. The architecture integrates multiple specialized components, including a contextual retriever and a reliability classifier, which enhance the robustness of the extraction process. The methodology is well-structured, leveraging historical context to improve speaker identification and mitigate confusion, which is a significant advancement in the field.
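A minimal sketch of reliability-filtered enrollment updating, assuming a confidence-gated memory bank with mean pooling; the threshold, capacity, and class names are our assumptions, not EvoTSE's actual components:

```python
import numpy as np

class EvolvingEnrollment:
    # Keep only high-confidence historical estimates in a memory bank
    # and refresh the enrollment embedding from them, instead of
    # trusting a single fixed recording.

    def __init__(self, init_embedding, threshold=0.8, capacity=16):
        self.bank = [np.asarray(init_embedding, dtype=float)]
        self.threshold = threshold
        self.capacity = capacity

    def update(self, est_embedding, confidence):
        # A reliability score (e.g. from a classifier) gates admission,
        # so confused or noisy estimates never pollute the enrollment.
        if confidence >= self.threshold:
            self.bank.append(np.asarray(est_embedding, dtype=float))
            self.bank = self.bank[-self.capacity:]

    def enrollment(self):
        # Current enrollment = mean over the retained reliable estimates.
        return np.mean(self.bank, axis=0)
```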
The experiments conducted across multiple benchmarks, including WSJ0-2mix, ESD-test, and Libri2mix-clean, demonstrate the effectiveness of EvoTSE in both in-domain and out-of-domain scenarios. The results show consistent improvements in performance metrics such as SI-SDRi and NSR, particularly in challenging emotional variations. The ablation studies provide further insights into the contributions of different components, validating the necessity of the evolving enrollment strategy. The comprehensive evaluation across diverse datasets enhances the credibility of the findings.
The authors have provided a GitHub repository with code and checkpoints, which is essential for reproducibility. The paper details the model configuration, training pipeline, and evaluation metrics, allowing other researchers to replicate the experiments. However, the absence of a live demo or interactive examples limits immediate accessibility to the framework's capabilities.
While the EvoTSE framework shows promising results, it may still be sensitive to the initial quality of the enrollment, particularly in highly variable emotional contexts. The reliance on historical estimates could introduce noise if not managed properly, potentially leading to performance degradation. Additionally, the computational complexity of maintaining an evolving memory bank may pose challenges in real-time applications.
The implications of this work extend to various applications in audio processing, such as voice assistants, transcription services, and any scenario requiring robust speaker identification in multi-talker environments. By improving the reliability of speaker extraction, the framework could enhance user experience in interactive systems and contribute to advancements in audio analysis and understanding.
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, hindering low-resource languages from reaching high-resource performance. To address this, we propose a semi-supervised framework based on Semantic-Emotional Resonance Embedding (SERE), a cross-lingual dynamic feature paradigm that requires neither target language labels nor translation alignment. Specifically, SERE constructs an emotion-semantic structure using a small number of labeled samples. It learns human emotional experiences through an Instantaneous Resonance Field (IRF), enabling unlabeled samples to self-organize into this structure. This achieves semi-supervised semantic guidance and structural discovery. Additionally, we design a Triple-Resonance Interaction Chain (TRIC) loss to enable the model to reinforce the interaction and embedding capabilities between labeled and unlabeled samples during emotional highlights. Extensive experiments across multiple languages demonstrate the effectiveness of our method, requiring only 5-shot labeling in the source language.
Primary: Xinjiang University
All Institutions: Xinjiang University, Pengcheng Laboratory Xinjiang Network Node, Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center, Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing
The paper presents a semi-supervised framework for cross-lingual speech emotion recognition that leverages emotional resonance without requiring extensive labeled data. This innovative approach addresses significant challenges in the field and demonstrates promising results across multiple languages, although it could benefit from improved reproducibility and detailed evaluation metrics.
The proposed Semantic-Emotional Resonance Embedding (SERE) framework introduces a novel semi-supervised approach to Cross-Lingual Speech Emotion Recognition (CLSER) that does not require target language labels or translation alignment. The use of an Instantaneous Resonance Field (IRF) to capture emotional highlights and the Triple-Resonance Interaction Chain (TRIC) loss to enhance interactions between labeled and unlabeled samples are significant methodological advancements. The dual-path architecture effectively combines labeled and unlabeled data streams, allowing for dynamic feature extraction that captures the transient nature of emotional expression in speech.
The experiments span multiple languages and tasks, demonstrating the effectiveness of the SERE framework with only 5-shot labeling. The paper presents a thorough evaluation against state-of-the-art methods, showcasing superior performance across various tasks. The use of diverse datasets strengthens the findings, although the specific metrics for evaluation could be more detailed to enhance clarity on performance comparisons.
While the paper outlines the architecture and methodology, it lacks specific implementation details such as hyperparameter settings, training procedures, and code availability, which are crucial for reproducibility. The absence of a project URL further complicates this aspect, as external researchers cannot easily access the code or datasets used.
The paper acknowledges challenges related to linguistic differences in emotional expression, which can affect performance. Additionally, the reliance on a limited number of labeled samples may not generalize well to all low-resource languages. The method's performance could also be influenced by the quality and diversity of the training data, which is not extensively discussed.
This research has the potential to significantly impact the field of emotion recognition in multilingual contexts, particularly for low-resource languages. By enabling effective emotion recognition without extensive labeled data, it could facilitate applications in areas such as human-computer interaction, mental health monitoring, and cross-cultural communication.
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established real-time environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
Primary: University of California San Diego
All Institutions: University of California San Diego
The main contribution of this paper is the development of a real-time human-AI musical co-performance system that effectively generates instrumental accompaniment using latent diffusion models, addressing critical latency challenges while maintaining musical coherence and quality. This work significantly advances the field of AI-driven music generation by providing a practical solution for live performance contexts, showcasing the potential for AI to enhance creative collaboration in music.
The paper introduces a novel framework for real-time human-AI musical co-performance using latent diffusion models (LDMs) integrated with MAX/MSP for low-latency audio processing. The methodology is well-articulated, detailing a sliding-window look-ahead protocol that enables the model to generate audio segments ahead of playback, thus addressing the latency challenges inherent in real-time music generation. The use of consistency distillation to enhance inference speed is particularly noteworthy, as it allows the model to maintain real-time capabilities while generating high-quality audio. The integration of a model-agnostic MAX/MSP external and a ready-to-use performance patch further enhances the practical applicability of the research.
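The latency constraint behind the sliding-window protocol can be made concrete with a small budget check. This is a sketch under our own assumptions; the window lengths and timings in the example are illustrative, not the paper's figures:

```python
import math

def min_lookahead_windows(window_sec, inference_sec):
    # Generation must run far enough ahead that each window is finished
    # before playback reaches it: inference for one window must complete
    # within the audio time covered by the look-ahead. Deeper look-ahead
    # buys latency headroom at the cost of conditioning on staler context
    # (hence the observed quality degradation as look-ahead increases).
    return max(1, math.ceil(inference_sec / window_sec))
```

Under this view, a 5.4x reduction in sampling time from consistency distillation directly shrinks `inference_sec`, permitting a shallower look-ahead for the same real-time guarantee.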
The experiments are thorough, utilizing the Slakh2100 dataset and comparing the proposed models against established baselines such as StreamMusicGen. The evaluation metrics employed, including COCOLA scores for musical coherence, Beat F1 scores for rhythmic alignment, and Fréchet Audio Distance (FAD) for audio quality, provide a comprehensive assessment of the models' performance across different look-ahead configurations. The results demonstrate that the proposed models perform competitively, especially in the Look-ahead regime, indicating their effectiveness in real-time scenarios.
The authors provide detailed implementation information, including model architecture, training procedures, and evaluation metrics, which enhances reproducibility. The availability of code repositories and pre-trained model checkpoints further supports this aspect, allowing other researchers to replicate the study and build upon the findings.
One limitation is the reliance on a specific dataset (Slakh2100), which may not fully represent the diversity of musical styles and contexts encountered in real-world applications. Additionally, while the paper addresses latency effectively, the trade-offs between look-ahead depth and generation quality may still pose challenges in more complex musical scenarios. The subjective evaluation of generated music quality could also benefit from more extensive human listener studies.
The framework developed in this paper has significant implications for the future of human-AI collaboration in music performance, potentially transforming how musicians interact with AI systems in live settings. By bridging the gap between advanced generative models and real-time performance environments, this research opens avenues for innovative musical expressions and collaborative practices.
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where masking and predictive modeling are performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. Predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss function, forcing the representations to encode both time and frequency traits. Experiments on a variety of speech and audio tasks demonstrate that the ULTRAS framework achieves improved performance over other established baselines.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science
The paper presents ULTRAS, a self-supervised learning framework that effectively integrates time-frequency modeling for audio and speech signals. The innovative approach and comprehensive evaluation demonstrate its potential to improve performance across diverse tasks, marking a significant advancement in the field of audio representation learning.
The proposed ULTRAS framework introduces a novel approach to self-supervised learning in audio and speech processing by integrating long-context masking and joint predictive modeling of spectral and temporal features. The methodology effectively addresses the limitations of existing models by allowing for the simultaneous encoding of time and frequency traits, which is crucial for diverse audio tasks. The use of transformer architecture and a combined loss function is well-justified, and the masking strategy is innovative in its focus on longer audio segments, enhancing the model's ability to capture contextual dependencies.
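A minimal pure-Python sketch of the long-patch masking and combined spectral/temporal loss described above; the span length, the loss weight `lam`, and the use of per-band averages as the spectral target are illustrative assumptions, not ULTRAS's exact formulation:

```python
def long_patch_mask(n_frames, span=20, start=40):
    # Mask one long contiguous span of frames (long-patch masking).
    return [start <= i < start + span for i in range(n_frames)]

def combined_masked_loss(pred, target, mask, lam=0.5):
    # pred/target: T x F log-mel frames (lists of lists); mask: per-frame booleans.
    rows = [i for i, m in enumerate(mask) if m]
    n_freq = len(pred[0])
    # Temporal term: frame-wise squared error over the masked frames.
    temporal = sum((pred[i][f] - target[i][f]) ** 2
                   for i in rows for f in range(n_freq)) / (len(rows) * n_freq)
    # Spectral term: squared error between per-band averages over the masked span.
    p_prof = [sum(pred[i][f] for i in rows) / len(rows) for f in range(n_freq)]
    t_prof = [sum(target[i][f] for i in rows) / len(rows) for f in range(n_freq)]
    spectral = sum((p - t) ** 2 for p, t in zip(p_prof, t_prof)) / n_freq
    return temporal + lam * spectral
```

Coupling the two terms in one scalar loss is what forces a single representation to carry both time and frequency traits.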
The paper presents a comprehensive evaluation across various speech and audio tasks, demonstrating significant improvements over established baselines. The experiments are well-structured, utilizing both small and large datasets to validate the model's performance. The results are clearly presented, with comparisons against multiple state-of-the-art frameworks, showcasing the effectiveness of the proposed approach. However, the paper could benefit from more detailed ablation studies to further elucidate the contributions of individual components.
The implementation details are sufficiently described, including the model architecture, training procedures, and evaluation metrics. However, the lack of a publicly available code repository limits the reproducibility of the results. Providing access to the model and datasets would enhance the paper's impact and facilitate further research.
One limitation is the reliance on a relatively smaller dataset compared to some state-of-the-art models, which may affect the generalizability of the results. Additionally, while the framework shows promise, the paper does not extensively explore the scalability of the approach to larger and more diverse datasets.
The ULTRAS framework has the potential to significantly advance the field of audio and speech processing by providing a unified model that can be applied across various tasks. Its implications extend to real-world applications such as speech recognition, emotion detection, and environmental sound classification, making it a valuable contribution to the domain. The paper presents ULTRAS, a self-supervised learning framework that effectively integrates time-frequency modeling for audio and speech signals. The innovative approach and comprehensive evaluation demonstrate its potential to improve performance across diverse tasks, marking a significant advancement in the field of audio representation learning.
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas, such as noise or distortion in the background, dynamically adjusting its attention over time. Inspired by the recent success of attention models, this study introduces a dual-path attention module in the bottleneck layer of a concurrent speech enhancement network. We propose an attention-based dual-path RNN (DAT-RNN), which, when combined with the modified complex-valued frequency transformation network (CFTNet), forms the DAT-CFTNet. This attention mechanism allows for precise differentiation between speech and noise in time-frequency (T-F) regions of spectrograms, optimizing both local and global context processing in the CFTNet. Our experiments suggest that the DAT-CFTNet consistently outperforms existing models, including CFTNet and DCCRN, in terms of speech intelligibility and quality. Moreover, the proposed model is especially effective at enhancing speech intelligibility for cochlear implant (CI) recipients, who are known to have severely limited T-F hearing restoration (e.g., >10%). CI listener studies in noisy settings show that the proposed solution suppresses non-stationary noise while avoiding the musical artifacts often seen in traditional speech enhancement methods. The implementation of the proposed model will be publicly available.
Primary: Chittagong University of Engineering and Technology
All Institutions: Chittagong University of Engineering and Technology
The main contribution of this research is the introduction of the DAT-CFTNet, which effectively enhances speech intelligibility for cochlear implant users through an innovative dual-path attention mechanism. This work represents a significant step forward in speech enhancement technologies, particularly in challenging acoustic environments.
The proposed methodology introduces a novel dual-path attention mechanism integrated into a complex-valued frequency transformation network (CFTNet), which is a significant advancement in the field of speech enhancement, particularly for cochlear implant users. The combination of intra-chunk and inter-chunk RNNs with attention modules allows for enhanced modeling of speech and noise dynamics in time-frequency representations. The detailed architecture and the rationale behind the design choices are well articulated, showcasing a thoughtful approach to addressing the limitations of existing models.
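The intra-/inter-chunk split underlying such dual-path processing can be sketched as follows; the chunk size and zero-padding scheme are assumptions for illustration, not the paper's exact configuration:

```python
def segment(frames, chunk):
    # Split a T-frame sequence into non-overlapping chunks (zero-pad the tail).
    dim = len(frames[0])
    pad = (-len(frames)) % chunk
    frames = frames + [[0.0] * dim] * pad
    return [frames[i:i + chunk] for i in range(0, len(frames), chunk)]

def inter_chunk_view(chunks):
    # Transpose so a model can scan position j across all chunks (inter-chunk
    # path); the original chunk list is the intra-chunk path.
    return [[chunk[j] for chunk in chunks] for j in range(len(chunks[0]))]
```

An intra-chunk RNN would run over each element of `segment(...)`, and an inter-chunk RNN over each element of `inter_chunk_view(...)`, giving local and global context respectively.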
The experiments are robust, employing a comprehensive dataset that includes various noise conditions and SNR levels. The evaluation metrics used (STOI, PESQ, SISDR) are appropriate for assessing speech intelligibility and quality. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and a discussion on the statistical significance of the results.
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of a specific dataset and the architecture of the model, there are no code repositories or links to a demo that would allow other researchers to replicate the findings. Providing access to the model and training scripts would greatly enhance reproducibility.
One limitation is the reliance on objective metrics without a thorough subjective evaluation involving human listeners. While objective scores are important, subjective assessments are crucial for applications in speech enhancement, especially for cochlear implant users. Additionally, the model's complexity may limit its applicability in real-time scenarios, which is a critical factor for practical implementations.
The proposed DAT-CFTNet has the potential to significantly improve the quality of life for cochlear implant recipients by enhancing speech intelligibility in noisy environments. This advancement could lead to better communication and social interactions for individuals with hearing impairments. The public availability of the model also encourages further research and development in the field.
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
Primary: Georgia Institute of Technology
All Institutions: Georgia Institute of Technology, University of Amsterdam
This paper presents a novel automatic framework for detecting speaker drift in synthesized speech, bridging geometric signal analysis with LLM-based perceptual reasoning. The comprehensive methodology, combined with strong experimental validation, positions this work as a significant contribution to the field of audio and speech synthesis, addressing a critical challenge in TTS systems.
The proposed methodology introduces a novel framework for detecting speaker drift in synthesized speech by formulating it as a binary classification task. The use of cosine similarity to assess speaker identity consistency is theoretically grounded, and the integration of large language models (LLMs) for perceptual reasoning is innovative. The construction of a synthetic benchmark dataset with human-validated annotations further strengthens the methodology, allowing for systematic evaluation of the proposed approach. However, the reliance on synthetic data may limit the generalizability of the findings.
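The segment-wise cosine check at the core of this pipeline can be sketched as below; the threshold value and the simple any-adjacent-pair rule are hypothetical simplifications of the paper's classifier, which additionally feeds structured representations to an LLM:

```python
import math

def cosine(u, v):
    # Cosine similarity between two speaker embeddings.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def detect_drift(seg_embeddings, threshold=0.85):
    # Flag an utterance as drifting if any adjacent pair of overlapping-segment
    # speaker embeddings falls below the similarity threshold.
    sims = [cosine(seg_embeddings[i], seg_embeddings[i + 1])
            for i in range(len(seg_embeddings) - 1)]
    return any(s < threshold for s in sims)
```

Because speaker embeddings are typically length-normalized, the cosine values live on the unit sphere, which is what makes the geometric clustering argument in the paper possible.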
The experimental setup is robust, utilizing a well-defined dataset and comparing the proposed method against fixed-threshold and PCA-based baselines. The results demonstrate a significant improvement in performance metrics (F1 score) when using the LLM-driven approach, indicating the effectiveness of the proposed method. The ablation studies provide valuable insights into the impact of different design choices on performance, reinforcing the validity of the findings.
While the paper provides a detailed description of the methodology and experimental setup, the absence of a publicly available code repository or dataset limits reproducibility. Future work should include making the dataset and code accessible to facilitate further research in this area.
One notable limitation is the reliance on synthetic data for training and evaluation, which may not fully capture the complexities of real-world speaker drift scenarios. Additionally, the framework's performance may vary with different TTS models, and further validation on diverse datasets is needed to establish its robustness.
The detection of speaker drift has significant implications for improving the quality and coherence of synthesized speech in various applications, including virtual assistants and interactive dialogue systems. By addressing this underexplored issue, the work contributes to enhancing user experience in TTS systems, paving the way for more reliable and natural-sounding synthetic speech.
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025): a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48 kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University, Fudan University, Wheatland Culture and Media Ltd.
This paper presents a significant advancement in controllable singing style conversion through innovative methodologies that address key challenges in the field. The combination of a boundary-aware semantic bottleneck, explicit technique control, and high-frequency band completion strategies demonstrates a comprehensive approach to improving the quality and fidelity of singing voice conversion systems.
The proposed methodology introduces a boundary-aware semantic bottleneck that effectively mitigates style leakage in singing voice conversion, which is a significant challenge in the field. The explicit frame-level technique matrix enhances control over dynamic styles, while the high-frequency band completion strategy addresses data scarcity issues. The integration of these components demonstrates a thoughtful approach to improving the quality and fidelity of converted singing voices, making the methodology both innovative and practical.
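The boundary-aware pooling step can be sketched as span-wise averaging of encoder frames; the `(start, end)` span format and mean pooling are assumptions about the mechanism described above, not the system's exact implementation:

```python
def pool_phoneme_spans(frames, spans):
    # frames: T x D encoder outputs (lists of lists); spans: (start, end) frame
    # index pairs, one per phoneme, end exclusive. Averaging within each span
    # keeps linguistic content while washing out frame-level style detail.
    pooled = []
    for start, end in spans:
        seg = frames[start:end]
        dim = len(seg[0])
        pooled.append([sum(f[d] for f in seg) / len(seg) for d in range(dim)])
    return pooled
```

The intuition is that residual source style lives in intra-phoneme dynamics, so collapsing each span to one vector suppresses it.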
The experimental evaluation is robust, utilizing subjective metrics such as Mean Opinion Score (MOS) to assess naturalness and similarity, which are critical for audio applications. The results indicate that the proposed system outperforms other submissions in naturalness while maintaining competitive performance in speaker similarity and technique control. The ablation studies further validate the effectiveness of the proposed methods, providing a clear understanding of their contributions.
The paper includes sufficient implementation details and provides a GitHub repository for code access, which enhances reproducibility. The use of standard datasets and well-defined training protocols also supports the replicability of the results.
One limitation is the reliance on the official SVCC2025 dataset, which may not generalize well to other datasets or real-world applications. Additionally, while the system achieves high naturalness, there is a noted gap in identity similarity compared to top-performing systems that utilized larger external datasets.
The advancements in controllable singing style conversion have significant implications for music production, voice synthesis, and entertainment industries. The ability to manipulate singing styles with high fidelity can enhance creative expression and provide new tools for artists and producers.
Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality weights. However, they usually compress diverse sentiment cues into a single compact representation before sentiment reasoning. This early aggregation makes it difficult to preserve the internal structure of sentiment evidence, where different cues may complement, conflict with, or differ in reliability from each other. In addition, modality importance is often determined only once during fusion, so later reasoning cannot further adjust modality contributions. To address these issues, we propose PRISM, a framework that unifies structured affective extraction and adaptive modality evaluation. PRISM organizes multimodal evidence in a shared prototype space, which supports structured cross-modal comparison and adaptive fusion. It further applies dynamic modality reweighting during reasoning, allowing modality contributions to be continuously refined as semantic interactions become deeper. Experiments on three benchmark datasets show that PRISM outperforms representative baselines.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Zhongguancun Academy
The main contribution of this paper is the PRISM framework, which innovatively organizes multimodal sentiment evidence into shared prototypes, allowing for structured extraction and adaptive evaluation of modality contributions. This work significantly advances the field of multimodal sentiment analysis by addressing key limitations in existing approaches and demonstrating robust performance across multiple datasets.
The proposed PRISM framework introduces a novel approach to multimodal sentiment analysis by utilizing shared sentiment prototypes that facilitate structured extraction and adaptive modality evaluation. This methodology effectively addresses the limitations of early aggregation in existing systems by maintaining the internal structure of sentiment evidence across modalities. The dynamic modality reweighting during reasoning further enhances the model's ability to adaptively refine modality contributions, making the approach both innovative and practical.
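A toy sketch of prototype-space scoring with confidence-driven reweighting; the dot-product affinity and softmax-over-confidence weighting are illustrative stand-ins for PRISM's actual modules:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def prototype_scores(feature, prototypes):
    # Dot-product affinity of one modality's feature to each shared prototype.
    return [sum(a * b for a, b in zip(feature, p)) for p in prototypes]

def reweighted_fusion(per_modality_scores, confidences):
    # One reweighting step: modality weights come from confidence scores and,
    # unlike one-shot fusion, could be recomputed at every reasoning step.
    w = softmax(confidences)
    k = len(per_modality_scores[0])
    return [sum(w[m] * per_modality_scores[m][j]
                for m in range(len(per_modality_scores))) for j in range(k)]
```

Keeping per-modality, per-prototype scores around (rather than one fused vector) is what preserves the internal structure of the sentiment evidence for later comparison.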
The experiments conducted on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and CH-SIMS) demonstrate the effectiveness of the PRISM framework, showcasing significant performance improvements over various baselines. The use of ablation studies provides a clear understanding of the contributions of each component, reinforcing the robustness of the proposed methodology.
The paper provides sufficient implementation details, including hyperparameters and training configurations, which enhances reproducibility. The availability of the code on GitHub further supports this aspect, allowing other researchers to replicate the experiments and validate the findings.
While the framework shows promising results, it may still face challenges in handling highly noisy data or extreme cases where modalities conflict significantly. Additionally, the reliance on pre-extracted features may limit the model's adaptability to different data sources or domains.
The advancements in multimodal sentiment analysis have significant implications for various applications, including human-computer interaction, affective computing, and content understanding. By improving the accuracy of sentiment prediction from multimodal data, this research can enhance user experience in applications such as virtual assistants, social media analysis, and video content evaluation.
In biometric systems, it is common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks have been recognized as potential security risks for biometric systems. However, most research on morph attacks has focused on biometric modalities that operate within the image domain, such as the face, fingerprints, and iris. In this work, we introduce Time-domain Voice Identity Morphing (TD-VIM), a novel approach for voice-based biometric morphing. This method blends voice characteristics from two distinct identities at the signal level, creating morphed samples that pose a serious threat to speaker verification systems. Leveraging the Multilingual Audio-Visual Smartphone (MAVS) database, we created four distinct morphed signals using different morphing factors and evaluated their effectiveness through a comprehensive vulnerability analysis. To assess the security impact of TD-VIM, we benchmarked our approach using the Generalized Morphing Attack Potential (G-MAP) metric, measuring attack success across two deep-learning-based Speaker Verification Systems (SVS) and one commercial system, Verispeak. Our findings indicate that the morphed voice samples achieved a high attack success rate, with G-MAP values reaching 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios, at a false match rate of 0.1%.
Primary: IIT Kharagpur
All Institutions: IIT Kharagpur
This work introduces a novel morphing technique for voice biometrics that significantly enhances the potential for attacks on speaker verification systems. The comprehensive evaluation of the TD-VIM method across various devices and languages demonstrates its effectiveness and raises critical security concerns in the field of biometric authentication.
The proposed Time-Domain Voice Identity Morphing (TD-VIM) method innovatively performs morphing at the signal level, circumventing the limitations of previous feature-based approaches. By selecting portions of voice signals and averaging them, the method achieves a language and backbone independence that enhances its applicability across diverse speaker verification systems. The methodology is well-structured, with clear steps outlined for signal selection, preprocessing, and morphing, although the paper could benefit from more detailed mathematical formulations and justifications for the choices made during these processes.
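At its simplest, the signal-level averaging described above reduces to an element-wise weighted mean of two aligned waveforms; TD-VIM's segment selection and preprocessing steps are omitted in this sketch, and `alpha` stands in for the morphing factor:

```python
def morph_signals(x, y, alpha=0.5):
    # Element-wise weighted average of two time-domain waveforms.
    # alpha = 1.0 returns x unchanged; alpha = 0.0 returns y.
    n = min(len(x), len(y))  # align lengths by truncation
    return [alpha * x[i] + (1.0 - alpha) * y[i] for i in range(n)]
```

Because this operates directly on samples, it needs no feature extractor, which is what makes the attack backbone- and language-independent.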
The experiments are comprehensive, utilizing a robust dataset (MAVS) and multiple speaker verification systems (SVS) to evaluate the effectiveness of the TD-VIM approach. The use of the Generalized Morph Attack Potential (G-MAP) metric provides a solid framework for quantifying the vulnerability of SVS to morphing attacks. Results indicate high attack success rates across different devices and languages, demonstrating the method's effectiveness. However, the paper could improve by including more comparative analyses with existing methods to highlight its advantages.
The authors provide access to the source code and morphed samples upon request, which is a positive aspect for reproducibility. However, the paper lacks detailed instructions on how to replicate the experiments fully, such as specific configurations and parameter settings used during the experiments.
One limitation is the reliance on a specific dataset (MAVS), which may not generalize to all voice biometric systems. Additionally, the paper does not address potential ethical concerns related to the misuse of morphing techniques in biometric systems. The impact of different environmental factors on the morphing effectiveness is also not explored, which could affect real-world applications.
The findings of this research have significant implications for the security of voice biometric systems, particularly in sensitive applications like banking and finance. By highlighting vulnerabilities, the work encourages the development of more robust verification systems and raises awareness about the potential for morphing attacks. The proposed method could lead to advancements in biometric security measures, prompting further research into countermeasures against such vulnerabilities.
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the severe error accumulation inherent to autoregressive models, leading to poor music quality and structural integrity. In this paper, we propose the Anchored Cyclic Generation (ACG) paradigm, which relies on anchor features from already generated music to guide subsequent generation during the autoregressive process, effectively mitigating error accumulation in autoregressive methods. Based on the ACG paradigm, we further propose the Hierarchical Anchored Cyclic Generation (Hi-ACG) framework, which employs a systematic global-to-local generation strategy and is highly compatible with our specifically designed piano token, an efficient musical representation. The experimental results demonstrate that, compared to traditional autoregressive models, the ACG paradigm reduces the cosine distance between predicted feature vectors and ground-truth semantic vectors by an average of 34.7%. In long-sequence symbolic music generation tasks, the Hi-ACG framework significantly outperforms existing mainstream methods in both subjective and objective evaluations. Furthermore, the framework exhibits excellent task generalization, achieving superior performance in related tasks such as music completion.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to long-sequence symbolic music generation through the Anchored Cyclic Generation paradigm, demonstrating significant improvements in quality and structural integrity. The methodology is innovative and well-supported by experimental results, marking a meaningful contribution to the field of machine learning in music generation.
The paper introduces the Anchored Cyclic Generation (ACG) paradigm, which effectively addresses the error accumulation problem in autoregressive models for long-sequence symbolic music generation. The methodology is well-structured, employing a hierarchical approach through the Hi-ACG framework that combines global and local generation strategies. The use of a novel piano token representation enhances efficiency and interpretability. The proposed methods are theoretically sound, supported by mathematical analysis, and demonstrate a clear innovation in the field of music generation.
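The anchored cyclic loop can be sketched abstractly; `generate_segment` and `extract_anchor` are placeholder callables standing in for the framework's generator and anchor extractor, which the paper does not reduce to this simple form:

```python
def anchored_generation(generate_segment, extract_anchor, seed_anchor, n_segments):
    # Each segment is conditioned on anchor features extracted from the
    # previously generated segment, rather than on raw autoregressive
    # context alone -- re-anchoring bounds error accumulation.
    anchor, piece = seed_anchor, []
    for _ in range(n_segments):
        segment = generate_segment(anchor)
        piece.append(segment)
        anchor = extract_anchor(segment)  # re-anchor for the next cycle
    return piece
```

In Hi-ACG this cycle runs hierarchically, with global-level anchors constraining local-level generation.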
The experimental evaluation is robust, utilizing both objective and subjective metrics to assess the performance of the proposed models against established baselines. The datasets used (MuseScore and POP909) are appropriate for the task, and the results indicate significant improvements in generation quality, as evidenced by a 34.7% reduction in cosine distance between predicted and ground-truth features. The comprehensive evaluation strategy enhances the credibility of the findings.
The paper provides sufficient details regarding the experimental setup, including model architecture, training procedures, and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing these resources to facilitate validation of results.
The paper acknowledges limitations in fine-grained control during generation and the potential loss of subtle timing nuances in the piano token representation. Additionally, the focus on piano music may restrict the applicability of the framework to other musical contexts. Future research should address these limitations by integrating more expressive tokens and extending the framework to multi-track music generation.
The proposed ACG paradigm has the potential to significantly advance the field of symbolic music generation, offering new avenues for creating high-quality, structurally coherent music. Its principles could be adapted to other long-sequence generation tasks beyond music, such as text generation and structured content synthesis, thereby broadening its impact across various domains.
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle to generalize to unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the Binaural Difference Attention with Action Transition Prediction (BDATP) framework, which jointly optimizes perception and policy. Specifically, the Binaural Difference Attention (BDA) module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the Action Transition Prediction (ATP) task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's superior generalization capability and its robustness across diverse navigation architectures.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The paper presents a novel framework for enhancing generalization in Audio-Visual Navigation through innovative attention mechanisms and action prediction strategies. The technical contributions are significant, addressing key challenges in the field and demonstrating strong empirical results, though improvements in reproducibility and application scope could further enhance its impact.
The proposed BDATP framework introduces two innovative components: the Binaural Difference Attention (BDA) module, which enhances spatial audio perception by focusing on interaural differences, and the Action Transition Prediction (ATP) task, which regularizes policy learning to improve generalization across unseen environments. This dual approach effectively addresses the limitations of existing AVN methods, particularly their tendency to overfit to specific training conditions. The methodology is well-structured, with clear explanations of how each component contributes to the overall framework.
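To make the BDA idea concrete, the following is a minimal toy sketch of how interaural differences could drive an attention-style reweighting of a fused audio feature. This is an illustration only, not the authors' implementation: the function name, the per-bin softmax over absolute level differences, and the mean-fusion of the two channels are all assumptions for exposition.

```python
import math

def binaural_difference_attention(left, right):
    """Toy BDA-style reweighting (hypothetical, for illustration):
    per-frequency-bin interaural differences are turned into softmax
    attention weights that emphasize spatially informative bins of a
    fused binaural feature."""
    # Interaural (left-minus-right) difference per frequency bin.
    diff = [l - r for l, r in zip(left, right)]
    # Softmax over absolute differences -> spatial attention weights.
    exps = [math.exp(abs(d)) for d in diff]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Fuse the two ears (mean) and reweight by spatial attention.
    fused = [(l + r) / 2 for l, r in zip(left, right)]
    return [w * f for w, f in zip(weights, fused)]
```

Bins with large interaural differences (strong spatial cues) receive the largest weights, which matches the stated goal of reducing reliance on semantic sound content in favor of spatial orientation.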
The experiments are comprehensive, utilizing two well-known datasets (Replica and Matterport3D) to evaluate the effectiveness of BDATP. The authors provide a thorough comparison against several state-of-the-art baselines, demonstrating significant performance improvements in both heard and unheard sound categories. The metrics used (Success Rate, Success weighted by Path Length, and Success weighted by Number of Actions) are appropriate for the task and provide a clear picture of the framework's capabilities.
The paper lacks explicit details on the implementation, such as hyperparameters, training procedures, and code availability, which could hinder reproducibility. While the methodology is described in detail, providing access to the code and models would greatly enhance the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world environments. Additionally, while the proposed methods show strong performance in zero-shot settings, the paper does not address how the framework would perform in dynamic environments with moving sound sources or in multi-agent scenarios.
The BDATP framework has the potential to significantly advance the field of audio-visual navigation, particularly in applications involving robotics and autonomous systems. Its focus on generalization could lead to more robust navigation systems in real-world scenarios, enhancing the capabilities of embodied agents in complex environments.
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations -- GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline (Whisper$\rightarrow$GPT-4o$\rightarrow$TTS) -- across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5\%); Gemini Live 3.1 achieves the fastest latency (4.25~s) but the lowest turn-take rate (78.0\%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12~s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
Primary: unknown
All Institutions: unknown
The paper introduces Full-Duplex-Bench-v3, a benchmark for evaluating real-time voice agents on multi-step tool execution using natural human speech. This work significantly contributes to the field by addressing the challenges of disfluency handling and tool use in voice interactions, paving the way for more effective and responsive AI systems.
The methodology is robust, introducing a novel benchmark (FDB-v3) that evaluates spoken language models under realistic conditions, utilizing real human audio annotated for disfluencies. The design incorporates multi-step tool use across various domains, which is a significant advancement over previous benchmarks that relied on synthetic data or single-step tasks. The systematic approach to scenario formulation and audio collection enhances the validity of the evaluation.
The experiments are comprehensive, evaluating six different model configurations across multiple dimensions such as accuracy, latency, and turn-taking dynamics. The results are well-presented, showing clear performance differences among models and highlighting specific strengths and weaknesses, particularly in handling disfluencies and multi-step reasoning. The use of deterministic mock APIs for evaluation is a strong point, ensuring that the results are not confounded by external factors.
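The three headline numbers quoted in the abstract (Pass@1, mean latency, turn-take rate) are all simple per-episode aggregates. A minimal sketch of how such a summary could be computed from logged episodes follows; the field names and function are hypothetical, not part of the FDB-v3 release.

```python
def summarize_runs(runs):
    """Aggregate per-episode logs into the three headline metrics
    (hypothetical field names, for illustration only):
    - pass@1: fraction of episodes solved on the first attempt
    - latency_s: mean response latency in seconds
    - turn_take_rate: fraction of episodes where the agent took its turn
    """
    n = len(runs)
    return {
        "pass@1": sum(r["success_first_try"] for r in runs) / n,
        "latency_s": sum(r["latency_s"] for r in runs) / n,
        "turn_take_rate": sum(r["took_turn"] for r in runs) / n,
    }
```

Reporting all three jointly is what exposes the trade-offs noted above, e.g. the Cascaded pipeline's perfect turn-take rate coexisting with the highest latency.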
The paper provides sufficient detail regarding the experimental setup, including the models evaluated and the evaluation metrics used, and the benchmark itself is open and reproducible, which is a positive aspect. However, implementation details and code for the evaluated systems are not available, and without access to those models, full replication of the results may be challenging.
The study acknowledges limitations, such as the fixed server region for cloud-based evaluations and the lack of robustness testing against real-world network anomalies. Additionally, the dataset is relatively small (100 recordings), which may affect generalizability. The focus on specific disfluency categories may also overlook other potential challenges in real-world interactions.
This work has significant implications for the development of real-time voice agents, particularly in enhancing their ability to handle natural speech disfluencies and multi-step tasks. The findings suggest directions for future research, emphasizing the need for models that can balance speed and accuracy in dynamic conversational contexts. The benchmark itself could facilitate further advancements in the field by providing a standardized evaluation framework.
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. Recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sounds, but they are limited to non-speech audio, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of OmniSonic, a novel framework for generating comprehensive auditory scenes from video and text inputs, addressing previous limitations in audio generation models. This work significantly advances the field of audio synthesis by integrating multiple modalities and establishing a new benchmark for future research.
The proposed OmniSonic framework introduces a flow-matching-based diffusion model that effectively integrates video and text to generate comprehensive auditory scenes. The TriAttn-DiT architecture is a notable innovation, allowing simultaneous processing of on-screen environmental sounds, off-screen sounds, and speech conditions. The use of a Mixture-of-Experts (MoE) gating mechanism is a sophisticated approach that enhances the model's adaptability during audio generation. This methodology is well-structured and addresses the limitations of previous models, particularly in generating human speech alongside environmental sounds.
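The gating idea behind the MoE mechanism can be illustrated with a small sketch: three condition branches (on-screen sound, off-screen sound, speech) each produce a feature, and a softmax gate decides how much each branch contributes to the fused output. This is a toy stand-in, not the TriAttn-DiT implementation; in the real model the branch outputs come from cross-attention and the gate scores are learned.

```python
import math

def moe_gate(scores):
    """Softmax gate over branch relevance scores (illustrative)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def triattn_step(branches, gate_scores):
    """Toy fusion of three condition branches (on-screen, off-screen,
    speech) by a gated weighted sum. Branch outputs and gate scores
    are given directly here; in the model they would be produced by
    cross-attention and a learned gating network."""
    weights = moe_gate(gate_scores)
    dim = len(branches[0])
    return [sum(w * b[i] for w, b in zip(weights, branches))
            for i in range(dim)]
```

With equal gate scores the branches are simply averaged; raising one branch's score shifts the output toward that condition, which is the adaptive balancing behavior the paper attributes to the MoE gate.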
The authors present extensive experiments that demonstrate the superiority of OmniSonic over existing state-of-the-art methods. The creation of the UniHAGen-Bench benchmark, which includes over a thousand samples across diverse scenarios, is a significant contribution that facilitates fair evaluation and comparison in the field. The combination of objective metrics and human evaluations provides a robust assessment of the model's performance, although specific metrics used for evaluation could be elaborated further for clarity.
The paper provides a project page with a URL, but lacks detailed implementation specifics in the text that would enhance reproducibility. While the methodology is sound, the absence of code or detailed experimental setups may hinder other researchers from replicating the results.
One limitation is the lack of detailed discussion on the computational resources required for training the OmniSonic model, which could be a barrier for some researchers. Additionally, while the model excels in generating audio from video and text, its performance in more nuanced or complex auditory environments remains to be fully explored.
The ability to generate holistic audio from multimodal inputs has significant implications for various applications, including film and video production, virtual reality, and assistive technologies for the hearing impaired. The advancements in audio generation could lead to more immersive experiences in entertainment and education, making this research highly relevant to both academic and industry stakeholders.