Speech deepfake detection is a well-established research field spanning diverse models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense provides the tools needed for equitable training data selection and front-end fine-tuning, facilitating real-world deployment.
Primary: German Research Center for Artificial Intelligence (DFKI)
All Institutions: German Research Center for Artificial Intelligence (DFKI), University of Stuttgart, National Institute of Informatics, Technical University of Berlin
The main contribution of this paper is the introduction of DeepFense, a comprehensive, modular, and extensible framework for robust deepfake audio detection that facilitates reproducible research and addresses critical biases in model performance. This work significantly advances the field by providing a standardized toolkit that enhances the ability to benchmark and compare deepfake detection models effectively.
The methodology presented in DeepFense is robust and well-structured, focusing on creating a modular and extensible framework for deepfake audio detection. The use of a configuration-driven design allows for easy experimentation and reproducibility, which is a significant advancement in the field. The integration of over 400 models and 100 recipes enhances the toolkit's utility for researchers. The modular architecture facilitates the isolation of algorithmic innovations from implementation artifacts, which is critical for accurate benchmarking.
The experimental evaluation is extensive, covering a large-scale comparison of 400 models across 13 datasets, which is a notable strength of the paper. The results provide valuable insights into the impact of front-end feature extractors, back-end architectures, and training datasets on model performance. The findings regarding biases in model performance based on audio quality, speaker gender, and language are particularly important for ensuring equitable AI systems.
The paper emphasizes reproducibility through its open-source nature and the provision of a comprehensive toolkit that allows other researchers to replicate experiments easily. The use of a single YAML file for experiment configuration is a strong point, as it simplifies the process of sharing and reproducing results.
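To make the configuration-driven design concrete, here is a minimal hypothetical sketch of how a single parsed recipe might drive component selection through a registry. The keys, model names, and registry entries are invented for illustration and are not DeepFense's actual schema; in practice the mapping would come from a YAML file.

```python
# Hypothetical sketch of a configuration-driven experiment launcher; the
# schema and registry entries are invented, not DeepFense's actual API.

MODEL_REGISTRY = {
    "aasist": lambda cfg: f"AASIST(dim={cfg['dim']})",  # stand-ins for real
    "lcnn":   lambda cfg: f"LCNN(dim={cfg['dim']})",    # model constructors
}

def build_experiment(config: dict) -> dict:
    """Instantiate every component named in a single config mapping."""
    model = MODEL_REGISTRY[config["model"]["name"]](config["model"])
    return {
        "model": model,
        "dataset": config["data"]["train"],
        "seed": config.get("seed", 0),
    }

# In practice this dict would come from yaml.safe_load(open("recipe.yaml"));
# it is inlined here to keep the sketch dependency-free.
recipe = {
    "model": {"name": "aasist", "dim": 64},
    "data": {"train": "asvspoof2019"},
    "seed": 1234,
}

exp = build_experiment(recipe)
print(exp["model"])  # AASIST(dim=64)
```

The registry pattern is what makes a single shared file sufficient to reproduce a run: every experimental choice is data, not code.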
While the paper presents a significant advancement, it acknowledges limitations such as the lack of a multi-dataset training pipeline and the focus solely on detection tasks. These limitations suggest areas for future research, including the need for more comprehensive training strategies that can mitigate biases.
The implications of this work are substantial, particularly in the context of increasing concerns about deepfake technology and its potential misuse. By providing a standardized toolkit for deepfake detection, DeepFense can help improve the robustness of systems used in real-world applications, thereby enhancing security and trust in voice biometric systems.
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100--1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Department of Electrical and Computer Engineering, Carl von Ossietzky Universität Oldenburg, Department of Medical Physics and Acoustics, Zhejiang University, College of Computer Science and Technology
This paper introduces a pioneering ANC system for open-ear smart glasses that operates without error microphones, demonstrating significant noise reduction capabilities in real-world settings. The innovative methodology and thorough experimental evaluation contribute meaningfully to the field of audio processing and wearable technology, paving the way for future advancements in auditory interfaces.
The paper presents a novel approach to active noise cancellation (ANC) specifically designed for open-ear smart glasses, which traditionally face challenges due to their non-occlusive design. The methodology leverages a dual-pipeline architecture that separates the estimation of noise propagation and the generation of anti-noise signals, utilizing a neural network for virtual in-ear sensing. This innovative approach circumvents the need for error microphones, which are typically required in ANC systems, by estimating the sound at the ear from an array of microphones distributed around the glasses frame. The use of a custom 3D-printed prototype and the integration of a low-latency DSP unit for real-time processing further enhance the practicality of the solution.
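The anti-noise principle underlying any ANC system can be illustrated with a classic single-reference LMS adaptive filter. To be clear, this is a textbook baseline and not the paper's method (which uses an eight-microphone array and a neural virtual-sensing model); it only shows how filter weights adapt so that the generated signal cancels the disturbance at the ear.

```python
import math

# Toy single-reference LMS anti-noise filter: adapt weights w so that
# y = w . x cancels the disturbance d, leaving a small residual e.
# This is a classic baseline for illustration, not the paper's neural system.

def lms_anc(reference, disturbance, taps=8, mu=0.01):
    w = [0.0] * taps
    buf = [0.0] * taps
    residuals = []
    for x, d in zip(reference, disturbance):
        buf = [x] + buf[:-1]                           # shift reference in
        y = sum(wi * xi for wi, xi in zip(w, buf))     # anti-noise estimate
        e = d - y                                      # residual at the ear
        w = [wi + mu * e * xi for wi, xi in zip(w, buf)]  # LMS weight update
        residuals.append(e)
    return residuals

# Tonal noise reaching the ear: a scaled, delayed copy of the reference.
n = 4000
ref = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(n)]
dist = [0.8 * math.sin(2 * math.pi * 200 * (t - 2) / 8000) for t in range(n)]

res = lms_anc(ref, dist)
before = sum(d * d for d in dist[-500:])
after = sum(e * e for e in res[-500:])
print(10 * math.log10(before / after))  # positive dB means noise was reduced
```

The paper's contribution is precisely that the error signal e is not directly measurable in an open-ear design, which is why a learned virtual-sensing stage must estimate the sound at the ear from the frame-mounted microphones.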
The experimental evaluation is robust, encompassing controlled benchtop tests on a mannequin head and real-world user studies across various environments. The authors demonstrate effective noise reduction performance, achieving a mean reduction of 9.6 dB without calibration and 11.2 dB with user-specific calibration. The study involved 11 participants and assessed performance across 8 different environments, showcasing the system's adaptability to diverse acoustic conditions. The use of both objective metrics (e.g., noise reduction levels) and subjective user ratings (e.g., clarity and intrusiveness) strengthens the evaluation.
The paper provides detailed descriptions of the hardware setup, including the specifications of the microphones and DSP units used, as well as the neural network architecture. However, the lack of a publicly accessible demo or project URL limits reproducibility. The authors do mention the use of a calibration procedure, which could be a barrier for replication without access to the same hardware setup.
Key limitations include the system's reduced performance in outdoor environments due to wind noise, which the authors acknowledge as a significant challenge for open-ear designs. Additionally, the reliance on a brief calibration procedure may not be feasible for all users, particularly if the glasses shift during extended wear. The neural network's filter update rate of 200 ms could also hinder responsiveness to rapid changes in the acoustic environment.
The potential applications of this research extend beyond smart glasses to other open-ear wearables, such as augmented and virtual reality headsets. The ability to enhance audio clarity in noisy environments could significantly improve user experience in various contexts, including professional training and everyday use. The findings could also inform future developments in auditory interfaces and personalized hearing assistance technologies.
Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality weights. However, they usually compress diverse sentiment cues into a single compact representation before sentiment reasoning. This early aggregation makes it difficult to preserve the internal structure of sentiment evidence, where different cues may complement, conflict with, or differ in reliability from each other. In addition, modality importance is often determined only once during fusion, so later reasoning cannot further adjust modality contributions. To address these issues, we propose PRISM, a framework that unifies structured affective extraction and adaptive modality evaluation. PRISM organizes multimodal evidence in a shared prototype space, which supports structured cross-modal comparison and adaptive fusion. It further applies dynamic modality reweighting during reasoning, allowing modality contributions to be continuously refined as semantic interactions become deeper. Experiments on three benchmark datasets show that PRISM outperforms representative baselines.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Zhongguancun Academy
The main contribution of this paper is the introduction of the PRISM framework, which effectively organizes multimodal sentiment evidence into structured prototypes, allowing for adaptive evaluation and dynamic reweighting of modality contributions. This approach significantly advances the state-of-the-art in multimodal sentiment analysis, providing a robust methodology that could influence future research and applications in the field.
The proposed PRISM framework innovatively addresses the limitations of existing multimodal sentiment analysis methods by introducing a shared sentiment prototype bank that organizes multimodal evidence into structured affective components. This design allows for adaptive modality evaluation and dynamic reweighting, enhancing the model's ability to capture nuanced sentiment cues across different modalities. The methodology is well-articulated, with clear explanations of how each component interacts within the framework, particularly the cross-attention mechanism and the dynamic modality reweighting process.
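The dynamic reweighting idea can be sketched generically: at each reasoning step, a per-modality reliability score is mapped to softmax weights, and the fused representation is the weighted sum of modality features. The scoring function and update schedule below are illustrative assumptions, not PRISM's actual design.

```python
import math

# Generic sketch of dynamic modality reweighting during iterative reasoning.
# The toy scores and features are invented; PRISM's actual mechanism
# (prototype-space comparison plus cross-attention) is more involved.

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def fuse(features, weights):
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for w, feat in zip(weights, features.values()):
        fused = [f + w * x for f, x in zip(fused, feat)]
    return fused

# Toy per-modality features; as reasoning deepens (step 2), the text
# modality's reliability score rises and its contribution grows.
features = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "video": [0.5, 0.5]}
step_scores = [[1.0, 1.0, 1.0], [3.0, 0.5, 0.5]]

for scores in step_scores:
    w = softmax(scores)
    print([round(x, 3) for x in w], [round(x, 3) for x in fuse(features, w)])
```

The key contrast with one-shot fusion is that the weights are recomputed at every step, so later reasoning can still shift contribution between modalities.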
The experiments conducted on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and CH-SIMS) demonstrate the effectiveness of the PRISM framework, showing significant improvements over various baseline models. The use of ablation studies to assess the contribution of each component adds rigor to the evaluation, confirming the necessity of the proposed methods for achieving optimal performance. The results are compelling, with PRISM outperforming established approaches across multiple metrics.
The paper provides sufficient implementation details, including the architecture, training procedures, and hyperparameter settings, which enhances reproducibility. The availability of the code on GitHub further supports this aspect, allowing other researchers to replicate the experiments and validate the findings.
While the paper presents a strong framework, it does not extensively discuss potential limitations, such as the scalability of the model to larger datasets or its performance in real-world applications outside the benchmark settings. Additionally, the reliance on pre-extracted features may limit the model's adaptability to different input modalities or domains.
The PRISM framework has significant implications for fields such as affective computing, human-computer interaction, and content understanding, where accurate sentiment analysis is crucial. By improving multimodal sentiment analysis, this work could enhance applications in social media monitoring, customer feedback analysis, and interactive AI systems that require nuanced understanding of human emotions.
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.
Primary: Communication University of China
All Institutions: Communication University of China, Ant Group, Chinese Academy of Sciences, Beijing Institute of Technology, Shanghai Jiao Tong University
The paper presents the AT-ADD challenge, a comprehensive evaluation framework for audio deepfake detection that addresses existing gaps in robustness and generalization across audio types. This work is significant as it lays the groundwork for advancing audio forensic technologies, promoting secure communication and reliable media verification in the face of growing synthetic audio threats.
The methodology presented in the paper is robust, proposing a structured evaluation framework for audio deepfake detection that includes two distinct tracks focusing on speech and all-type audio. The challenge is designed to address the limitations of existing benchmarks by incorporating real-world conditions and diverse audio types. The datasets are well-constructed, ensuring a comprehensive evaluation of the proposed countermeasures (CMs) under various conditions, which enhances the reliability of the results.
The experimental evaluation is thorough, with a clear description of dataset composition, including the number of samples and the diversity of audio types. The inclusion of multiple state-of-the-art generation methods for both real and fake audio in the evaluation sets allows for a rigorous assessment of the CMs' performance. Baseline models are provided, which facilitate fair comparisons and establish a strong foundation for future research.
The paper emphasizes reproducibility by providing official implementations of baseline models and clear rules regarding data usage. The closed setting for the challenge ensures that participants can only use the provided datasets, which minimizes variability and enhances the reliability of the results. However, the paper could benefit from more detailed implementation instructions or links to code repositories for the proposed methods.
One limitation of the proposed challenge is that it may not fully capture the complexity of real-world audio deepfake scenarios, especially in terms of environmental variability and user-generated content. Additionally, the focus on specific audio types may overlook other emerging forms of audio manipulation. The challenge's closed setting might also restrict innovative approaches that could leverage external data.
The AT-ADD challenge has significant implications for the field of audio forensics and security, as it aims to improve the robustness and generalizability of audio deepfake detection systems. By addressing the challenges associated with diverse audio types and real-world conditions, the challenge promotes the development of technologies that can enhance media verification and secure communication in an era of increasing synthetic audio generation.
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder that balances the two objectives. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Hello Group Inc.
The main contribution of this paper is the introduction of CapTalk, a unified framework for voice design that effectively integrates single-utterance and dialogue generation, achieving state-of-the-art results while addressing key challenges in expressive speech synthesis. The comprehensive methodology and rigorous experimental evaluation position this work as a significant advancement in the field of machine learning and speech generation.
The paper introduces CapTalk, a unified caption-conditioned text-audio autoregressive framework that innovatively extends voice design to dialogue settings. The methodology effectively incorporates hierarchical variational conditioning to balance stable timbre preservation and context-adaptive expression, which is a significant advancement over existing methods that primarily focus on single-utterance generation. The use of CoT control sequences for explicit turn-level expressive control is a novel approach that enhances the model's ability to handle dynamic dialogue contexts.
The experiments demonstrate that CapTalk achieves state-of-the-art performance on single-utterance voice design benchmarks and shows improved expression controllability and contextual appropriateness in multi-turn dialogue. The evaluation protocol is comprehensive, utilizing both human evaluations and automatic metrics, which strengthens the reliability of the results. The paper provides detailed comparisons with existing models, showcasing the advantages of CapTalk through various metrics.
The paper outlines the architecture and training objectives clearly, which aids in reproducibility. However, the reliance on a specific multimodal model (Qwen3-Omni) for caption generation could limit the generalizability of the results if the model's performance varies. The authors plan to release caption annotations and a subset of data, which will further enhance reproducibility.
The paper acknowledges limitations related to the quality of the caption generation process and the emotional expressiveness of the training data, which primarily consists of natural conversational speech. These factors may impact the model's performance in more expressive settings. Additionally, the evaluation benchmarks for dialogue are still developing, which may affect the assessment of the model's capabilities.
CapTalk has the potential to significantly impact the fields of conversational AI and speech synthesis by enabling more natural and context-aware dialogue systems. The ability to generate expressive speech from textual descriptions could enhance applications in virtual assistants, gaming, and interactive storytelling, making human-computer interactions more engaging and realistic.
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxa, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.
Primary: Muséum National d'Histoire Naturelle
All Institutions: Muséum National d'Histoire Naturelle, Sebitoli Chimpanzee Project, Uganda Wildlife Authority, Nitidae Association, Centre d'Ecologie et des Sciences de la Conservation, Institut de Systématique, Evolution, Biodiversité
This paper presents DeepForestSound (DFS), a multi-species automatic detection model for passive acoustic monitoring in African tropical forests, demonstrating a significant advancement in biodiversity monitoring techniques. The innovative methodology, rigorous experimental evaluation, and potential for real-world applications underscore its importance in the field of machine learning and conservation biology.
The methodology presented in this paper is robust and innovative, utilizing a semi-supervised pipeline to generate labeled datasets from unannotated acoustic recordings. The combination of clustering techniques with manual validation, followed by fine-tuning a pretrained Audio Spectrogram Transformer (AST) using Low-Rank Adaptation (LoRA), is particularly noteworthy. This approach addresses the challenge of limited annotated data in biodiversity monitoring effectively. The detailed steps taken in data collection, processing, and model training demonstrate a comprehensive understanding of the complexities involved in acoustic monitoring in tropical environments.
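The LoRA idea referenced here can be illustrated in a few lines: the frozen pretrained weight W is augmented by a trainable low-rank product (alpha/r) * B @ A, so only r * (d_in + d_out) parameters are trained instead of d_in * d_out. The shapes and values below are toy examples, not the authors' AST configuration.

```python
# Minimal illustration of low-rank adaptation (LoRA): y = (W + s * B @ A) x,
# with W frozen and only the small matrices A and B trainable.
# Toy shapes and values; not the DFS/AST configuration.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha, r):
    """Apply the effective weight W + (alpha/r) * B @ A to input x."""
    scale = alpha / r
    BA = matmul(B, A)                          # (d_out x d_in) low-rank term
    W_eff = [[w + scale * u for w, u in zip(wr, ur)]
             for wr, ur in zip(W, BA)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_eff]

d_in, d_out, r = 3, 2, 1
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]                          # frozen pretrained weight
A = [[0.1, 0.1, 0.1]]                          # r x d_in, trainable
B = [[1.0], [2.0]]                             # d_out x r, trainable

y = lora_forward([1.0, 2.0, 3.0], W, A, B, alpha=2.0, r=r)
print(y)
```

The contrast with the DFS-Linear baseline is visible in the parameter count: linear probing trains only a classification head on frozen features, whereas LoRA injects trainable low-rank updates throughout the backbone at modest extra cost.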
The experiments are well-structured, with a clear evaluation protocol that includes comparisons with existing models such as BirdNET, Perch v2, and RDet. The results indicate that DFS outperforms these models for non-avian taxa, which is significant given the ecological importance of these species. The use of Average Precision (AP) and best F1 scores as evaluation metrics is appropriate, and the results are presented clearly, highlighting the model's strengths and weaknesses across different taxa.
The paper provides sufficient detail on the implementation of the model, including the datasets used, preprocessing steps, and training configurations. However, the inability to share raw audio recordings due to legal restrictions may limit full reproducibility. The availability of the code and pretrained models on GitHub is a positive aspect that enhances reproducibility.
One limitation identified is the focus on a specific geographic region (Kibale National Park) and the potential lack of generalizability to other tropical forest ecosystems. Additionally, while the semi-supervised clustering approach is effective, the authors acknowledge that a systematic sensitivity analysis of hyperparameters was not conducted, which could affect the robustness of the model. The model's performance on underrepresented species may also be influenced by the limited training data available for those taxa.
The implications of this research are significant for biodiversity conservation, particularly in underrepresented and threatened species within African tropical forests. The development of a task-oriented model like DFS can facilitate more effective monitoring and conservation efforts, potentially leading to better-informed ecological management strategies. The framework's adaptability for future species integration also suggests a scalable approach to biodiversity assessment.
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is retained in the estimates, due to the inseparability of the background noises and the loss function's symmetry. To address this, we propose ring mixing, a batch strategy of using each source in two mixtures, alongside a new Signal-to-Consistency-Error Ratio (SCER) auxiliary loss penalizing inconsistent estimates of the same source from different mixtures, breaking symmetry and incentivizing denoising. On a WHAM!-based benchmark, our method can reduce residual noise by upwards of half, effectively learning to denoise from only noisy recordings. This opens the door to training more generalizable systems using in-the-wild data, which we demonstrate via systems trained using naturally-noisy speech from VoxCeleb.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Carnegie Mellon University
The paper presents a significant advancement in the field of unsupervised speech separation by introducing innovative methodologies that effectively address the challenges posed by noisy training data. The combination of ring mixing and SCER loss represents a promising direction for future research, with the potential to improve the generalization of speech separation systems in real-world applications.
The paper introduces a novel batch construction strategy called "ring mixing" and an auxiliary loss function termed Signal-to-Consistency-Error Ratio (SCER). The methodology effectively addresses the limitations of conventional supervised training in noisy speech separation tasks by breaking the symmetry in the loss function that leads to undesirable optima. The use of multiple mixtures for the same source in training helps in reducing residual noise and improving the generalization of the model to real-world scenarios. The approach is well-justified, with a clear explanation of the problems with existing methods and a logical progression to the proposed solutions.
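The batch construction can be sketched directly: with sources s_0 through s_{N-1}, mixture i is s_i + s_{(i+1) mod N}, so every source appears in exactly two mixtures. The SCER form below, signal power over the power of the difference between the two estimates of the same source, is our reading of the name and is not a definition taken from the paper.

```python
import math

# Sketch of ring mixing plus a plausible SCER consistency measure.
# The SCER formula is an assumption based on the metric's name.

def ring_mixtures(sources):
    """Mixture i = source i + source (i+1) mod N: each source in 2 mixtures."""
    n = len(sources)
    return [[a + b for a, b in zip(sources[i], sources[(i + 1) % n])]
            for i in range(n)]

def scer_db(est_a, est_b):
    """Signal power of one estimate over the power of their disagreement."""
    sig = sum(x * x for x in est_a)
    err = sum((x - y) ** 2 for x, y in zip(est_a, est_b)) + 1e-12
    return 10 * math.log10(sig / err)

sources = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mixes = ring_mixtures(sources)

# Two hypothetical estimates of source 0, one recovered from mixture 0 and
# one from mixture 2; a consistent separator makes them agree, raising SCER.
est_from_mix0 = [0.9, 0.05, 0.0]
est_from_mix2 = [0.92, 0.0, 0.04]
print(round(scer_db(est_from_mix0, est_from_mix2), 2))
```

Because shared background noise differs between the two mixtures while the target source does not, an estimator that keeps mixture noise scores poorly on this consistency term, which is exactly the symmetry-breaking pressure the paper describes.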
The experiments conducted on the WHAM! dataset demonstrate significant improvements in denoising capabilities, with results indicating a reduction in residual noise by upwards of half. The evaluation metrics, including SI-SDR and occupancy metrics, provide a comprehensive assessment of the model's performance. The results show that the proposed SCER loss contributes positively to the denoising task while maintaining separation quality, which is a critical aspect of the research.
The paper provides sufficient details regarding the datasets, model architecture, and training configurations, which are essential for reproducibility. However, the lack of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results. The hyperparameter settings, particularly for the SCER loss, are mentioned but not extensively tuned, which could affect reproducibility in varying contexts.
One notable limitation is the observed degradation in performance when evaluating on noiseless conditions, suggesting that the model may not generalize well to all scenarios. Additionally, the reliance on specific datasets may limit the applicability of the findings to other types of noisy speech environments. The authors also mention that the SCER loss can lead to local minima, which may hinder optimal performance.
The proposed methods have significant implications for real-world applications in speech separation and denoising, particularly in environments where overlapping speech and background noise are prevalent. The ability to train models using naturally noisy recordings could enhance the robustness of speech processing systems in various applications, including telecommunications, hearing aids, and voice recognition systems. This work opens avenues for further research into unsupervised learning techniques in audio processing.
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that generates CTC posterior distributions within a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, AISpeech Ltd, Nanjing University
The main contribution of this paper is the introduction of TASU2, a controllable CTC simulation framework that significantly improves the alignment and adaptation of speech LLMs in low-resource settings. The methodology and results presented demonstrate a meaningful advancement in the efficiency and effectiveness of speech recognition systems, particularly in the context of limited data availability.
The methodology proposed in TASU2 is innovative, focusing on controllable CTC simulation to improve the alignment between text and speech representations. The use of a WER-conditioned approach allows for more precise control over the generated posteriors, which is a significant advancement over previous methods like TASU. The authors effectively integrate a lightweight Transformer architecture to achieve this, which is appropriate for the task. The algorithm is well-structured, and the training signal is designed to closely mimic real acoustic behavior, enhancing the fidelity of the simulation.
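As a rough illustration of WER-conditioned posterior simulation: the actual TASU2 simulator is a lightweight Transformer, so this toy version, with substitution-only corruption and blanks interleaved between tokens, is purely an assumed stand-in for the idea of dialing supervision difficulty via a target error rate:

```python
import numpy as np

def simulate_ctc_posterior(tokens, vocab_size, target_wer=0.1,
                           blank_id=0, smoothing=0.05, rng=None):
    """Hypothetical sketch: corrupt a transcript at roughly
    `target_wer` substitution rate, interleave CTC blanks, and emit a
    smoothed frame-level posterior matrix of shape (frames, vocab)."""
    rng = rng or np.random.default_rng(0)
    # Substitute each token with probability target_wer.
    corrupted = [t if rng.random() >= target_wer
                 else int(rng.integers(1, vocab_size)) for t in tokens]
    # Interleave blanks to mimic CTC's blank-dominated frame sequence.
    frames = [blank_id]
    for t in corrupted:
        frames.extend([t, blank_id])
    # One smoothed probability row per frame.
    post = np.full((len(frames), vocab_size), smoothing / vocab_size)
    for i, t in enumerate(frames):
        post[i, t] += 1.0 - smoothing
    return post / post.sum(axis=1, keepdims=True)
```

Raising `target_wer` would then move the simulated supervision from near-clean to heavily corrupted, which is the controllability the review credits TASU2 with.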
The experiments are comprehensive, evaluating TASU2 across various datasets and settings, including low-resource adaptation scenarios. The results demonstrate consistent improvements over the baseline methods, particularly in terms of WER reduction and domain generalization. The paper provides a thorough analysis of the results, including ablation studies that validate the importance of the WER conditioning. However, the provided text does not detail specific quantitative results (e.g., exact WER scores); including them would enhance clarity.
The paper outlines the training and evaluation setup, including the architecture of the simulator and the datasets used. However, the absence of a public code repository or detailed implementation instructions limits the reproducibility of the results. Providing a GitHub link or similar would significantly enhance this aspect.
One limitation is the reliance on a teacher ASR system for generating posteriors, which may introduce biases depending on the quality of the ASR model used. Additionally, while the method shows promise in low-resource settings, its performance in extremely low-resource scenarios remains to be fully explored. The paper could also benefit from a discussion on the scalability of the approach to larger datasets or more complex domains.
The proposed TASU2 framework has significant implications for the field of speech recognition, particularly in scenarios where paired audio-text data is scarce. By enabling effective low-resource adaptation, it opens avenues for deploying speech LLMs in diverse languages and dialects, thereby enhancing accessibility and usability in various applications. This could lead to advancements in real-time translation, voice assistants, and other speech-driven technologies.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.
Primary: NIO
All Institutions: NIO
The paper presents a novel approach to entropy allocation in LLM-based ASR systems, significantly contributing to the understanding and improvement of model performance. The methodology is well-structured, and the experimental results validate the proposed framework, marking a meaningful advancement in the field of audio processing and machine learning.
The paper introduces an innovative perspective on entropy allocation in LLM-based ASR systems, proposing new metrics (NSE, PAI, CSAI) to analyze the dynamics between speech encoders and LLMs. The multi-stage training strategy, particularly the iterative asynchronous SFT (IA-SFT) stage, is a significant methodological advancement that aims to preserve functional decoupling and mitigate hallucinations. The approach is well-grounded in theoretical considerations and is supported by empirical evidence, making it a robust contribution to the field.
The experiments conducted on Mandarin and English benchmarks demonstrate the effectiveness of the proposed methods, achieving competitive performance with significantly fewer parameters than state-of-the-art models. The paper provides a comprehensive comparison with existing models, showcasing improvements in both recognition accuracy and hallucination rates. The use of diverse datasets strengthens the validity of the results.
The paper includes detailed descriptions of the training procedures, data statistics, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository or demo URL limits the practical reproducibility of the findings.
While the proposed method shows promise, the paper does not address potential scalability issues when applied to larger datasets or more complex ASR tasks. Additionally, the reliance on specific metrics for evaluation may not capture all aspects of model performance, particularly in real-world scenarios.
The research has significant implications for the deployment of LLM-based ASR systems in real-world applications, particularly in enhancing recognition accuracy while reducing hallucinations. The findings could influence future research directions in ASR and multimodal systems, promoting more efficient and robust architectures.
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
Primary: Unknown
All Institutions: Unknown
The paper presents a novel Teacher-Guided Dual-Path framework for audio-visual representation learning, significantly improving state-of-the-art performance in zero-shot retrieval tasks. The comprehensive methodology and experimental validation highlight its potential impact on the field, addressing critical challenges in cross-modal alignment and semantic noise reduction.
The proposed TG-DP framework effectively decouples the objectives of masked reconstruction and contrastive learning into separate optimization paths. This dual-path approach allows for tailored visibility patterns that enhance cross-modal alignment while mitigating semantic noise and optimization interference. The introduction of a teacher-student mechanism further enriches the training process by providing structured guidance, which is a noteworthy advancement in the field. The methodology is well-structured and addresses existing challenges in audio-visual representation learning.
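A minimal sketch of the decoupled masking idea, assuming the reconstruction branch sees a random patch subset while the contrastive branch keeps the patches a teacher scores as most salient; the top-k ranking rule and the function signature are hypothetical, not TG-DP's actual mechanism:

```python
import numpy as np

def dual_path_visibility(teacher_scores, mask_ratio=0.75, rng=None):
    """Return two visibility index sets over the same patch sequence:
    a random one for masked reconstruction, and a teacher-ranked one
    for the contrastive pathway."""
    rng = rng or np.random.default_rng(0)
    n = len(teacher_scores)
    n_visible = max(1, int(round(n * (1.0 - mask_ratio))))
    # Reconstruction branch: uniform random visibility (MAE-style).
    recon_visible = rng.choice(n, size=n_visible, replace=False)
    # Contrastive branch: keep the most teacher-salient patches.
    contrastive_visible = np.argsort(teacher_scores)[::-1][:n_visible]
    return np.sort(recon_visible), np.sort(contrastive_visible)
```

The point of the decoupling is visible in the return value: the two branches can operate on entirely different subsets of the same input without one objective constraining the other's visibility pattern.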
The experiments are comprehensive, utilizing large-scale datasets such as AudioSet-2M and VGGSound. The results demonstrate significant improvements in zero-shot retrieval performance, achieving state-of-the-art results across various metrics. The ablation studies provide valuable insights into the effectiveness of the proposed components, such as the dual-path structure and teacher-guided masking strategy. However, the paper could benefit from more detailed comparisons with additional baselines to further validate the claims.
The paper provides a clear description of the methodology and experimental setup, including hyperparameters and data preprocessing steps. The availability of code on GitHub enhances reproducibility. However, the lack of detailed information on the training environment and specific configurations may pose challenges for complete replication.
The primary limitation is the unknown primary institution and the lack of citation context, which may hinder the paper's visibility and impact in the academic community. Additionally, the performance improvements, while significant, may still be context-dependent and require further validation across diverse tasks and datasets.
The advancements in audio-visual representation learning have the potential to enhance various applications, including multimedia retrieval, content-based recommendation systems, and interactive AI systems. The proposed framework could lead to more robust models that understand and integrate audio-visual information, paving the way for future research and applications in multimodal AI.
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Shanghai Jiao Tong University, HKUST (GZ), Xidian University
The main contribution of this paper is the introduction of AudioKV, a novel framework for efficient KV cache management in audio-language models, which significantly enhances performance while reducing memory usage. This work addresses a critical bottleneck in deploying LALMs and offers a robust solution that combines innovative methodologies with thorough experimental validation, marking a meaningful advancement in the field of machine learning for audio processing.
The methodology presented in the paper is innovative, focusing on the unique challenges of Key-Value (KV) cache management in Large Audio-Language Models (LALMs). The authors propose a dual approach that combines audio-aware head allocation with Spectral Score Smoothing (SSS) to enhance the efficiency of KV cache usage. The identification of audio-critical attention heads through attention score analysis is a significant contribution, as it allows for a more nuanced allocation of resources compared to traditional uniform methods. The SSS technique, which employs FFT-based filtering to stabilize importance scores, is particularly noteworthy for its potential to improve performance in dynamic audio contexts.
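The paper describes SSS as FFT-based global filtering that suppresses high-frequency noise in the importance scores; a plausible low-pass realization is sketched below, though the `keep_ratio` parameterization and hard spectral cutoff are assumptions rather than the published method:

```python
import numpy as np

def spectral_score_smoothing(scores, keep_ratio=0.1):
    """Assumed form of SSS: zero out high-frequency components of the
    token-importance curve so that KV-cache token selection follows
    the smooth global trend rather than per-token noise."""
    spec = np.fft.rfft(scores)
    cutoff = max(1, int(len(spec) * keep_ratio))
    spec[cutoff:] = 0.0  # suppress high-frequency components
    return np.fft.irfft(spec, n=len(scores))
```

On a noisy importance curve, such a filter recovers the slow-varying trend, which would make top-k token selection less sensitive to isolated attention-score spikes.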
The experiments are comprehensive and demonstrate the effectiveness of AudioKV across multiple benchmarks, including Automatic Speech Recognition (ASR) and Speech Translation (ST). The results show that AudioKV outperforms existing methods significantly, especially at high compression ratios where other methods fail. The use of diverse datasets and models strengthens the validity of the findings, and the detailed performance metrics provide a clear picture of the advantages of the proposed method.
The paper mentions that the code will be released after acceptance, which is a positive step towards reproducibility. However, the absence of a public demo or project URL limits immediate access to the implementation details. The methodology is described in sufficient detail to allow for replication, but the lack of a publicly available codebase at this time is a drawback.
One limitation noted in the paper is the potential for repetition and degeneration in output under high KV cache compression ratios, which could affect the quality of generated text. Additionally, while the method shows promise, its applicability to other modalities beyond audio is not explored, which may limit its generalizability.
The implications of this work are significant for the deployment of LALMs in real-world applications, particularly in resource-constrained environments where efficient memory usage is critical. The techniques developed could lead to advancements in speech recognition and multimodal interactions, potentially enhancing user experiences in various applications such as virtual assistants, transcription services, and interactive audio systems.
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extract the interfering speaker. Furthermore, conventional TSE relies on a static inference pipeline, where performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios. Our code and checkpoints are available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Nanjing University, Huawei Technologies Co., Ltd.
The main contribution of this paper is the introduction of EvoTSE, a novel framework for Target Speaker Extraction that dynamically updates speaker enrollments to mitigate speaker confusion and improve performance in challenging audio environments. This work significantly advances the state of the art in TSE, particularly in handling out-of-domain scenarios, and provides a solid foundation for future research in audio processing and speaker identification.
The proposed EvoTSE framework innovatively addresses the limitations of static enrollment in Target Speaker Extraction (TSE) by introducing a dynamic, evolving enrollment mechanism that utilizes historical context to adaptively update speaker cues. The methodology integrates a contextual retriever, backbone extractor, reliability classifier, and memory curator, which collectively enhance the robustness of speaker extraction in long-duration audio scenarios. The approach is well-structured and leverages existing concepts like Retrieval-Augmented Generation (RAG) while extending them into the audio domain, showcasing a thoughtful adaptation of techniques to solve a specific problem in TSE.
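A toy sketch of reliability-filtered enrollment updating, assuming the reliability classifier outputs a scalar confidence and the enrollment is a mean over retained embeddings; both assumptions, along with the class and parameter names, are hypothetical rather than EvoTSE's actual memory curator:

```python
import numpy as np

class EvolvingEnrollment:
    """Keep high-confidence historical speaker embeddings and refresh
    the enrollment from them, so a poor initial enrollment matters
    less as extraction proceeds."""

    def __init__(self, init_embedding, threshold=0.8, capacity=16):
        self.memory = [np.asarray(init_embedding, dtype=float)]
        self.threshold = threshold
        self.capacity = capacity

    def update(self, embedding, confidence):
        # Reliability filter: only retain estimates the classifier trusts.
        if confidence >= self.threshold:
            self.memory.append(np.asarray(embedding, dtype=float))
            self.memory = self.memory[-self.capacity:]  # bounded memory

    def enrollment(self):
        # Current cue: mean over the retained high-confidence history.
        return np.mean(self.memory, axis=0)
```

The bounded-capacity memory also illustrates the limitation the review raises: if early low-quality estimates pass the filter, they pollute the enrollment until enough reliable history displaces them.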
The experimental setup is comprehensive, utilizing multiple datasets including WSJ0-2mix, Libri2mix-clean, and a newly constructed Emotional Speech Database (ESD) to evaluate the model's performance across various conditions. The results demonstrate consistent improvements in extraction quality, particularly in out-of-domain scenarios, which is a significant contribution to the field. The use of multiple evaluation metrics, including SI-SDRi and NSR, provides a robust framework for assessing the model's effectiveness.
The paper provides sufficient implementation details, including model configurations and training strategies, which enhance reproducibility. However, the absence of a clear mention of the specific venue or publication may hinder broader accessibility to the research community. The availability of code and checkpoints on GitHub is a positive aspect that supports reproducibility.
One limitation is the reliance on the quality of historical estimates, which may introduce noise if the initial enrollment is poor. Additionally, while the framework shows promise in OOD scenarios, the paper does not extensively discuss the computational complexity or real-time applicability of the EvoTSE framework in practical applications.
The EvoTSE framework has significant implications for real-world applications such as voice assistants, automated transcription services, and any system requiring speaker identification in noisy environments. By improving the robustness of TSE, this work could enhance user experiences in various audio processing applications, particularly in dynamic and emotionally varied contexts.
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, hindering low-resource languages from reaching high-resource performance. To address this, we propose a semi-supervised framework based on Semantic-Emotional Resonance Embedding (SERE), a cross-lingual dynamic feature paradigm that requires neither target language labels nor translation alignment. Specifically, SERE constructs an emotion-semantic structure using a small number of labeled samples. It learns human emotional experiences through an Instantaneous Resonance Field (IRF), enabling unlabeled samples to self-organize into this structure. This achieves semi-supervised semantic guidance and structural discovery. Additionally, we design a Triple-Resonance Interaction Chain (TRIC) loss to enable the model to reinforce the interaction and embedding capabilities between labeled and unlabeled samples during emotional highlights. Extensive experiments across multiple languages demonstrate the effectiveness of our method, requiring only 5-shot labeling in the source language.
Primary: Xinjiang University
All Institutions: Xinjiang University, Pengcheng Laboratory Xinjiang Network Node, Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center, Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing
The paper presents a semi-supervised framework for cross-lingual speech emotion recognition that effectively utilizes limited labeled data to improve performance across multiple languages. The technical contributions, particularly the novel use of dynamic feature extraction and interaction mechanisms, position this work as a meaningful advancement in the field of machine learning and emotion recognition.
The proposed methodology introduces a novel semi-supervised framework, Semantic-Emotional Resonance Embedding (SERE), which effectively addresses the challenges of cross-lingual speech emotion recognition (CLSER) by leveraging a small number of labeled samples to construct an emotion-semantic structure. The use of the Instantaneous Resonance Field (IRF) and the Triple-Resonance Interaction Chain (TRIC) loss is innovative, allowing for dynamic feature extraction and interaction between labeled and unlabeled data, which enhances the model's ability to generalize across languages.
The experiments are extensive, covering multiple languages and demonstrating the effectiveness of the proposed method with only 5-shot labeling. The results show significant improvements over existing methods, indicating the robustness of the approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and additional metrics to strengthen the evaluation.
While the methodology is described in detail, the lack of a publicly available code repository limits reproducibility. Including implementation details, hyperparameters, and data preprocessing steps would enhance reproducibility.
The paper acknowledges the challenge of emotional pronunciation differences across languages, which can lead to misclassification. Additionally, the reliance on a small number of labeled samples may limit the applicability of the method in more complex scenarios.
The proposed framework has significant implications for low-resource languages in emotional recognition tasks, potentially enhancing multilingual communication technologies and applications in areas such as mental health monitoring, customer service, and human-computer interaction.
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
Primary: University of California San Diego
All Institutions: University of California San Diego
The paper presents a comprehensive framework for real-time human-AI musical co-performance, utilizing latent diffusion models for generating instrumental accompaniment. The methodology effectively addresses the challenges of latency in generative models, and the results indicate strong potential for practical applications in live music settings.
The paper presents a novel framework for real-time human-AI musical co-performance utilizing latent diffusion models (LDMs) for generating instrumental accompaniment. The methodology is well-structured, combining a MAX/MSP front-end with a Python inference server, which is a significant step in bridging the gap between real-time audio processing and advanced AI models. The sliding-window look-ahead protocol is a clever approach to managing the inherent latency of generative models, allowing for continuous audio generation. The introduction of consistency distillation to reduce sampling time while maintaining audio quality is particularly innovative. However, the paper could benefit from a more detailed exploration of the implications of the look-ahead depth on musical coherence and generation quality.
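The latency/look-ahead trade-off can be made concrete with simple arithmetic. This parameterization (window duration, per-window sampling time, look-ahead depth) is a hypothetical reading of the sliding-window protocol, not the paper's formulation:

```python
def realtime_feasible(window_s, sample_time_s, lookahead_s):
    """A sliding-window generator is real-time capable when producing
    one window takes less wall-clock time than the window covers
    (real-time factor < 1); predicting `lookahead_s` into the future
    offsets part of the sampling delay, at the cost of generating from
    increasingly partial context."""
    rtf = sample_time_s / window_s          # real-time factor
    delay_s = max(0.0, sample_time_s - lookahead_s)
    return rtf < 1.0, rtf, delay_s
```

Under this reading, a 5.4x sampling speedup from consistency distillation either brings the real-time factor below 1 or lets the system use a shallower look-ahead (and thus fuller context) for the same effective delay.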
The experimental setup is robust, utilizing the Slakh2100 dataset and a clear methodology for evaluating musical coherence, beat alignment, and audio quality. The results demonstrate strong performance across various configurations, showcasing the effectiveness of the proposed models in both retrospective and look-ahead regimes. The use of objective metrics such as COCOLA and Beat F1 scores provides a solid foundation for assessing the models' performance. However, the paper lacks a detailed comparison of subjective evaluations alongside the objective metrics, which would enhance the understanding of the models' performance from a listener's perspective.
The authors have made significant efforts to ensure reproducibility by providing access to the model code, pre-trained checkpoints, and detailed descriptions of the experimental setup. The inclusion of GitHub repositories and a demo page further aids in this regard. However, the paper could improve by providing clearer instructions on the setup process for users who may not be familiar with the technologies used, such as MAX/MSP and the specific configurations for the Python inference server.
One limitation of the study is the reliance on a specific dataset (Slakh2100), which may not fully represent the diversity of musical styles and contexts that the system could encounter in real-world applications. Additionally, while the look-ahead mechanism is innovative, it introduces a trade-off between latency and generation quality that may not be fully addressed in the current framework. The paper also does not explore the potential for user customization or adaptation of the system for different musical genres or performance contexts.
The proposed framework has significant implications for the field of music technology and AI, as it opens up new avenues for real-time collaboration between human musicians and AI systems. This could lead to enhanced creative possibilities in live performance settings, potentially transforming how music is created and experienced. The integration of AI into live performance also raises questions about authorship and the role of technology in artistic expression, which could spark further research and discussion in the field.
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), in which masking and predictive modeling are performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. Predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss function, forcing the representations to encode both time and frequency traits. Experiments on a variety of speech and audio tasks illustrate that the ULTRAS framework achieves improved performance over other established baselines.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science
The main contribution of this paper is the introduction of the ULTRAS framework, which effectively integrates self-supervised learning techniques for joint modeling of audio and speech signals, showcasing significant improvements in performance across diverse tasks. This work represents a meaningful advancement in the field, addressing existing limitations in audio representation learning and providing a foundation for future research.
The proposed ULTRAS framework introduces a novel approach to self-supervised learning by integrating long-context masking and joint predictive modeling of both spectral and temporal targets. This methodology is a significant advancement over existing models, which typically focus on either temporal or spectral features separately. The use of transformer architecture to encode log-mel spectrograms, combined with a unique loss function that balances spectral and temporal predictions, showcases a well-thought-out design that addresses the limitations of previous models. The masking strategy, which operates over longer audio segments, is particularly innovative and is likely to enhance the model's ability to capture contextual information effectively.
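To make the combined spectral/temporal objective concrete, here is a minimal NumPy sketch of a masked-prediction loss with both terms; the equal weighting, MSE form, and the `combined_masked_loss` name are illustrative assumptions, not ULTRAS's actual formulation.

```python
import numpy as np

def combined_masked_loss(pred, target, mask, alpha=0.5):
    """Toy combined loss over masked spectrogram frames.

    pred, target: (time, freq) log-mel features.
    mask: (time,) boolean, True where frames were masked out.
    The temporal term reconstructs masked frames directly; the
    spectral term matches the average spectrum of the masked region,
    encouraging both time and frequency traits in the representation.
    """
    temporal = np.mean((pred[mask] - target[mask]) ** 2)
    spectral = np.mean(
        (pred[mask].mean(axis=0) - target[mask].mean(axis=0)) ** 2
    )
    return alpha * temporal + (1 - alpha) * spectral
```

In a real model the loss would be computed on predictor outputs rather than raw features, with the balance `alpha` tuned as a hyperparameter.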
The experiments conducted across a diverse set of speech and audio tasks demonstrate the robustness of the ULTRAS framework. The paper provides comprehensive evaluations using multiple datasets, including LibriSpeech and AudioSet, and compares the performance against established baselines. The results indicate that ULTRAS consistently outperforms these baselines, particularly in scenarios where both speech and audio tasks are involved. The inclusion of ablation studies further strengthens the findings by illustrating the contribution of each component of the proposed method.
The paper outlines the implementation details, including the pre-training and evaluation protocols, which are crucial for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to replicate the results independently. Clearer documentation or a supplementary repository would enhance reproducibility.
One limitation of the study is the reliance on a relatively small dataset for some experiments (200 hours), which may affect the generalizability of the results. Additionally, while the model shows improved performance, it is not clear how it scales with larger datasets or more complex tasks. The paper could also benefit from a more thorough discussion of potential biases in the datasets used.
The ULTRAS framework has the potential to significantly impact the fields of audio and speech processing by providing a unified approach that can be applied across various tasks. Its ability to learn robust representations from both speech and general audio signals could lead to advancements in applications such as automatic speech recognition, emotion recognition, and environmental sound classification. The implications of this work extend to improving the efficiency of training models in low-resource settings, thereby democratizing access to advanced audio processing technologies.
The human auditory system can selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas, such as background noise or distortion, dynamically adjusting its attention over time. Inspired by the recent success of attention models, this study introduces a dual-path attention module in the bottleneck layer of a concurrent speech enhancement network. Our study proposes an attention-based dual-path RNN (DAT-RNN), which, when combined with the modified complex-valued frequency transformation network (CFTNet), forms the DAT-CFTNet. This attention mechanism allows for precise differentiation between speech and noise in time-frequency (T-F) regions of spectrograms, optimizing the processing of both local and global context information in the CFTNet. Our experiments suggest that the DAT-CFTNet consistently improves on existing models, including CFTNet and DCCRN, in terms of speech intelligibility and quality. Moreover, the proposed model is especially effective at enhancing speech intelligibility for cochlear implant (CI) recipients, who are known to have severely limited T-F hearing restoration; intelligibility gains (e.g., >10%) in CI listener studies in noisy settings show that the proposed solution suppresses non-stationary noise while avoiding the musical artifacts often seen in traditional speech enhancement methods. The implementation of the proposed model will be publicly available.
Primary: Chittagong University of Engineering and Technology
All Institutions: Chittagong University of Engineering and Technology
The main contribution of this research is the introduction of the DAT-CFTNet, which effectively enhances speech intelligibility for cochlear implant users through an innovative dual-path attention mechanism. This work represents a significant step forward in speech enhancement technologies, particularly in challenging acoustic environments.
The proposed methodology introduces a novel dual-path attention mechanism integrated into a complex-valued frequency transformation network (CFTNet), which is a significant advancement in the field of speech enhancement, particularly for cochlear implant users. The combination of intra-chunk and inter-chunk RNNs with attention modules allows for enhanced modeling of speech and noise dynamics in time-frequency representations. The detailed architecture and the rationale behind the design choices are well articulated, showcasing a thoughtful approach to addressing the limitations of existing models.
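The intra-chunk/inter-chunk pattern described above can be sketched as follows; the chunking scheme and the pluggable `intra_fn`/`inter_fn` callables are illustrative simplifications, whereas the actual DAT-RNN uses learned RNNs with attention modules in each path.

```python
import numpy as np

def dual_path_pass(x, chunk, intra_fn, inter_fn):
    """Toy dual-path pass over a (time, feat) sequence.

    Segments x into fixed-size chunks, applies intra_fn within each
    chunk (local modeling), then inter_fn across chunks at each local
    position (global modeling), mirroring intra-/inter-chunk RNNs.
    """
    t, f = x.shape
    pad = (-t) % chunk
    x = np.pad(x, ((0, pad), (0, 0)))
    chunks = x.reshape(-1, chunk, f)  # (n_chunks, chunk, feat)
    # Intra-chunk path: model short-range structure inside each chunk.
    chunks = np.stack([intra_fn(c) for c in chunks])
    # Inter-chunk path: model long-range structure across chunks.
    chunks = np.stack(
        [inter_fn(chunks[:, i]) for i in range(chunk)], axis=1
    )
    return chunks.reshape(-1, f)[:t]
```

This factorization is what lets dual-path models cover long T-F contexts without processing the full sequence in one recurrent pass.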
The experiments are robust, employing a comprehensive dataset that includes various noise conditions and SNR levels. The evaluation metrics used (STOI, PESQ, SISDR) are appropriate for assessing speech intelligibility and quality. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and a discussion on the statistical significance of the results.
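Of the metrics mentioned, SI-SDR has a simple closed form worth recalling when interpreting the reported scores: project the estimate onto the reference signal and compare the target energy to the residual energy, in dB. A minimal implementation:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    Both signals are mean-centered; the estimate is projected onto the
    reference so that a pure rescaling of the estimate is not penalized.
    """
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (est @ ref) / (ref @ ref + eps) * ref
    e = est - s_target  # residual distortion
    return 10 * np.log10((s_target @ s_target + eps) / (e @ e + eps))
```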
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of a specific dataset and the architecture of the model, there are no code repositories or links to a demo that would allow other researchers to replicate the findings. Providing access to the model and training scripts would greatly enhance reproducibility.
One limitation is the reliance on objective metrics without a thorough subjective evaluation involving human listeners. While objective scores are important, subjective assessments are crucial for applications in speech enhancement, especially for cochlear implant users. Additionally, the model's complexity may limit its applicability in real-time scenarios, which is a critical factor for practical implementations.
The proposed DAT-CFTNet has the potential to significantly improve the quality of life for cochlear implant recipients by enhancing speech intelligibility in noisy environments. This advancement could lead to better communication and social interactions for individuals with hearing impairments. The public availability of the model also encourages further research and development in the field.
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
Primary: University of Amsterdam
All Institutions: University of Amsterdam, Georgia Institute of Technology, Halmstad University
This paper introduces a pioneering framework for automatic speaker drift detection in synthesized speech, leveraging cosine similarity and LLMs to enhance the coherence of TTS systems. The methodology is innovative, and the experimental results demonstrate substantial technical impact, making it a valuable contribution to the field of machine learning and speech synthesis.
The proposed methodology effectively addresses the issue of speaker drift in synthesized speech by formulating it as a binary classification task. The use of cosine similarity to assess speaker consistency is both innovative and theoretically justified, providing a solid foundation for the proposed framework. The integration of large language models (LLMs) for reasoning based on structured representations of similarity scores is a novel approach that bridges low-level acoustic features with high-level cognitive evaluation. The construction of a synthetic benchmark dataset with human-validated annotations is a significant contribution, allowing for systematic evaluation of the proposed method.
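The cosine-similarity stage of the pipeline can be sketched in a few lines; the fixed threshold and the choice of the first segment as the reference are simplifying assumptions here, since the paper feeds structured similarity representations to an LLM rather than thresholding directly.

```python
import numpy as np

def detect_drift(embeddings, threshold=0.7):
    """Toy drift check over per-segment speaker embeddings.

    embeddings: (n_segments, dim) array, one embedding per overlapping
    segment of the utterance. Flags drift when cosine similarity between
    the first segment and any later segment falls below threshold.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e[1:] @ e[0]  # cosine similarity to the reference segment
    return bool(np.any(sims < threshold)), sims
```

The similarity profile `sims`, rather than the binary flag, is the kind of structured signal the authors hand to the LLM for perceptual reasoning.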
The experiments conducted are robust, with a clear evaluation strategy involving multiple state-of-the-art LLMs. The results demonstrate the effectiveness of the proposed framework in detecting speaker drift, outperforming baseline methods. The use of F1 scores and accuracy as evaluation metrics is appropriate, and the ablation studies provide insights into the impact of different design choices on performance. However, the dataset size could be larger to enhance the generalizability of the findings.
The paper provides sufficient details regarding the methodology and experimental setup, allowing for reproducibility. However, the lack of a publicly accessible dataset or code repository limits the ease with which other researchers can replicate the results. Providing a demo or project URL would enhance reproducibility.
One limitation is the reliance on synthetic data, which may not fully capture the complexities of real-world speaker drift scenarios. Additionally, while the theoretical guarantees for cosine similarity are compelling, the practical implications of these guarantees in diverse acoustic environments remain to be explored. The dataset's size and diversity may also restrict the generalization of the findings.
The implications of this research are significant for applications in TTS systems, particularly in enhancing user experience in interactive and long-form speech applications. By addressing speaker drift, the framework can improve the coherence and naturalness of synthesized speech, which is crucial for virtual assistants, audiobooks, and other multimedia applications. The work also opens avenues for further research in speaker consistency and the integration of LLMs in audio processing tasks.
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025)-a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University, Fudan University, Wheatland Culture and Media Ltd.
The main contribution of this paper is the introduction of a controllable singing style conversion system that effectively mitigates style leakage and enhances dynamic rendering through innovative methodologies. This work significantly advances the state of the art in singing voice conversion, demonstrating high fidelity and naturalness even with limited training data, and sets a strong foundation for future research in this domain.
The paper introduces a novel approach to singing style conversion that effectively addresses style leakage and dynamic rendering issues through a boundary-aware semantic bottleneck and an explicit technique matrix. The methodology is well-structured, leveraging phoneme-level pooling to enhance control over the conversion process. The use of auxiliary models for high-frequency band completion is particularly innovative, allowing the authors to achieve high fidelity despite data limitations. The integration of targeted pitch processing during inference further enhances the system's performance, demonstrating a comprehensive understanding of the challenges in singing voice conversion.
The experimental setup is robust, with a clear description of the training and evaluation processes. The authors conducted subjective evaluations in the SVCC2025 challenge, achieving the best naturalness score among all submissions, which underscores the effectiveness of their approach. The ablation studies provide valuable insights into the contributions of various components, validating the importance of the boundary-aware pooling and technique matrix in reducing style leakage and improving controllability.
The paper provides sufficient details regarding the methodology and experimental setup, including the training stages and the use of specific models and datasets. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to replicate the experiments and build upon the findings.
While the paper presents significant advancements, it does not address the potential challenges of generalizing the model to out-of-domain singing styles or the limitations of the dataset used for training. Additionally, the reliance on specific phoneme boundaries and technique annotations may limit the model's applicability in more diverse or less structured datasets.
The advancements in controllable singing style conversion have implications for various applications, including music production, voice synthesis for entertainment, and personalized audio experiences. The techniques developed could also be adapted for other audio processing tasks, contributing to the broader field of generative audio systems.
In biometric systems, it is common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks have been recognized as potential security risks for biometric systems. However, most research on morph attacks has focused on biometric modalities that operate in the image domain, such as the face, fingerprints, and iris. In this work, we introduce Time-domain Voice Identity Morphing (TD-VIM), a novel approach for voice-based biometric morphing. This method enables the blending of voice characteristics from two distinct identities at the signal level, creating morphed samples that pose a serious vulnerability to speaker verification systems. Leveraging the Multilingual Audio-Visual Smartphone database, our study created four distinct morphed signals using different morphing factors and evaluated their effectiveness through a comprehensive vulnerability analysis. To assess the security impact of TD-VIM, we benchmarked our approach using the Generalized Morphing Attack Potential (G-MAP) metric, measuring attack success across two deep-learning-based Speaker Verification Systems (SVS) and one commercial system, Verispeak. Our findings indicate that the morphed voice samples achieved a high attack success rate, with G-MAP values reaching 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios, at a false match rate of 0.1%.
Primary: Indian Institute of Technology Kharagpur
All Institutions: Indian Institute of Technology Kharagpur, Norwegian University of Science and Technology (NTNU)
This paper introduces TD-VIM, a novel voice morphing technique that significantly enhances the vulnerability of speaker verification systems, thereby emphasizing the urgent need for improved security protocols in biometric applications. The comprehensive methodology and rigorous experimental validation contribute valuable insights to the field of biometric security and machine learning.
The proposed Time-Domain Voice Identity Morphing (TD-VIM) method innovatively operates at the signal level, allowing for morphing without reliance on feature embeddings or reference text, which addresses limitations found in previous methods. The methodology is well-structured, with clear steps for speaker selection, signal processing, and morphing, making it accessible for replication.
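As a rough illustration of signal-level morphing with a morphing factor, consider the sketch below; the linear blend and peak normalization are assumptions for clarity, and the actual TD-VIM pipeline includes speaker selection and signal-processing steps not shown here.

```python
import numpy as np

def morph_signals(x1, x2, alpha):
    """Toy time-domain morph of two waveforms.

    x1, x2: 1-D arrays assumed time-aligned; alpha in [0, 1] is the
    morphing factor weighting the first identity. The result is peak-
    normalized to avoid clipping when written back to audio.
    """
    n = min(len(x1), len(x2))
    m = alpha * x1[:n] + (1 - alpha) * x2[:n]
    peak = np.max(np.abs(m))
    return m / peak if peak > 0 else m
```

Sweeping `alpha` yields the family of morphed samples whose match rates against both contributing identities the G-MAP metric then summarizes.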
The experiments utilize a robust dataset (MAVS) and benchmark the TD-VIM against multiple speaker verification systems, demonstrating high attack success rates. The use of the Generalized Morphing Attack Potential (G-MAP) metric is a significant contribution, providing a comprehensive measure of vulnerability across different devices and languages.
The authors provide a GitHub repository for the source code and state that the morphed files and original dataset can be obtained upon request, promoting transparency and reproducibility.
The study does not address the potential ethical implications of morphing techniques in biometric security, nor does it explore the long-term effectiveness of the proposed method against evolving SVS technologies. Additionally, the reliance on specific datasets may limit generalizability.
The findings highlight significant vulnerabilities in voice biometric systems, particularly in sensitive applications such as banking and finance, raising awareness about the need for enhanced security measures in biometric verification systems.
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the severe error accumulation inherent to autoregressive models, leading to poor music quality and structural integrity. In this paper, we propose the Anchored Cyclic Generation (ACG) paradigm, which relies on anchor features extracted from already-generated music to guide subsequent generation during the autoregressive process, effectively mitigating error accumulation in autoregressive methods. Based on the ACG paradigm, we further propose the Hierarchical Anchored Cyclic Generation (Hi-ACG) framework, which employs a systematic global-to-local generation strategy and is highly compatible with our specifically designed piano token, an efficient musical representation. The experimental results demonstrate that, compared to traditional autoregressive models, the ACG paradigm reduces cosine distance between predicted feature vectors and ground-truth semantic vectors by an average of 34.7%. In long-sequence symbolic music generation tasks, the Hi-ACG framework significantly outperforms existing mainstream methods in both subjective and objective evaluations. Furthermore, the framework exhibits excellent task generalization, achieving superior performance in related tasks such as music completion.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to long-sequence symbolic music generation through the Anchored Cyclic Generation paradigm, demonstrating significant improvements in quality and structural integrity. The methodology is innovative and well-supported by experimental results, marking a meaningful contribution to the field of machine learning in music generation.
The paper introduces the Anchored Cyclic Generation (ACG) paradigm, which effectively addresses the error accumulation problem in autoregressive models for long-sequence symbolic music generation. The methodology is well-structured, employing a hierarchical approach through the Hi-ACG framework that combines global and local generation strategies. The use of a novel piano token representation enhances efficiency and interpretability. The proposed methods are theoretically sound, supported by mathematical analysis, and demonstrate a clear innovation in the field of music generation.
The experimental evaluation is robust, utilizing both objective and subjective metrics to assess the performance of the proposed models against established baselines. The datasets used (MuseScore and POP909) are appropriate for the task, and the results indicate significant improvements in generation quality, as evidenced by a 34.7% reduction in cosine distance between predicted and ground-truth features. The comprehensive evaluation strategy enhances the credibility of the findings.
The paper provides sufficient details regarding the experimental setup, including model architecture, training procedures, and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing these resources to facilitate validation of results.
The paper acknowledges limitations in fine-grained control during generation and the potential loss of subtle timing nuances in the piano token representation. Additionally, the focus on piano music may restrict the applicability of the framework to other musical contexts. Future research should address these limitations by integrating more expressive tokens and extending the framework to multi-track music generation.
The proposed ACG paradigm has the potential to significantly advance the field of symbolic music generation, offering new avenues for creating high-quality, structurally coherent music. Its principles could be adapted to other long-sequence generation tasks beyond music, such as text generation and structured content synthesis, thereby broadening its impact across various domains.
In Audio-Visual Navigation (AVN), agents must locate sound sources in unseen 3D environments using visual and auditory cues. However, existing methods often struggle to generalize to unseen scenarios, as they tend to overfit to semantic sound features and specific training environments. To address these challenges, we propose the Binaural Difference Attention with Action Transition Prediction (BDATP) framework, which jointly optimizes perception and policy. Specifically, the Binaural Difference Attention (BDA) module explicitly models interaural differences to enhance spatial orientation, reducing reliance on semantic categories. Simultaneously, the Action Transition Prediction (ATP) task introduces an auxiliary action prediction objective as a regularization term, mitigating environment-specific overfitting. Extensive experiments on the Replica and Matterport3D datasets demonstrate that BDATP can be seamlessly integrated into various mainstream baselines, yielding consistent and significant performance gains. Notably, our framework achieves state-of-the-art Success Rates across most settings, with a remarkable absolute improvement of up to 21.6 percentage points on the Replica dataset for unheard sounds. These results underscore BDATP's superior generalization capability and its robustness across diverse navigation architectures.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The paper presents a novel framework for enhancing generalization in Audio-Visual Navigation through innovative attention mechanisms and action prediction strategies. The technical contributions are significant, addressing key challenges in the field and demonstrating strong empirical results, though improvements in reproducibility and application scope could further enhance its impact.
The proposed BDATP framework introduces two innovative components: the Binaural Difference Attention (BDA) module, which enhances spatial audio perception by focusing on interaural differences, and the Action Transition Prediction (ATP) task, which regularizes policy learning to improve generalization across unseen environments. This dual approach effectively addresses the limitations of existing AVN methods, particularly their tendency to overfit to specific training conditions. The methodology is well-structured, with clear explanations of how each component contributes to the overall framework.
The experiments are comprehensive, utilizing two well-known datasets (Replica and Matterport3D) to evaluate the effectiveness of BDATP. The authors provide a thorough comparison against several state-of-the-art baselines, demonstrating significant performance improvements in both heard and unheard sound categories. The metrics used (Success Rate, Success weighted by Path Length, and Success weighted by Number of Actions) are appropriate for the task and provide a clear picture of the framework's capabilities.
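For reference when reading these metrics, Success weighted by Path Length (SPL) follows the standard formulation: the mean over episodes of S_i · l_i / max(p_i, l_i), where S_i is the success indicator, l_i the shortest-path length, and p_i the path actually taken.

```python
import numpy as np

def spl(successes, shortest, taken):
    """Success weighted by Path Length over a set of episodes.

    successes: 0/1 per episode; shortest: shortest-path lengths;
    taken: lengths of the paths the agent actually traversed.
    """
    successes = np.asarray(successes, dtype=float)
    shortest = np.asarray(shortest, dtype=float)
    taken = np.asarray(taken, dtype=float)
    return float(np.mean(successes * shortest / np.maximum(taken, shortest)))
```

Success weighted by Number of Actions is analogous, with action counts in place of path lengths.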
The paper lacks explicit details on the implementation, such as hyperparameters, training procedures, and code availability, which could hinder reproducibility. While the methodology is described in detail, providing access to the code and models would greatly enhance the ability of other researchers to replicate the results.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world environments. Additionally, while the proposed methods show strong performance in zero-shot settings, the paper does not address how the framework would perform in dynamic environments with moving sound sources or in multi-agent scenarios.
The BDATP framework has the potential to significantly advance the field of audio-visual navigation, particularly in applications involving robotics and autonomous systems. Its focus on generalization could lead to more robust navigation systems in real-world scenarios, enhancing the capabilities of embodied agents in complex environments.
We introduce Full-Duplex-Bench-v3 (FDB-v3), a benchmark for evaluating spoken language models under naturalistic speech conditions and multi-step tool use. Unlike prior work, our dataset consists entirely of real human audio annotated for five disfluency categories, paired with scenarios requiring chained API calls across four task domains. We evaluate six model configurations (GPT-Realtime, Gemini Live 2.5, Gemini Live 3.1, Grok, Ultravox v0.7, and a traditional Cascaded pipeline of Whisper→GPT-4o→TTS) across accuracy, latency, and turn-taking dimensions. GPT-Realtime leads on Pass@1 (0.600) and interruption avoidance (13.5%); Gemini Live 3.1 achieves the fastest latency (4.25 s) but the lowest turn-take rate (78.0%); and the Cascaded baseline, despite a perfect turn-take rate, incurs the highest latency (10.12 s). Across all systems, self-correction handling and multi-step reasoning under hard scenarios remain the most consistent failure modes.
Primary: unknown
All Institutions: unknown
The paper introduces Full-Duplex-Bench-v3, a benchmark for evaluating real-time voice agents on multi-step tool execution using natural human speech. This work significantly contributes to the field by addressing the challenges of disfluency handling and tool use in voice interactions, paving the way for more effective and responsive AI systems.
The methodology is robust, introducing a novel benchmark (FDB-v3) that evaluates spoken language models under realistic conditions, utilizing real human audio annotated for disfluencies. The design incorporates multi-step tool use across various domains, which is a significant advancement over previous benchmarks that relied on synthetic data or single-step tasks. The systematic approach to scenario formulation and audio collection enhances the validity of the evaluation.
The experiments are comprehensive, evaluating six different model configurations across multiple dimensions such as accuracy, latency, and turn-taking dynamics. The results are well-presented, showing clear performance differences among models and highlighting specific strengths and weaknesses, particularly in handling disfluencies and multi-step reasoning. The use of deterministic mock APIs for evaluation is a strong point, ensuring that the results are not confounded by external factors.
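The deterministic mock APIs credited above can be pictured as a fixed lookup from (tool, arguments) to a canned response, so that scoring a model's multi-step chain is not confounded by live-service variance. This is a hedged illustration of that design choice, not the benchmark's code; the tool names, arguments, and responses below are invented:

```python
# Deterministic mock tool layer: identical calls always return identical
# responses, so evaluation runs are exactly repeatable.
MOCK_RESPONSES = {
    ("search_flights", "SFO->NRT"): {"flight_id": "F123", "price": 850},
    ("book_flight", "F123"): {"confirmation": "C789"},
}

def call_tool(name: str, arg: str) -> dict:
    """Return the canned response for (name, arg), or a structured error."""
    try:
        return MOCK_RESPONSES[(name, arg)]
    except KeyError:
        return {"error": f"unknown call: {name}({arg})"}

# A correct two-step chain: search first, then book with the returned id.
step1 = call_tool("search_flights", "SFO->NRT")
step2 = call_tool("book_flight", step1["flight_id"])
print(step2)  # {'confirmation': 'C789'}
```

Because every response is fixed, a scorer can check the model's call sequence against a single expected trace rather than reasoning about nondeterministic API state.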
The paper provides sufficient detail regarding the experimental setup, including the models evaluated and the evaluation metrics used, and the benchmark itself is open, which supports reproduction of the evaluation protocol. However, the lack of released implementation details or code, together with the dependence on proprietary cloud-based models, means that full replication of the reported results may be challenging.
The study acknowledges limitations, such as the fixed server region for cloud-based evaluations and the lack of robustness testing against real-world network anomalies. Additionally, the dataset is relatively small (100 recordings), which may affect generalizability. The focus on specific disfluency categories may also overlook other potential challenges in real-world interactions.
This work has significant implications for the development of real-time voice agents, particularly in enhancing their ability to handle natural speech disfluencies and multi-step tasks. The findings suggest directions for future research, emphasizing the need for models that can balance speed and accuracy in dynamic conversational contexts. The benchmark itself could facilitate further advancements in the field by providing a standardized evaluation framework.
In this paper, we propose Universal Holistic Audio Generation (UniHAGen), a task for synthesizing comprehensive auditory scenes that include both on-screen and off-screen sounds across diverse domains (e.g., ambient events, musical instruments, and human speech). Prior video-conditioned audio generation models typically focus on producing on-screen environmental sounds that correspond to visible sounding events, neglecting off-screen auditory events. While recent holistic joint text-video-to-audio generation models aim to produce auditory scenes with both on- and off-screen sound, they are limited to non-speech sounds, lacking the ability to generate or integrate human speech. To overcome these limitations, we introduce OmniSonic, a flow-matching-based diffusion framework jointly conditioned on video and text. It features a TriAttn-DiT architecture that performs three cross-attention operations to process on-screen environmental sound, off-screen environmental sound, and speech conditions simultaneously, with a Mixture-of-Experts (MoE) gating mechanism that adaptively balances their contributions during generation. Furthermore, we construct UniHAGen-Bench, a new benchmark with over one thousand samples covering three representative on/off-screen speech-environment scenarios. Extensive experiments show that OmniSonic consistently outperforms state-of-the-art approaches on both objective metrics and human evaluations, establishing a strong baseline for universal and holistic audio generation. Project page: https://weiguopian.github.io/OmniSonic_webpage/
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of OmniSonic, a novel framework for generating comprehensive auditory scenes from video and text inputs, addressing previous limitations in audio generation models. This work significantly advances the field of audio synthesis by integrating multiple modalities and establishing a new benchmark for future research.
The proposed OmniSonic framework introduces a flow-matching-based diffusion model that effectively integrates video and text to generate comprehensive auditory scenes. The TriAttn-DiT architecture is a notable innovation, allowing simultaneous processing of on-screen environmental sounds, off-screen sounds, and speech conditions. The use of a Mixture-of-Experts (MoE) gating mechanism is a sophisticated approach that enhances the model's adaptability during audio generation. This methodology is well-structured and addresses the limitations of previous models, particularly in generating human speech alongside environmental sounds.
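The MoE gating described here can be pictured as softmax-normalized weights blending the outputs of the three cross-attention streams (on-screen sound, off-screen sound, speech). Below is a toy scalar sketch of that idea, not OmniSonic's implementation; in the actual model the gate scores would come from a learned network over tensor-valued features inside the diffusion transformer:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gate_mix(streams, gate_scores):
    """Blend per-stream outputs with softmax-normalized gate weights."""
    w = softmax(gate_scores)
    mixed = sum(wi * si for wi, si in zip(w, streams))
    return mixed, w

# Three hypothetical cross-attention outputs (scalars for illustration)
# and their gate scores; the largest score receives the largest weight.
mixed, weights = gate_mix(streams=[0.2, -0.5, 1.0], gate_scores=[2.0, 0.1, 1.0])
print(weights)  # weights sum to 1; the first stream dominates here
```

The adaptive part is that the gate scores change per input, so a speech-heavy clip can shift weight toward the speech condition while an ambient scene leans on the environmental streams.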
The authors present extensive experiments that demonstrate the superiority of OmniSonic over existing state-of-the-art methods. The creation of the UniHAGen-Bench benchmark, which includes over a thousand samples across diverse scenarios, is a significant contribution that facilitates fair evaluation and comparison in the field. The combination of objective metrics and human evaluations provides a robust assessment of the model's performance, although specific metrics used for evaluation could be elaborated further for clarity.
The paper provides a project page with a URL, but lacks detailed implementation specifics in the text that would enhance reproducibility. While the methodology is sound, the absence of code or detailed experimental setups may hinder other researchers from replicating the results.
One limitation is the lack of detailed discussion on the computational resources required for training the OmniSonic model, which could be a barrier for some researchers. Additionally, while the model excels in generating audio from video and text, its performance in more nuanced or complex auditory environments remains to be fully explored.
The ability to generate holistic audio from multimodal inputs has significant implications for various applications, including film and video production, virtual reality, and assistive technologies for the hearing impaired. The advancements in audio generation could lead to more immersive experiences in entertainment and education, making this research highly relevant to both academic and industry stakeholders.