Speech deepfake detection is a well-established research field with different models, datasets, and training strategies. However, the lack of standardized implementations and evaluation protocols limits reproducibility, benchmarking, and comparison across studies. In this work, we present DeepFense, a comprehensive, open-source PyTorch toolkit integrating the latest architectures, loss functions, and augmentation pipelines, alongside over 100 recipes. Using DeepFense, we conducted a large-scale evaluation of more than 400 models. Our findings reveal that while carefully curated training data improves cross-domain generalization, the choice of pre-trained front-end feature extractor dominates overall performance variance. Crucially, we show severe biases in high-performing models regarding audio quality, speaker gender, and language. DeepFense is expected to facilitate real-world deployment by providing the necessary tools for equitable training data selection and front-end fine-tuning.
Primary: German Research Center for Artificial Intelligence (DFKI)
All Institutions: German Research Center for Artificial Intelligence (DFKI), University of Stuttgart, National Institute of Informatics, Technical University of Berlin
The main contribution of this paper is the introduction of DeepFense, a comprehensive, modular, and extensible framework for robust deepfake audio detection that facilitates reproducible research and addresses critical biases in model performance. This work significantly advances the field by providing a standardized toolkit that enhances the ability to benchmark and compare deepfake detection models effectively.
The methodology presented in DeepFense is robust and well-structured, focusing on creating a modular and extensible framework for deepfake audio detection. The use of a configuration-driven design allows for easy experimentation and reproducibility, which is a significant advancement in the field. The evaluation of more than 400 models and the provision of over 100 recipes enhance the toolkit's utility for researchers. The modular architecture facilitates the isolation of algorithmic innovations from implementation artifacts, which is critical for accurate benchmarking.
The experimental evaluation is extensive, covering a large-scale comparison of 400 models across 13 datasets, which is a notable strength of the paper. The results provide valuable insights into the impact of front-end feature extractors, back-end architectures, and training datasets on model performance. The findings regarding biases in model performance based on audio quality, speaker gender, and language are particularly important for ensuring equitable AI systems.
The paper emphasizes reproducibility through its open-source nature and the provision of a comprehensive toolkit that allows other researchers to replicate experiments easily. The use of a single YAML file for experiment configuration is a strong point, as it simplifies the process of sharing and reproducing results.
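The single-YAML workflow described above might look something like the following sketch. All field names and values here are illustrative assumptions, not DeepFense's actual schema:

```yaml
# Hypothetical DeepFense-style recipe -- field names are invented for illustration
experiment: wav2vec2_aasist_baseline
frontend:
  name: wav2vec2-xls-r      # pre-trained feature extractor
  freeze: false              # fine-tune the front-end
backend:
  name: aasist
loss: oc-softmax
augmentation:
  - rawboost
  - rir_reverb
train_data: asvspoof2019_la
eval_data:
  - asvspoof2021_df
  - in_the_wild
```

The appeal of this design is that a single such file pins down the front-end, back-end, loss, augmentations, and data splits, so sharing the file is enough to reproduce the experiment.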
While the paper presents a significant advancement, it acknowledges limitations such as the lack of a multi-dataset training pipeline and the focus solely on detection tasks. These limitations suggest areas for future research, including the need for more comprehensive training strategies that can mitigate biases.
The implications of this work are substantial, particularly in the context of increasing concerns about deepfake technology and its potential misuse. By providing a standardized toolkit for deepfake detection, DeepFense can help improve the robustness of systems used in real-world applications, thereby enhancing security and trust in voice biometric systems.
Smart glasses are becoming an increasingly prevalent wearable platform, with audio as a key interaction modality. However, hearing in noisy environments remains challenging because smart glasses are equipped with open-ear speakers that do not seal the ear canal. Furthermore, the open-ear design is incompatible with conventional active noise cancellation (ANC) techniques, which rely on an error microphone inside or at the entrance of the ear canal to measure the residual sound heard after cancellation. Here we present the first real-time ANC system for open-ear smart glasses that suppresses environmental noise using only microphones and miniaturized open-ear speakers embedded in the glasses frame. Our low-latency computational pipeline estimates the noise at the ear from an array of eight microphones distributed around the glasses frame and generates an anti-noise signal in real time to cancel environmental noise. We develop a custom glasses prototype and evaluate it in a user study across 8 environments under mobility in the 100–1000 Hz frequency range, where environmental noise is concentrated. We achieve a mean noise reduction of 9.6 dB without any calibration, and 11.2 dB with a brief user-specific calibration.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Department of Electrical and Computer Engineering, Carl von Ossietzky Universität Oldenburg, Department of Medical Physics and Acoustics, Zhejiang University, College of Computer Science and Technology
This paper introduces a pioneering ANC system for open-ear smart glasses that operates without error microphones, demonstrating significant noise reduction capabilities in real-world settings. The innovative methodology and thorough experimental evaluation contribute meaningfully to the field of audio processing and wearable technology, paving the way for future advancements in auditory interfaces.
The paper presents a novel approach to active noise cancellation (ANC) specifically designed for open-ear smart glasses, which traditionally face challenges due to their non-occlusive design. The methodology leverages a dual-pipeline architecture that separates the estimation of noise propagation and the generation of anti-noise signals, utilizing a neural network for virtual in-ear sensing. This innovative approach circumvents the need for error microphones, which are typically required in ANC systems, by estimating the sound at the ear from an array of microphones distributed around the glasses frame. The use of a custom 3D-printed prototype and the integration of a low-latency DSP unit for real-time processing further enhance the practicality of the solution.
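For intuition about how anti-noise generation from reference microphones works, here is a minimal classical stand-in: a single-channel feedforward normalized-LMS loop. This is not the paper's neural pipeline (which replaces the error microphone with learned virtual in-ear sensing); the acoustic path below is an invented toy, and the secondary path is assumed to be identity.

```python
import numpy as np

# Toy feedforward ANC: adapt an FIR filter so the speaker's anti-noise
# cancels the noise arriving at the ear via an (assumed) acoustic path.
rng = np.random.default_rng(0)
n = 5000
reference = rng.standard_normal(n)             # noise picked up at the frame mics
primary_path = np.array([0.0, 0.5, 0.3])       # hypothetical path from source to ear
at_ear = np.convolve(reference, primary_path)[:n]

taps, mu = 8, 0.05
w = np.zeros(taps)                             # adaptive anti-noise filter
buf = np.zeros(taps)
residual = np.zeros(n)
for t in range(n):
    buf = np.roll(buf, 1)
    buf[0] = reference[t]
    anti = w @ buf                             # anti-noise emitted by the speaker
    residual[t] = at_ear[t] - anti             # what remains audible at the ear
    w += mu * residual[t] * buf / (buf @ buf + 1e-8)   # normalized LMS update

before = float(np.mean(at_ear[-1000:] ** 2))
after = float(np.mean(residual[-1000:] ** 2))
print(f"residual power reduced by {10 * np.log10(before / after):.1f} dB")
```

The hard part the paper solves is exactly what this sketch assumes away: without an error microphone, `residual` cannot be measured directly and must instead be predicted from the frame-mounted microphone array.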
The experimental evaluation is robust, encompassing controlled benchtop tests on a mannequin head and real-world user studies across various environments. The authors demonstrate effective noise reduction performance, achieving a mean reduction of 9.6 dB without calibration and 11.2 dB with user-specific calibration. The study involved 11 participants and assessed performance across 8 different environments, showcasing the system's adaptability to diverse acoustic conditions. The use of both objective metrics (e.g., noise reduction levels) and subjective user ratings (e.g., clarity and intrusiveness) strengthens the evaluation.
The paper provides detailed descriptions of the hardware setup, including the specifications of the microphones and DSP units used, as well as the neural network architecture. However, the lack of a publicly accessible demo or project URL limits reproducibility. The authors do mention the use of a calibration procedure, which could be a barrier for replication without access to the same hardware setup.
Key limitations include the system's reduced performance in outdoor environments due to wind noise, which the authors acknowledge as a significant challenge for open-ear designs. Additionally, the reliance on a brief calibration procedure may not be feasible for all users, particularly if the glasses shift during extended wear. The neural network's filter update rate of 200 ms could also hinder responsiveness to rapid changes in the acoustic environment.
The potential applications of this research extend beyond smart glasses to other open-ear wearables, such as augmented and virtual reality headsets. The ability to enhance audio clarity in noisy environments could significantly improve user experience in various contexts, including professional training and everyday use. The findings could also inform future developments in auditory interfaces and personalized hearing assistance technologies.
Multimodal sentiment analysis (MSA) aims to predict human sentiment from textual, acoustic, and visual information in videos. Recent studies improve multimodal fusion by modeling modality interaction and assigning different modality weights. However, they usually compress diverse sentiment cues into a single compact representation before sentiment reasoning. This early aggregation makes it difficult to preserve the internal structure of sentiment evidence, where different cues may complement, conflict with, or differ in reliability from each other. In addition, modality importance is often determined only once during fusion, so later reasoning cannot further adjust modality contributions. To address these issues, we propose PRISM, a framework that unifies structured affective extraction and adaptive modality evaluation. PRISM organizes multimodal evidence in a shared prototype space, which supports structured cross-modal comparison and adaptive fusion. It further applies dynamic modality reweighting during reasoning, allowing modality contributions to be continuously refined as semantic interactions become deeper. Experiments on three benchmark datasets show that PRISM outperforms representative baselines.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Zhongguancun Academy
The main contribution of this paper is the introduction of the PRISM framework, which effectively organizes multimodal sentiment evidence into structured prototypes, allowing for adaptive evaluation and dynamic reweighting of modality contributions. This approach significantly advances the state-of-the-art in multimodal sentiment analysis, providing a robust methodology that could influence future research and applications in the field.
The proposed PRISM framework innovatively addresses the limitations of existing multimodal sentiment analysis methods by introducing a shared sentiment prototype bank that organizes multimodal evidence into structured affective components. This design allows for adaptive modality evaluation and dynamic reweighting, enhancing the model's ability to capture nuanced sentiment cues across different modalities. The methodology is well-articulated, with clear explanations of how each component interacts within the framework, particularly the cross-attention mechanism and the dynamic modality reweighting process.
The experiments conducted on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and CH-SIMS) demonstrate the effectiveness of the PRISM framework, showing significant improvements over various baseline models. The use of ablation studies to assess the contribution of each component adds rigor to the evaluation, confirming the necessity of the proposed methods for achieving optimal performance. The results are compelling, with PRISM outperforming established approaches across multiple metrics.
The paper provides sufficient implementation details, including the architecture, training procedures, and hyperparameter settings, which enhances reproducibility. The availability of the code on GitHub further supports this aspect, allowing other researchers to replicate the experiments and validate the findings.
While the paper presents a strong framework, it does not extensively discuss potential limitations, such as the scalability of the model to larger datasets or its performance in real-world applications outside the benchmark settings. Additionally, the reliance on pre-extracted features may limit the model's adaptability to different input modalities or domains.
The PRISM framework has significant implications for fields such as affective computing, human-computer interaction, and content understanding, where accurate sentiment analysis is crucial. By improving multimodal sentiment analysis, this work could enhance applications in social media monitoring, customer feedback analysis, and interactive AI systems that require nuanced understanding of human emotions.
Spatial audio is fundamental to immersive virtual experiences, yet synthesizing high-fidelity binaural audio from sparse observations remains a significant challenge. Existing methods typically rely on implicit neural representations conditioned on visual priors, which often struggle to capture fine-grained acoustic structures. Inspired by 3D Gaussian Splatting (3DGS), we introduce AudioGS, a novel visual-free framework that explicitly encodes the sound field as a set of Audio Gaussians based on spectrograms. AudioGS associates each time-frequency bin with an Audio Gaussian equipped with dual Spherical Harmonic (SH) coefficients and a decay coefficient. For a target pose, we render binaural audio by evaluating the SH field to capture directionality, incorporating geometry-guided distance attenuation and phase correction, and reconstructing the waveform. Experiments on the Replay-NVAS dataset demonstrate that AudioGS successfully captures complex spatial cues and outperforms state-of-the-art visual-dependent baselines. Specifically, AudioGS reduces the magnitude reconstruction error (MAG) by over 14% and reduces the perceptual quality metric (DPAM) by approximately 25% compared to the best performing visual-guided method.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Institute of Cultural and Creative Industry
AudioGS presents a novel approach to binaural audio synthesis that effectively captures spatial cues without relying on visual data. The technical contributions, including the explicit modeling of the sound field and the integration of phase correction, represent a meaningful advancement in the field of audio processing and spatial audio synthesis.
The methodology presented in AudioGS is innovative, as it introduces a visual-free framework for synthesizing binaural audio using a set of Audio Gaussians derived from spectrograms. The use of dual Spherical Harmonic coefficients and a decay coefficient to model directional energy and distance attenuation is a significant advancement over existing methods that rely on visual priors. The explicit representation of the sound field allows for more interpretable modeling and captures complex spatial cues effectively. The integration of geometry-guided phase correction further enhances the realism of the synthesized audio, addressing limitations in phase alignment seen in previous methods.
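The directional-gain idea can be illustrated with first-order real spherical harmonics plus a simple distance falloff. The coefficient values, the first-order cutoff, and the `1/(1 + decay·r)` attenuation form are assumptions for illustration; AudioGS's actual SH degree and decay model may differ:

```python
import numpy as np

def sh_basis_order1(direction):
    """Real spherical harmonic basis up to order 1 for a unit direction."""
    x, y, z = direction / np.linalg.norm(direction)
    return np.array([
        0.5 * np.sqrt(1 / np.pi),          # Y_0^0 (omnidirectional)
        np.sqrt(3 / (4 * np.pi)) * y,      # Y_1^{-1}
        np.sqrt(3 / (4 * np.pi)) * z,      # Y_1^0
        np.sqrt(3 / (4 * np.pi)) * x,      # Y_1^1
    ])

def gaussian_gain(sh_coeffs, listener_dir, distance, decay=1.0):
    """Gain of one Audio Gaussian toward the listener, with distance attenuation."""
    directional = sh_basis_order1(listener_dir) @ sh_coeffs
    return directional / (1.0 + decay * distance)   # hypothetical attenuation law

coeffs = np.array([1.0, 0.0, 0.0, 0.8])    # hypothetical: brighter toward +x
front = gaussian_gain(coeffs, np.array([1.0, 0.0, 0.0]), distance=1.0)
back = gaussian_gain(coeffs, np.array([-1.0, 0.0, 0.0]), distance=1.0)
print(front, back)
```

Attaching one such gain (per ear, hence the dual SH coefficient sets) to every time-frequency bin is what lets the representation render pose-dependent binaural spectrograms.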
The experiments are well-structured, utilizing the Replay-NVAS dataset to evaluate the performance of AudioGS against state-of-the-art visual-dependent methods. The quantitative metrics reported, including MAG and DPAM, provide a robust assessment of audio quality and spatial accuracy. The results demonstrate a significant improvement over existing methods, validating the effectiveness of the proposed approach. The inclusion of subjective listening tests adds depth to the evaluation, confirming the objective metrics with human judgments.
The paper provides sufficient implementation details, including training setups, loss functions, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results.
One limitation of the study is its reliance on a specific dataset (Replay-NVAS), which may not generalize to all acoustic environments. Additionally, while the paper discusses future work on extending the framework to dynamic scenes, the current implementation is limited to static environments, which may restrict its applicability in real-world scenarios.
The potential applications of AudioGS are significant, particularly in immersive technologies such as VR, AR, and XR, where high-fidelity spatial audio is crucial for enhancing user experience. The framework could also benefit fields such as gaming, virtual conferencing, and education, where realistic audio environments are increasingly important.
Audio has rapidly become a primary interface for foundation models, powering real-time voice assistants. Ensuring safety in audio systems is inherently more complex than just "unsafe text spoken aloud": real-world risks can hinge on audio-native harmful sound events, speaker attributes (e.g., child voice), impersonation/voice-cloning misuse, and voice-content compositional harms, such as child voice plus sexual content. The nature of audio makes it challenging to develop comprehensive benchmarks or guardrails against this unique risk landscape. To close this gap, we conduct large-scale red teaming on audio systems, systematically uncover vulnerabilities in audio, and develop a comprehensive, policy-grounded audio risk taxonomy and AudioSafetyBench, the first policy-based audio safety benchmark across diverse threat models. AudioSafetyBench supports diverse languages, suspicious voices (e.g., celebrity/impersonation and child voice), risky voice-content combinations, and non-speech sound events. To defend against these threats, we propose AudioGuard, a unified guardrail consisting of 1) SoundGuard for waveform-level audio-native detection and 2) ContentGuard for policy-grounded semantic protection. Extensive experiments on AudioSafetyBench and four complementary benchmarks show that AudioGuard consistently improves guardrail accuracy over strong audio-LLM-based baselines with substantially lower latency.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
This paper presents a comprehensive framework for audio safety protection through a novel dual-path guardrail system that effectively addresses the unique risks associated with audio inputs and outputs. The technical contributions, including the development of a robust audio risk taxonomy and extensive experimental validation, position this work as a significant advancement in the field of audio machine learning.
The methodology presented in this paper is robust, combining a novel audio risk taxonomy with a dual-path guardrail system (SoundGuard and ContentGuard) that effectively addresses the unique challenges posed by audio safety. The approach is well-structured, leveraging large-scale red teaming to identify vulnerabilities and systematically developing a comprehensive benchmark (AudioSafetyBench) that accommodates diverse threat models. The modular design allows for flexibility and efficiency in deployment, which is a significant advancement in the field.
The experiments are extensive, demonstrating the effectiveness of AudioGuard across multiple benchmarks and showing significant improvements in accuracy and latency compared to existing audio-LLM-based guardrails. The evaluation metrics are well-defined, and the results provide a clear indication of the system's performance across various scenarios, including severe voice-content compositional risks and non-speech harmful sound events.
The paper provides sufficient details regarding the implementation of the models and the training processes, which enhances reproducibility. However, the absence of a publicly available demo or project URL limits the ability for others to replicate the findings directly.
One limitation is the potential dual-use concern regarding the insights gained from the red teaming and taxonomy, which could be exploited by malicious actors. Additionally, while the paper addresses audio-native risks effectively, it may not fully encompass all possible real-world scenarios, particularly as audio technology continues to evolve.
The work has the potential to significantly enhance the safety of audio-capable AI systems, reducing harmful outputs and improving detection of impersonation and child safety risks. The modular design and policy-grounded approach could lead to more transparent and effective safety measures in various applications, including voice assistants and TTS systems.
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, National Institute of Advanced Industrial Science and Technology (AIST)
The paper presents DialogueSidon, a novel model for recovering full-duplex dialogue tracks from degraded audio, significantly advancing the field of audio processing and dialogue systems. The combination of innovative methodology and strong experimental results positions this work as a meaningful contribution to the ongoing research in speech separation and restoration.
The proposed DialogueSidon model innovatively combines a variational autoencoder (VAE) with a diffusion-based latent predictor to address the dual challenges of restoring and separating degraded monaural two-speaker dialogue audio. This joint approach is well-motivated, as it leverages self-supervised learning (SSL) features to create a compact latent space, which is crucial for effective processing of complex audio mixtures. The methodology is robust, addressing the specific challenges posed by in-the-wild audio, including overlapping speech and background noise, and introduces auxiliary latent predictions to mitigate permutation ambiguity, showcasing a thoughtful design that enhances model performance.
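The permutation-ambiguity problem mentioned above is usually illustrated with permutation-invariant training (PIT): score every speaker assignment and train on the best one. The sketch below shows that standard recipe for intuition; it is not DialogueSidon's auxiliary-latent remedy, and the shapes are toy values:

```python
import numpy as np
from itertools import permutations

def pit_mse(pred, target):
    """pred, target: (n_speakers, latent_dim). Return (best loss, best assignment)."""
    n = pred.shape[0]
    best = None
    for perm in permutations(range(n)):
        loss = float(np.mean((pred[list(perm)] - target) ** 2))
        if best is None or loss < best[0]:
            best = (loss, perm)
    return best

rng = np.random.default_rng(0)
target = rng.standard_normal((2, 4))                    # two speaker-wise latents
pred_swapped = target[::-1] + 0.01 * rng.standard_normal((2, 4))
loss, perm = pit_mse(pred_swapped, target)
print(loss, perm)   # small loss, recovered via the swapping assignment (1, 0)
```

The point is that a prediction that is correct up to speaker ordering should not be penalized; whichever mechanism resolves the ambiguity, the model must be free to output the two tracks in either order.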
The experiments are comprehensive, utilizing multiple datasets (English, multilingual, and in-the-wild) to evaluate the model's performance across various conditions. The use of both objective metrics (WER, p-CER, speaker similarity) and subjective assessments (MOS) provides a well-rounded evaluation of the model's effectiveness. The results demonstrate significant improvements in intelligibility and separation quality over baseline methods, with a notable reduction in WER and high subjective quality ratings, indicating that the model not only performs well technically but is also preferred by human listeners.
The paper provides sufficient detail regarding the model architecture, training procedures, and evaluation metrics, which facilitates reproducibility. The authors mention the use of specific datasets and the training process on multiple GPUs, which is helpful for others looking to replicate the study. However, the lack of access to the datasets used for training could limit full reproducibility for some researchers.
One limitation is that the model is primarily evaluated on two-speaker dialogue, which may not generalize well to scenarios involving more speakers or different types of dialogue interactions. Additionally, while the model shows improved performance, the reliance on specific SSL features may limit its applicability to other audio domains or languages not covered in the training data.
The potential applications of DialogueSidon are significant, particularly in enhancing the quality of conversational AI systems and improving the accessibility of spoken dialogue data for research and development. By enabling the recovery of clean speaker-wise audio from in-the-wild recordings, this work could facilitate advancements in natural language processing, speech recognition, and human-computer interaction.
We propose a generative framework for multi-track music source separation (MSS) that reformulates the task as conditional discrete token generation. Unlike conventional approaches that directly estimate continuous signals in the time or frequency domain, our method combines a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model to autoregressively generate audio tokens for four target tracks. The generated tokens are decoded back to waveforms through the codec decoder. Evaluation on the MUSDB18-HQ benchmark shows that our generative approach achieves perceptual quality approaching state-of-the-art discriminative methods, while attaining the highest NISQA score on the vocals track. Ablation studies confirm the effectiveness of the learnable Conformer encoder and the benefit of sequential cross-track generation.
Primary: Qwen Applications Business Group of Alibaba
All Institutions: Qwen Applications Business Group of Alibaba
The paper presents a novel generative framework for multi-track music source separation that reformulates the task as autoregressive token prediction, achieving competitive performance against state-of-the-art methods. The methodology is innovative, and the results demonstrate significant potential for advancing the field of audio processing.
The proposed methodology introduces a novel generative framework for multi-track music source separation (MSS) that leverages a combination of a Conformer-based conditional encoder, a dual-path neural audio codec (HCodec), and a decoder-only language model. This approach effectively reformulates the MSS task into a discrete token generation problem, which is a significant departure from traditional continuous signal estimation methods. The use of residual vector quantization (RVQ) to represent target tracks as interleaved acoustic and semantic tokens is innovative and allows for autoregressive generation of multiple tracks in a single run, enhancing the model's ability to capture cross-track dependencies. The architecture is well-structured, and the integration of a language model for audio token generation is a promising direction that could influence future research in audio processing.
The experiments are conducted on the MUSDB18-HQ benchmark, which is a recognized dataset for evaluating music source separation methods. The authors provide comprehensive evaluations using perceptual metrics like ViSQOL, DNSMOS, and NISQA, demonstrating competitive performance against state-of-the-art discriminative methods. The results indicate that the proposed generative approach achieves perceptual quality comparable to existing methods, particularly excelling in vocal track separation. The ablation studies further validate the effectiveness of key components of the framework, such as the learnable Conformer encoder and the benefits of sequential cross-track generation.
The paper provides detailed implementation details, including model architecture, training configurations, and evaluation metrics, which are essential for reproducibility. However, the reliance on pseudo-labels generated by a baseline model (BS-RoFormer) raises concerns about the quality and reliability of the training data, which could affect reproducibility in practice. The authors do not provide a public code repository, which limits the ability for others to replicate the results directly.
The paper acknowledges several limitations, including challenges in separating percussive sources with sharp transients, which are difficult for the autoregressive generation paradigm. The reliance on pseudo-labels may introduce biases and limit the performance upper bound. Additionally, the dual-path codec architecture with multiple layers of RVQ can lead to cumulative errors, affecting the quality of the final output.
The proposed framework has significant implications for various applications in music technology, including music remixing, transcription, and karaoke generation. By advancing the state of the art in multi-track music source separation, this research could enhance user experiences in music production and accessibility for hearing-impaired individuals. The integration of language models into audio processing also opens avenues for further exploration in multimodal AI systems.
Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcription quality and question answering performance. In practice, stronger attacks often come at the cost of degraded utility. To study this trade-off, we revisit existing attacks by varying their perturbation coverage in the frequency domain, from partial-band to full-band, and find that broader frequency coverage does not necessarily improve jailbreak performance, while utility consistently deteriorates. This suggests that concentrating perturbation on a subset of bands can yield a better attack-utility trade-off than indiscriminate full-band coverage. Based on this insight, we propose GRM, a utility-aware frequency-selective jailbreak framework. It ranks Mel bands by their attack contribution relative to utility sensitivity, perturbs only a selected subset of bands, and learns a reusable universal perturbation under a semantic-preservation objective. Experiments on four representative ALLMs show that GRM achieves an average Jailbreak Success Rate (JSR) of 88.46% while providing a better attack-utility trade-off than representative baselines. These results highlight the potential of frequency-selective perturbation for better balancing attack effectiveness and utility preservation in audio jailbreak. Content Warning: This paper includes harmful query examples and unsafe model responses.
Primary: Sun Yat-Sen University
All Institutions: Sun Yat-Sen University
The main contribution of this paper is the introduction of GRM, a utility-aware frequency-selective jailbreak framework for audio LLMs, which balances attack effectiveness and utility preservation. This work represents a meaningful advancement in the understanding and mitigation of vulnerabilities in audio-based AI systems, highlighting the importance of considering utility in adversarial settings.
The proposed method, GRM (Gradient Ratio Masking), introduces a novel framework for audio jailbreak attacks that emphasizes a utility-aware approach. By focusing on frequency-selective perturbation, the authors demonstrate a systematic method for identifying key Mel frequency bands that contribute to jailbreak effectiveness while minimizing utility degradation. The dual-gradient scoring mechanism for band selection is a unique aspect that enhances the attack-utility trade-off. The methodology is well-structured, with clear definitions and a comprehensive explanation of the optimization process, making it a significant advancement in the field of audio adversarial attacks.
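No code accompanies the paper, but the dual-gradient band ranking described above can be sketched in a few lines. Everything below (function names, the per-band averaging over time frames) is a hypothetical illustration of the idea, not the authors' implementation:

```python
import numpy as np

def select_bands(attack_grad, utility_grad, k, eps=1e-8):
    """Rank Mel bands by attack contribution relative to utility
    sensitivity and return a binary mask over the top-k bands.

    attack_grad, utility_grad: (n_bands, n_frames) gradient magnitudes
    of the jailbreak loss and the utility (e.g. transcription) loss
    w.r.t. the Mel spectrogram.
    """
    # Per-band gradient energy, averaged over time.
    attack_score = np.abs(attack_grad).mean(axis=1)
    utility_score = np.abs(utility_grad).mean(axis=1)
    # High ratio = band helps the attack but barely hurts utility.
    ratio = attack_score / (utility_score + eps)
    top = np.argsort(ratio)[::-1][:k]
    mask = np.zeros(attack_grad.shape[0], dtype=bool)
    mask[top] = True
    return mask

# Toy example: 8 Mel bands, 4 time frames.
rng = np.random.default_rng(0)
a = rng.random((8, 4))
u = rng.random((8, 4))
mask = select_bands(a, u, k=3)
print(mask.sum())  # 3 bands selected
```

A universal perturbation would then be optimized only inside the masked bands, leaving the remaining Mel bands untouched, which is how the attack-utility trade-off described above is realized.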
The experiments are robust, involving multiple representative ALLMs and a thorough evaluation of the attack's effectiveness through metrics such as Jailbreak Success Rate (JSR), Word Error Rate (WER), and Response Quality Score (RQS). The results indicate that GRM achieves a high average JSR of 88.46% while maintaining better utility preservation compared to existing baselines. The ablation studies provide insight into the contributions of different components of the GRM framework, further validating the effectiveness of the proposed method.
The paper provides detailed implementation details, including the experimental setup, datasets, model configurations, and evaluation protocols. However, the lack of a publicly accessible code repository or demo URL limits the reproducibility of the results. The authors mention using specific models and datasets, which could be challenging for other researchers to replicate without access to the same resources.
The study acknowledges several limitations, including potential bias in the evaluation metrics due to reliance on LLM-based judges and the lack of validation in real-world environments. Additionally, the method's effectiveness is primarily model-specific, with limited cross-model transferability observed. These factors may restrict the generalizability of the findings.
The implications of this research are significant, particularly in the context of audio security and the safety of audio large language models (ALLMs). By improving the attack-utility trade-off, the findings could inform the development of more robust defenses against audio jailbreak attacks. The methodology could also be applied to enhance the safety and reliability of audio-based AI systems in various applications, including voice assistants and interactive systems.
Audio super-resolution aims to recover missing high-frequency details from bandwidth-limited low-resolution audio, thereby improving the naturalness and perceptual quality of the reconstructed signal. However, most existing methods directly operate in the waveform or time-frequency domain, which not only involves high-dimensional generation spaces but is also largely limited to speech tasks, leaving substantial room for improvement on more complex audio types such as sound effects and music. To mitigate these limitations, we introduce LatentFlowSR, a new audio super-resolution approach that leverages conditional flow matching (CFM) within a latent representation space. Specifically, we first train a noise-robust autoencoder, which encodes low-resolution audio into a continuous latent space. Conditioned on the low-resolution latent representation, a CFM mechanism progressively generates the corresponding high-resolution latent representation from a Gaussian prior with a one-step ordinary differential equation (ODE) solver. The resulting high-resolution latent representation is then decoded by the pretrained autoencoder to reconstruct the high-resolution audio. Experimental results demonstrate that LatentFlowSR consistently outperforms baseline methods across various audio types and super-resolution settings. These results indicate that the proposed method possesses strong high-frequency reconstruction capability and robust generalization performance, providing compelling evidence for the effectiveness of latent-space modeling in audio super-resolution. All relevant code will be made publicly available upon completion of the paper review process.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China
The main contribution of this paper is the introduction of LatentFlowSR, an innovative audio super-resolution method that effectively utilizes latent-space modeling and conditional flow matching to achieve high-fidelity audio reconstruction. This work represents a significant advancement in the field of audio processing, particularly in its ability to handle diverse audio types and improve computational efficiency.
The methodology proposed in LatentFlowSR is innovative, leveraging a noise-robust autoencoder to encode low-resolution audio into a latent space, followed by a conditional flow matching (CFM) mechanism to generate high-resolution audio. The use of an ODE solver for generating high-resolution latent representations is a novel approach that enhances computational efficiency. The architecture incorporates a U-Net style for estimating the velocity field, which is well-suited for capturing both local and global audio features. The integration of noise robustness during training further strengthens the model's performance in real-world applications.
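The one-step ODE sampling that makes CFM inference cheap reduces to a single Euler step from the Gaussian prior. The sketch below is a generic illustration of that step; the toy velocity field and all names are assumptions, not the LatentFlowSR model (which estimates the velocity with a U-Net conditioned on the low-resolution latent):

```python
import numpy as np

def one_step_cfm_sample(velocity_fn, cond, shape, rng):
    """One-step Euler integration of the probability-flow ODE
    dx/dt = v(x, t, cond) from t=0 (Gaussian prior) to t=1."""
    x0 = rng.standard_normal(shape)   # sample from the prior
    v = velocity_fn(x0, 0.0, cond)    # learned velocity field
    return x0 + 1.0 * v               # single Euler step, dt = 1

# Toy velocity field that pulls samples toward the conditioning latent
# (exact for a linear interpolation path).
def toy_velocity(x, t, cond):
    return cond - x

rng = np.random.default_rng(0)
cond = np.ones((4, 16))               # stand-in low-res latent
x1 = one_step_cfm_sample(toy_velocity, cond, (4, 16), rng)
print(np.allclose(x1, cond))          # True: one step lands on the target
```

In the real system the output latent would be passed to the pretrained decoder to reconstruct the high-resolution waveform.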
The experimental evaluation is comprehensive, utilizing diverse datasets that include speech, sound effects, and music, which allows for a thorough assessment of the model's capabilities. The results demonstrate significant improvements over baseline methods in both objective metrics (LSD, LSD-HF, ViSQOL) and subjective evaluations (MOS), indicating strong performance across various audio types and degradation levels. The paper also includes a detailed analysis of computational complexity, showcasing the efficiency of the LatentFlowSR model.
The paper outlines the implementation details and training strategies clearly, which aids in reproducibility. The authors mention the use of specific datasets, training steps, and optimization techniques, which are essential for others to replicate their work. However, the lack of a public code repository at the time of review may hinder immediate reproducibility.
One limitation is the reliance on a specific architecture and training strategy, which may not generalize well to all audio super-resolution tasks. Additionally, while the model shows strong performance on the tested datasets, further validation on more diverse and challenging datasets would be beneficial to fully assess its generalization capabilities.
The proposed LatentFlowSR has significant implications for various applications in audio processing, including speech synthesis, music restoration, and sound effects enhancement. Its ability to recover high-frequency details can improve the quality of audio in consumer products, entertainment, and communication technologies. The methodology could also inspire further research into latent-space modeling in other domains of machine learning.
Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address the above limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio captioning tasks, including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit the same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.
Primary: Japan Advanced Institute of Science and Technology
All Institutions: Japan Advanced Institute of Science and Technology
This paper presents a novel method for mitigating hallucinations in auditory large language models through the innovative use of noise as contextual guidance. The comprehensive methodology and robust experimental results indicate a meaningful contribution to the field of audio understanding and generative modeling, addressing a critical challenge in the deployment of ALLMs.
The proposed Noise-Aware In-Context Learning (NAICL) method introduces an innovative approach to mitigate hallucinations in Auditory Large Language Models (ALLMs) by utilizing a structured noise prior library. This method effectively guides the model to adopt more conservative outputs when acoustic evidence is insufficient, which is a significant departure from traditional fine-tuning approaches that are computationally expensive. The methodology is well-structured, involving a detailed process of dataset filtering, noise retrieval, and contextual integration, which enhances the interpretability and effectiveness of the model.
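The retrieval step can be illustrated as a cosine-similarity lookup over a toy noise library; the retrieved entries would then be prepended to the prompt as contextual priors. The 3-d embeddings and text entries below are purely illustrative (the paper uses BEATs audio embeddings):

```python
import numpy as np

def retrieve_noise_priors(query_emb, library_embs, library_texts, k=2):
    """Return the k noise-library entries most similar to the input
    audio embedding (cosine similarity), to serve as in-context priors."""
    q = query_emb / np.linalg.norm(query_emb)
    lib = library_embs / np.linalg.norm(library_embs, axis=1, keepdims=True)
    sims = lib @ q
    top = np.argsort(sims)[::-1][:k]
    return [library_texts[i] for i in top]

# Toy library of noise descriptions with 3-d embeddings.
embs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.9, 0.1, 0.0]])
texts = ["wind noise", "crowd murmur", "traffic hum"]
query = np.array([1.0, 0.05, 0.0])
print(retrieve_noise_priors(query, embs, texts, k=2))
```

The two nearest entries ("wind noise" and "traffic hum" in this toy case) would be injected into the context so the model can attribute ambiguous acoustic evidence to noise rather than hallucinating events.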
The experiments are comprehensive, utilizing the newly constructed Clotho-1K benchmark to evaluate the performance of various ALLMs. The results demonstrate a significant reduction in hallucination rates, providing strong empirical support for the effectiveness of the NAICL method. The inclusion of an ablation study further strengthens the findings by analyzing different configurations and their impacts on performance, showcasing the robustness of the approach.
The paper provides sufficient implementation details, including the use of a specific acoustic encoder (BEATs) and clear descriptions of the retrieval process. The availability of code on GitHub enhances reproducibility, although further details on hyperparameter settings and training procedures would be beneficial for complete replication.
One limitation is the reliance on the Clotho dataset, which may not encompass the full diversity of real-world audio scenarios, potentially limiting the generalizability of the findings. Additionally, while the method shows promise in reducing hallucinations, it may introduce its own biases by overly constraining the model's outputs in uncertain contexts.
The implications of this research are significant for the development of reliable audio understanding systems, particularly in applications requiring high accuracy in audio captioning and interpretation. By addressing hallucination issues, this work could enhance the deployment of ALLMs in real-world scenarios, such as assistive technologies and automated content generation, thereby improving user trust and system performance.
Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned projector that maps encoder features into the LLM embedding space, whereas an alternative is to expose discrete phoneme sequences to the LLM. Using the same encoder and LLM backbones, we compare phoneme-based and vanilla projector-based interfaces in high-resource English and low-resource Tatar. We also propose a BPE-phoneme interface that groups frequent local phoneme patterns while preserving explicit word-boundary cues for phoneme-to-grapheme generation. On LibriSpeech, the phoneme-based interface is competitive with the vanilla projector, and the BPE-phoneme interface yields further gains. On Tatar, the phoneme-based interface substantially outperforms the vanilla projector. We further find that phoneme supervision yields a phoneme-informed hybrid interface that is stronger than the vanilla projector.
Primary: Tsinghua University
All Institutions: Tsinghua University, TasiTech Co., Ltd., Xinjiang University
The main contribution of this paper is the comparative analysis of phoneme-based and projector-based interfaces for LLM-integrated ASR, revealing that phoneme-based approaches can significantly enhance performance, especially in low-resource settings. This work advances the understanding of speech-language interfaces and provides a foundation for future innovations in ASR systems.
The paper presents a systematic comparison of two speech-language interfaces—projector-based and phoneme-based—for integrating large language models (LLMs) with automatic speech recognition (ASR). The methodology is robust, utilizing controlled backbones and extensive experiments across high-resource (LibriSpeech) and low-resource (Tatar) settings. The introduction of a BPE-phoneme interface is particularly innovative, as it combines phoneme sequences with boundary-awareness, enhancing the model's performance. The two-stage training process for both interfaces is well-defined and leverages phoneme supervision effectively.
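A single BPE merge over phoneme sequences, constrained so that merges never cross the word-boundary marker, conveys the core of the boundary-aware interface. This is a from-scratch sketch, not the paper's tokenizer; the `|` marker and ARPAbet-style symbols are assumptions:

```python
from collections import Counter

def bpe_merge_step(seqs, boundary="|"):
    """One BPE merge over phoneme sequences: fuse the most frequent
    adjacent pair, never merging across the word-boundary marker."""
    pairs = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            if boundary not in (a, b):
                pairs[(a, b)] += 1
    if not pairs:
        return seqs, None
    best = pairs.most_common(1)[0][0]
    merged = []
    for seq in seqs:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best

# Two toy utterances as phonemes, with "|" marking word boundaries.
seqs = [["DH", "AH", "|", "K", "AE", "T"],
        ["DH", "AH", "|", "D", "AO", "G"]]
seqs, pair = bpe_merge_step(seqs)
print(pair)     # ('DH', 'AH') occurs in both utterances
print(seqs[0])  # ['DHAH', '|', 'K', 'AE', 'T']
```

Iterating this step builds a vocabulary of frequent local phoneme patterns while the preserved boundaries keep the word-level cues needed for phoneme-to-grapheme generation.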
The experiments are comprehensive, covering a variety of configurations and datasets. The results demonstrate that the phoneme-based interface significantly outperforms the vanilla projector, especially in low-resource scenarios, which is a critical finding for the field. The paper provides clear metrics (Word Error Rate - WER) and contextualizes results against recent baselines, showcasing the competitive performance of the proposed methods.
The paper outlines the experimental setup, including details on datasets, model architectures, and training procedures. However, the absence of a public repository or demo limits reproducibility. The authors mention the use of specific models and configurations but do not provide code or data access, which is essential for full reproducibility in machine learning research.
One limitation is the lack of exploration into the scalability of the proposed interfaces beyond the tested languages and datasets. Additionally, while the BPE-phoneme interface shows promise, its effectiveness in other languages or dialects has not been evaluated. The paper also does not address potential biases in the datasets used, which could impact generalizability.
The findings have significant implications for ASR systems, particularly in low-resource languages, where traditional methods often struggle. The insights gained from this study could inform future research and development in multilingual ASR, potentially leading to more inclusive and accessible speech technologies.
We present AccompGen, a system that generates instrumental music audio to accompany input vocals. Given an isolated singing voice, AccompGen produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization.
Primary: Zhejiang Lab
All Institutions: Zhejiang Lab, University of Science and Technology of China, Huawei Technologies Co., Ltd.
AccompGen presents a novel approach to vocal accompaniment generation through a hierarchical autoregressive model that leverages dual-rate codec tokenization and modern Transformer techniques. This work significantly advances the field of audio generation by providing a robust framework for creating high-quality instrumental music that complements vocal performances.
The methodology presented in AccompGen is innovative, particularly with its dual-rate codec tokenization and hierarchical autoregressive architecture. The use of HuBERT and EnCodec for semantic and acoustic tokenization respectively allows for a nuanced representation of both vocals and instrumentals, which is critical for generating coherent music. The three-stage approach effectively decomposes the generation task, allowing for more controlled and precise outputs. The incorporation of modern Transformer design choices enhances the model's performance and stability, making it a significant advancement over previous methods.
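Classifier-free guidance, one of the listed components, combines conditional and unconditional predictions at sampling time. The guidance formula below is the standard one; the variable names are illustrative, not taken from the paper:

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance: extrapolate the conditional logits
    away from the unconditional ones by `scale`."""
    return uncond_logits + scale * (cond_logits - uncond_logits)

cond = np.array([2.0, 0.0, -1.0])     # logits given the vocal context
uncond = np.array([1.0, 0.5, 0.0])    # logits with the context dropped
print(cfg_logits(cond, uncond, 1.0))  # scale 1 recovers the cond logits
print(cfg_logits(cond, uncond, 3.0))  # stronger adherence to the vocals
```

During training, the conditioning (the vocal token stream) is randomly dropped so the same model can produce both sets of logits; larger scales trade diversity for tighter alignment with the input vocals.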
The experiments conducted on the MUSDB18 dataset demonstrate the effectiveness of AccompGen, achieving a Fréchet Audio Distance (FAD) score that matches state-of-the-art systems while using significantly fewer parameters. This is a strong indicator of the model's efficiency and effectiveness. However, the paper could benefit from more extensive qualitative evaluations, such as user studies or subjective listening tests, to complement the objective metrics provided.
The paper provides a detailed description of the model architecture, training configurations, and data preprocessing steps, which aids in reproducibility. However, the lack of a publicly available code repository or demo limits the ability for other researchers to replicate the results directly. Providing access to the trained models or code would greatly enhance reproducibility.
One limitation is the reliance on specific datasets (MUSDB18 and FMA-Large) for training and evaluation, which may not fully represent the diversity of vocal styles and musical genres. Additionally, the model's performance in real-world applications, where vocal inputs may vary significantly in quality and style, remains untested. The paper also does not address potential biases in the training data that could affect the model's outputs.
The ability to generate instrumental accompaniments from vocal inputs has significant implications for music creation, democratizing music production and enabling non-musicians to create personalized music. This technology could be applied in various fields, including music education, entertainment, and therapy. However, ethical considerations regarding copyright and the potential for misuse in generating music that mimics existing artists should be addressed.
Abusive speech detection is becoming increasingly important as social media shifts towards voice-based interaction, particularly in multilingual and low-resource settings. Most current systems rely on automatic speech recognition (ASR) followed by text-based hate speech classification, but this pipeline is vulnerable to transcription errors and discards prosodic information carried in speech. We investigate whether Contrastive Language-Audio Pre-training (CLAP) can support abusive speech detection directly from audio. Using the ADIMA dataset, we evaluate CLAP-based representations under few-shot supervised contrastive adaptation in cross-lingual and leave-one-language-out settings, with zero-shot prompting included as an auxiliary analysis. Our results show that CLAP yields strong cross-lingual audio representations across ten Indic languages, and that lightweight projection-only adaptation achieves competitive performance with respect to fully supervised systems trained on complete training data. However, the benefits of few-shot adaptation are language-dependent and not monotonic with shot size. These findings suggest that contrastive audio-text models provide a promising basis for cross-lingual audio abuse detection in low-resource settings, while also indicating that transfer remains incomplete and language-specific in important ways.
Primary: Télécom SudParis
All Institutions: Télécom SudParis, Polytechnique de Paris
This paper presents a significant advancement in the field of abusive speech detection by proposing a novel approach that leverages few-shot learning and contrastive audio-text representations. The methodology and results contribute valuable insights into the challenges of detecting abusive speech in low-resource languages, highlighting the potential for effective cross-lingual transfer and adaptation.
The methodology presented in this paper is innovative in its application of Contrastive Language-Audio Pre-training (CLAP) for abusive speech detection directly from audio, bypassing traditional ASR pipelines. The authors effectively leverage few-shot learning techniques to adapt the model to low-resource languages, which is crucial given the linguistic diversity in the target application. The use of projection-only adaptation versus projection+fine-tuning is a thoughtful approach that allows for exploration of trade-offs in performance and computational efficiency. The research questions are well-defined, guiding the exploration of the model's capabilities across various languages and adaptation scenarios.
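Projection-only adaptation trains a small head on frozen CLAP embeddings with a supervised contrastive objective. The loss itself can be sketched as follows; this is a generic SupCon-style formulation over already-projected embeddings, not the authors' exact code:

```python
import numpy as np

def sup_con_loss(z, labels, tau=0.1):
    """Supervised contrastive loss over L2-normalised projected
    embeddings z (n, d): same-label pairs are pulled together."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # exclude self-pairs
    # Row-wise log-softmax over the remaining similarities.
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss = 0.0
    for i in range(len(z)):
        pos = (labels == labels[i]) & (np.arange(len(z)) != i)
        if pos.any():
            loss += -logp[i, pos].mean()
    return loss / len(z)

rng = np.random.default_rng(0)
z = rng.standard_normal((6, 8))          # stand-in projected embeddings
labels = np.array([0, 0, 0, 1, 1, 1])    # abusive vs non-abusive
print(sup_con_loss(z, labels) > 0)       # True: loss is positive
```

In the few-shot setting described above, only the projection producing `z` would receive gradients; the CLAP encoder stays frozen, which keeps the adaptation lightweight.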
The experiments are comprehensive, utilizing the ADIMA dataset to evaluate the proposed methods across multiple languages and settings. The authors provide detailed results for both few-shot and zero-shot conditions, with a clear focus on macro-F1 scores as the primary evaluation metric. The analysis of language-specific performance and the leave-one-language-out (LOLO) approach adds depth to the evaluation, revealing insights into the model's transferability and robustness across different linguistic contexts. However, the reliance on a single dataset may limit the generalizability of the findings.
The paper includes sufficient implementation details, such as the use of PyTorch and HuggingFace for model training, as well as fixed random seeds for reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which other researchers can replicate the results. Including such resources would enhance the paper's impact and facilitate further exploration of the proposed methods.
The authors acknowledge several limitations, including the focus on a single dataset, which may not capture the full diversity of abusive speech across different contexts and languages. Additionally, the few-shot results may be sensitive to support-set composition and optimization choices, particularly in very low-shot regimes. The paper also lacks a dedicated fairness analysis, which is critical given the cultural sensitivity surrounding abusive language.
The implications of this research are significant, particularly in the context of moderating abusive speech in multilingual and low-resource settings. By developing a model that can effectively detect abusive speech directly from audio, the work addresses a pressing societal need in the age of voice-based social media. The findings could inform the development of more robust moderation tools that respect linguistic diversity and cultural nuances.
Detecting video deepfakes has become increasingly urgent in recent years. Given the audio-visual information in videos, existing methods typically expose deepfakes by modeling cross-modal correspondence using specifically designed architectures with publicly available datasets. While they have shown promising results, their effectiveness often degrades in real-world scenarios, as the limited diversity of training datasets naturally restricts generalizability to unseen cases. To address this, we propose a simple yet effective method, called AVPF, which can notably enhance model generalizability by training with self-generated Audio-Visual Pseudo-Fakes. The key idea of AVPF is to create pseudo-fake training samples that contain diverse audio-visual correspondence patterns commonly observed in real-world deepfakes. We highlight that AVPF is generated solely from authentic samples, and training relies only on authentic data and AVPF, without requiring any real deepfakes. Extensive experiments on multiple standard datasets demonstrate the strong generalizability of the proposed method, achieving an average performance improvement of up to 7.4%.
Primary: Ocean University of China
All Institutions: Ocean University of China
The main contribution of this paper is the introduction of a novel method for generating audio-visual pseudo-fakes that enhances the generalizability of video deepfake detection models. This work represents a meaningful advancement in the field, addressing a critical challenge in deepfake detection by leveraging authentic data to create diverse training samples that better reflect real-world scenarios.
The proposed methodology introduces a novel self-generated Audio-Visual Pseudo-Fake (AVPF) strategy that enhances the generalizability of video deepfake detection by simulating both inter- and intra-modality inconsistencies. The two key strategies, Audio-Visual Self-Blending (AVSB) and Audio-Visual Self-Splicing (AVSS), are well-conceived, leveraging authentic data to create pseudo-fake samples that reflect real-world deepfake characteristics. The approach is straightforward yet effective, relying solely on authentic samples, which is a significant departure from existing methods that require real deepfake data. The methodology is clearly articulated, with detailed descriptions of the processes involved in generating pseudo-fakes, making it replicable for future research.
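The self-splicing idea, building a pseudo-fake from two authentic clips by swapping a temporal segment of one modality, can be sketched on raw arrays. This is a schematic illustration of the AVSS strategy; the function name and the frame-level labeling are assumptions, not details from the paper:

```python
import numpy as np

def self_splice(audio_a, audio_b, start, length):
    """Create a pseudo-fake by splicing a segment of clip B's audio
    into clip A's audio, breaking audio-visual correspondence in that
    window while both source clips remain authentic."""
    fake = audio_a.copy()
    fake[start:start + length] = audio_b[start:start + length]
    # Frame-level labels: 1 where correspondence is broken.
    labels = np.zeros(len(audio_a), dtype=int)
    labels[start:start + length] = 1
    return fake, labels

a = np.zeros(10)   # stand-in for clip A's audio frames
b = np.ones(10)    # stand-in for clip B's audio frames
fake, labels = self_splice(a, b, start=3, length=4)
print(fake)          # zeros with ones in positions 3..6
print(labels.sum())  # 4 mismatched frames
```

Because both inputs are real recordings, the detector trained on such samples learns correspondence inconsistencies rather than generator-specific artifacts, which is what drives the generalization gains reported above.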
The experiments are extensive, covering multiple standard datasets, including FakeAVCeleb, AV-Deepfake1M, AVLips, and TalkingHeadBench. The paper reports an average performance improvement of up to 7.4%, which is a significant enhancement over existing methods. The results are well-presented, with comparisons to state-of-the-art methods, demonstrating the effectiveness of the proposed approach. The ablation studies provide valuable insights into the contributions of each component of the methodology, reinforcing the robustness of the findings.
The paper provides sufficient implementation details, including the use of specific datasets, the architecture of the model, and the training parameters. However, the lack of a publicly available code repository or demo URL limits reproducibility. Future work should consider sharing code and datasets to facilitate validation of the results by the research community.
One limitation noted is that while the method improves detection generalizability, it does not address the localization of forgery within videos. This could be a significant drawback for practical applications where identifying the specific manipulated segments is crucial. Additionally, the reliance on authentic samples may still limit the diversity of training data, as the method may not cover all possible deepfake manipulation techniques.
The proposed method has significant implications for the field of multimedia forensics and security, particularly in combating the growing threat of deepfake technology. By improving detection capabilities, the research contributes to safeguarding authenticity in various domains, including journalism, social media, and legal contexts. The approach also opens avenues for further research into multi-modal deepfake detection strategies.
The rapid advancement of Audio Large Language Models (ALLMs) has enabled cost-effective, high-fidelity generation and manipulation of both speech and non-speech audio, including sound effects, singing voices, and music. While these capabilities foster creativity and content production, they also introduce significant security and trust challenges, as realistic audio deepfakes can now be generated and disseminated at scale. Existing audio deepfake detection (ADD) countermeasures (CMs) and benchmarks, however, remain largely speech-centric, often relying on speech-specific artifacts and exhibiting limited robustness to real-world distortions, as well as restricted generalization to heterogeneous audio types and emerging spoofing techniques. To address these gaps, we propose the All-Type Audio Deepfake Detection (AT-ADD) Grand Challenge for ACM Multimedia 2026, designed to bridge controlled academic evaluation with practical multimedia forensics. AT-ADD comprises two tracks: (1) Robust Speech Deepfake Detection, which evaluates detectors under real-world scenarios and against unseen, state-of-the-art speech generation methods; and (2) All-Type Audio Deepfake Detection, which extends detection beyond speech to diverse, unknown audio types and promotes type-agnostic generalization across speech, sound, singing, and music. By providing standardized datasets, rigorous evaluation protocols, and reproducible baselines, AT-ADD aims to accelerate the development of robust and generalizable audio forensic technologies, supporting secure communication, reliable media verification, and responsible governance in an era of pervasive synthetic audio.
Primary: Communication University of China
All Institutions: Communication University of China, Ant Group, Chinese Academy of Sciences, Beijing Institute of Technology, Shanghai Jiao Tong University
The paper presents the AT-ADD challenge, a comprehensive evaluation framework for audio deepfake detection that addresses existing gaps in robustness and generalization across audio types. This work is significant as it lays the groundwork for advancing audio forensic technologies, promoting secure communication and reliable media verification in the face of growing synthetic audio threats.
The methodology presented in the paper is robust, proposing a structured evaluation framework for audio deepfake detection that includes two distinct tracks focusing on speech and all-type audio. The challenge is designed to address the limitations of existing benchmarks by incorporating real-world conditions and diverse audio types. The datasets are well-constructed, ensuring a comprehensive evaluation of the proposed countermeasures (CMs) under various conditions, which enhances the reliability of the results.
The experimental evaluation is thorough, with a clear description of dataset composition, including the number of samples and the diversity of audio types. The inclusion of multiple state-of-the-art generation methods for both real and fake audio in the evaluation sets allows for a rigorous assessment of the CMs' performance. Baseline models are provided, which facilitate fair comparisons and establish a strong foundation for future research.
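Challenges of this kind typically rank countermeasures by Equal Error Rate, the operating point where the false-accept rate on spoofed audio equals the false-reject rate on bona fide audio. The paper does not spell out its scoring script, so the following is only a minimal illustrative sketch of how an EER might be computed from detector scores:

```python
import numpy as np

def compute_eer(bona_scores, spoof_scores):
    """Equal Error Rate: the threshold where the false-accept rate on
    spoofed audio equals the false-reject rate on bona fide audio.
    Higher scores are assumed to mean 'more likely genuine'."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    frr = np.array([(bona_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2

bona = np.array([0.9, 0.8, 0.7, 0.6])    # genuine utterances
spoof = np.array([0.3, 0.2, 0.4, 0.75])  # one hard spoof overlaps the genuine range
assert abs(compute_eer(bona, spoof) - 0.25) < 1e-9
```

A production scoring script would interpolate between thresholds rather than pick the nearest crossing, but the discrete version above captures the metric's definition.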
The paper emphasizes reproducibility by providing official implementations of baseline models and clear rules regarding data usage. The closed setting for the challenge ensures that participants can only use the provided datasets, which minimizes variability and enhances the reliability of the results. However, the paper could benefit from more detailed implementation instructions or links to code repositories for the proposed methods.
One limitation of the proposed challenge is that it may not fully capture the complexity of real-world audio deepfake scenarios, especially in terms of environmental variability and user-generated content. Additionally, the focus on specific audio types may overlook other emerging forms of audio manipulation. The challenge's closed setting might also restrict innovative approaches that could leverage external data.
The AT-ADD challenge has significant implications for the field of audio forensics and security, as it aims to improve the robustness and generalizability of audio deepfake detection systems. By addressing the challenges associated with diverse audio types and real-world conditions, the challenge promotes the development of technologies that can enhance media verification and secure communication in an era of increasing synthetic audio generation.
Voice design from natural language descriptions is emerging as a new task in text-to-speech multimodal generation, aiming to synthesize speech with target timbre and speaking style without relying on reference audio. However, existing methods mainly focus on single-utterance generation, leaving conversational voice design largely unexplored. In this work, we extend voice design to dialogue, enabling better target speaker modeling and turn-level expressive control in natural conversational settings. We propose CapTalk, a unified caption-conditioned text-audio autoregressive framework for both single-utterance and dialogue voice design. CapTalk uses utterance-level captions for single-utterance voice design and speaker-level captions for dialogue speaker modeling, and further introduces a CoT control sequence in dialogue to explicitly plan turn-level dynamic attributes. To resolve the conflict between stable timbre preservation and context-adaptive expression, we propose a hierarchical variational conditioning module with an utterance-level speaker encoder. This enables timbre reuse while keeping expression adaptive to the current utterance and, in dialogue, the surrounding context. We also build a comprehensive evaluation protocol for both single-utterance and dialogue settings. Experiments show that CapTalk achieves state-of-the-art performance on a single-utterance voice design benchmark and delivers better expression controllability and contextual appropriateness in multi-turn dialogue. Audio samples are available at: https://anonymous.4open.science/api/repo/Captalk-D601/file/index.html.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Hello Group Inc.
The main contribution of this paper is the introduction of CapTalk, a unified framework for voice design that effectively integrates single-utterance and dialogue generation, achieving state-of-the-art results while addressing key challenges in expressive speech synthesis. The comprehensive methodology and rigorous experimental evaluation position this work as a significant advancement in the field of machine learning and speech generation.
The paper introduces CapTalk, a unified caption-conditioned text-audio autoregressive framework that innovatively extends voice design to dialogue settings. The methodology effectively incorporates hierarchical variational conditioning to balance stable timbre preservation and context-adaptive expression, which is a significant advancement over existing methods that primarily focus on single-utterance generation. The use of CoT control sequences for explicit turn-level expressive control is a novel approach that enhances the model's ability to handle dynamic dialogue contexts.
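Conditioning an autoregressive model on speaker-level captions plus turn-level control tokens amounts to a particular serialization of the dialogue. The sketch below is only a hypothetical token layout (tag names like `<cap:…>` and `<cot>` are invented for illustration, not CapTalk's actual format), showing how captions and a CoT control sequence could be interleaved with turn text:

```python
def build_dialogue_prompt(speaker_captions, turns):
    """Interleave speaker-level captions, a turn-level CoT control
    sequence, and per-turn text into one stream. Purely illustrative:
    the tag vocabulary here is a guess, not CapTalk's serialization."""
    seq = [f"<cap:{spk}>{cap}" for spk, cap in speaker_captions.items()]
    for spk, text, style in turns:
        seq.append(f"<cot>{style}</cot>")   # planned turn-level dynamic attributes
        seq.append(f"<turn:{spk}>{text}")   # the text to be spoken
    return seq

prompt = build_dialogue_prompt(
    {"A": "calm low-pitched male voice", "B": "bright energetic female voice"},
    [("A", "How was the trip?", "curious, soft"),
     ("B", "Amazing, you have to see the photos!", "excited, fast")],
)
assert len(prompt) == 6                 # 2 captions + 2 * (control + text)
assert prompt[2].startswith("<cot>")    # CoT plan precedes each turn
```

The point of planning the `<cot>` attributes before each turn is that the model commits to an expressive intent before emitting acoustic tokens, rather than inferring style mid-utterance.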
The experiments demonstrate that CapTalk achieves state-of-the-art performance on single-utterance voice design benchmarks and shows improved expression controllability and contextual appropriateness in multi-turn dialogue. The evaluation protocol is comprehensive, utilizing both human evaluations and automatic metrics, which strengthens the reliability of the results. The paper provides detailed comparisons with existing models, showcasing the advantages of CapTalk through various metrics.
The paper outlines the architecture and training objectives clearly, which aids in reproducibility. However, the reliance on a specific multimodal model (Qwen3-Omni) for caption generation could limit the generalizability of the results if the model's performance varies. The authors plan to release caption annotations and a subset of data, which will further enhance reproducibility.
The paper acknowledges limitations related to the quality of the caption generation process and the emotional expressiveness of the training data, which primarily consists of natural conversational speech. These factors may impact the model's performance in more expressive settings. Additionally, the evaluation benchmarks for dialogue are still developing, which may affect the assessment of the model's capabilities.
CapTalk has the potential to significantly impact the fields of conversational AI and speech synthesis by enabling more natural and context-aware dialogue systems. The ability to generate expressive speech from textual descriptions could enhance applications in virtual assistants, gaming, and interactive storytelling, making human-computer interactions more engaging and realistic.
Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals - and even benefit from expressive text-to-speech (TTS) synthesis - but they fail to express the target emotions because emotions and linguistic contents are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two different emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods, while generating expressive talking face videos - even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
Primary: Ulsan National Institute of Science and Technology (UNIST)
All Institutions: Ulsan National Institute of Science and Technology (UNIST)
The paper presents a novel approach to emotion editing in talking face videos through Cross-Modal Emotion Transfer (C-MET), significantly advancing the field by enabling the synthesis of extended emotions from audio inputs. The methodology is well-structured, and the experimental validation demonstrates its effectiveness, making it a valuable contribution to the machine learning community.
The proposed Cross-Modal Emotion Transfer (C-MET) method is innovative in its approach to emotion editing in talking face videos by leveraging emotion semantic vectors derived from audio and visual modalities. The methodology effectively addresses the limitations of existing methods by enabling the generation of extended emotions without requiring extensive labeled datasets. The use of a contrastive learning framework to align audio and visual representations is a notable strength, as it enhances the model's ability to generalize across different emotional expressions.
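The core idea of an emotion semantic vector - the difference between an emotional and a neutral embedding, reused as a direction in another modality's feature space - can be sketched in a few lines. This is a minimal toy with random vectors; C-MET's encoders and the alignment between spaces are learned, not assumed as here:

```python
import numpy as np

def emotion_semantic_vector(emb_emotional, emb_neutral):
    """Direction in embedding space capturing an emotion as the
    difference between an emotional and a neutral embedding."""
    return emb_emotional - emb_neutral

def transfer(visual_neutral, sem_vec, alpha=1.0):
    """Shift a neutral visual embedding along the emotion direction;
    alpha scales the intensity of the transferred emotion."""
    return visual_neutral + alpha * sem_vec

# Toy demo: derive the direction from two audio embeddings, then
# apply it to a visual embedding of a neutral face.
rng = np.random.default_rng(0)
audio_neutral = rng.normal(size=8)
audio_angry = rng.normal(size=8)
v = emotion_semantic_vector(audio_angry, audio_neutral)
face_angry = transfer(rng.normal(size=8), v, alpha=0.8)
```

In the actual method this arithmetic only makes sense because contrastive training aligns the audio and visual embedding spaces; with unaligned encoders the transferred direction would be meaningless.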
The experiments conducted on the MEAD and CREMA-D datasets are comprehensive, demonstrating significant improvements in emotion accuracy over state-of-the-art methods. The quantitative metrics, such as Acc_emo, alongside qualitative assessments from user studies, provide a robust evaluation of the model's performance. The results indicate that C-MET not only achieves higher accuracy but also maintains visual fidelity and synchronization, which are critical for practical applications.
The paper includes sufficient implementation details, such as the choice of encoders, training protocols, and loss functions, which facilitate reproducibility. The availability of code and demo links further supports this aspect, although the actual code repository is not provided in the text.
The model's reliance on a minimum number of speech samples for stable performance could limit its applicability in scenarios with limited data. Additionally, the current focus on English datasets restricts the model's generalizability to multilingual contexts. The inability to handle multi-view identity images is another notable limitation that could affect the model's robustness in diverse applications.
The ability to generate expressive talking face videos has significant implications for fields such as virtual reality, gaming, and telecommunication, where emotional engagement is crucial. The advancements in emotion editing can enhance human-computer interaction, making virtual agents more relatable and effective in applications like education and therapy. The potential for integrating this technology into various multimedia platforms could lead to more immersive and empathetic user experiences.
Passive Acoustic Monitoring (PAM) is widely used for biodiversity assessment. Its application in African tropical forests is limited by scarce annotated data, reducing the performance of general-purpose ecoacoustic models on underrepresented taxa. In this study, we introduce DeepForestSound (DFS), a multi-species automatic detection model designed for PAM in African tropical forests. DFS relies on a semi-supervised pipeline combining clustering of unannotated recordings with manual validation, followed by supervised fine-tuning of an Audio Spectrogram Transformer (AST) using low-rank adaptation, which is compared to a frozen-backbone linear baseline (DFS-Linear). The framework supports the detection of multiple taxonomic groups, including birds, primates, and elephants, from long-term acoustic recordings. DFS was trained on acoustic data collected in the Sebitoli area, in Kibale National Park, Uganda, and evaluated on an independent dataset recorded two years later at different locations within the same forest. This evaluation therefore assesses generalization across time and recording sites within a single tropical forest ecosystem. Across 8 out of 12 taxa, DFS outperforms existing automatic detection tools, particularly for non-avian taxa, achieving average AP values of 0.964 for primates and 0.961 for elephants. Results further show that LoRA-based fine-tuning substantially outperforms linear probing across taxa. Overall, these results demonstrate that task-oriented, region-specific training substantially improves detection performance in acoustically complex tropical environments, and highlight the potential of DFS as a practical tool for biodiversity monitoring and conservation in African rainforests.
Primary: Muséum National d'Histoire Naturelle
All Institutions: Muséum National d'Histoire Naturelle, Sebitoli Chimpanzee Project, Uganda Wildlife Authority, Nitidae Association, Centre d'Ecologie et des Sciences de la Conservation, Institut de Systématique, Evolution, Biodiversité
This paper presents DeepForestSound (DFS), a multi-species automatic detection model for passive acoustic monitoring in African tropical forests, demonstrating a significant advancement in biodiversity monitoring techniques. The innovative methodology, rigorous experimental evaluation, and potential for real-world applications underscore its importance in the field of machine learning and conservation biology.
The methodology presented in this paper is robust and innovative, utilizing a semi-supervised pipeline to generate labeled datasets from unannotated acoustic recordings. The combination of clustering techniques with manual validation, followed by fine-tuning a pretrained Audio Spectrogram Transformer (AST) using Low-Rank Adaptation (LoRA), is particularly noteworthy. This approach addresses the challenge of limited annotated data in biodiversity monitoring effectively. The detailed steps taken in data collection, processing, and model training demonstrate a comprehensive understanding of the complexities involved in acoustic monitoring in tropical environments.
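The low-rank adaptation step contrasted with linear probing rests on a simple piece of weight arithmetic: the pretrained weight W stays frozen and only a rank-r update B @ A is trained, with the effective weight W + (alpha/r) * B @ A. A minimal numerical sketch (toy shapes, not the paper's AST configuration):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha, r):
    """LoRA-adapted linear layer: the frozen weight W plus a scaled
    low-rank update (alpha / r) * B @ A. Only A and B are trainable."""
    return x @ (W + (alpha / r) * (B @ A)).T

d_in, d_out, r = 16, 16, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight (toy)
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero-initialized
x = rng.normal(size=(2, d_in))

# Standard LoRA initialization (B = 0) means the adapted layer starts
# out exactly equal to the frozen backbone, then drifts as B is trained.
assert np.allclose(lora_forward(x, W, A, B, alpha=8, r=r), x @ W.T)
```

The practical appeal in this low-data regime is parameter count: the update trains r * (d_in + d_out) values per layer instead of d_in * d_out, which is why LoRA fine-tuning can outperform a frozen-backbone linear probe without the overfitting risk of full fine-tuning.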
The experiments are well-structured, with a clear evaluation protocol that includes comparisons with existing models such as BirdNET, Perch v2, and RDet. The results indicate that DFS outperforms these models for non-avian taxa, which is significant given the ecological importance of these species. The use of Average Precision (AP) and best F1 scores as evaluation metrics is appropriate, and the results are presented clearly, highlighting the model's strengths and weaknesses across different taxa.
The paper provides sufficient detail on the implementation of the model, including the datasets used, preprocessing steps, and training configurations. However, the inability to share raw audio recordings due to legal restrictions may limit full reproducibility. The availability of the code and pretrained models on GitHub is a positive aspect that enhances reproducibility.
One limitation identified is the focus on a specific geographic region (Kibale National Park) and the potential lack of generalizability to other tropical forest ecosystems. Additionally, while the semi-supervised clustering approach is effective, the authors acknowledge that a systematic sensitivity analysis of hyperparameters was not conducted, which could affect the robustness of the model. The model's performance on underrepresented species may also be influenced by the limited training data available for those taxa.
The implications of this research are significant for biodiversity conservation, particularly in underrepresented and threatened species within African tropical forests. The development of a task-oriented model like DFS can facilitate more effective monitoring and conservation efforts, potentially leading to better-informed ecological management strategies. The framework's adaptability for future species integration also suggests a scalable approach to biodiversity assessment.
Conversational AI has made significant progress, yet generating expressive and controllable text-to-speech (TTS) remains challenging. Specifically, controlling fine-grained voice styles and emotions is notoriously difficult and typically requires massive amounts of heavily annotated training data. To overcome this data bottleneck, we present a scalable, data-efficient cascaded framework that pairs textual style tokens with human-curated, high-quality audio prompts. This approach enables single-shot adaptation to fine-grained speaking styles and character voices. In the context of TTS, this audio prompting acts as In-Context Learning (ICL), guiding the model's prosody and timbre without requiring massive parameter updates or large-scale retraining. To further enhance generation quality and mitigate hallucinations, we introduce a novel ICL-based online reinforcement learning (RL) strategy. This strategy directly optimizes the autoregressive prosody model using subjective aesthetic rewards while being constrained by Connectionist Temporal Classification (CTC) alignment to preserve intelligibility. Comprehensive human perception evaluations demonstrate significant improvements in both the naturalness and expressivity of the synthesized speech, establishing the efficacy of our ICL-based online RL approach.
Primary: Meta AI
All Institutions: Meta AI
The paper presents a novel cascaded framework for enhancing conversational TTS through ICL and online reinforcement learning, significantly improving the expressivity and naturalness of synthesized speech. The technical contributions, including innovative methodologies and thorough experimental evaluations, position this work as a meaningful advancement in the field of conversational AI and TTS systems.
The proposed methodology introduces a cascaded framework that utilizes textual style tokens and audio prompts for fine-grained control over TTS expressivity. The integration of In-Context Learning (ICL) allows for single-shot adaptation, which is a significant advancement in reducing the data requirements typically associated with expressive TTS systems. The novel ICL-based online reinforcement learning strategy optimizes the autoregressive prosody model using subjective aesthetic rewards while maintaining intelligibility through CTC alignment, showcasing a sophisticated approach to mitigating common issues in TTS systems.
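The "aesthetic reward constrained by CTC alignment" idea can be sketched as a shaping function: reward the subjective score, but subtract a penalty once the CTC loss exceeds an intelligibility budget. The function name, the linear penalty, and the budget value below are illustrative assumptions, not the paper's actual formulation:

```python
def shaped_reward(aesthetic_score, ctc_loss, ctc_budget=1.0, weight=0.5):
    """Toy RL reward for a prosody model: maximize the aesthetic score
    while penalizing CTC loss above an intelligibility budget, so the
    policy cannot trade intelligibility for expressivity."""
    penalty = max(0.0, ctc_loss - ctc_budget)
    return aesthetic_score - weight * penalty

assert shaped_reward(4.2, 0.8) == 4.2          # within budget: no penalty
assert abs(shaped_reward(4.2, 2.8) - 3.3) < 1e-9  # over budget: 0.5 * 1.8 subtracted
```

A hinge-style constraint like this leaves the reward untouched for already-intelligible samples, steering the online RL updates toward expressivity rather than toward degenerate, hallucination-prone outputs.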
The experiments are robust, employing comprehensive human perception evaluations that assess naturalness and expressivity across multiple dimensions. The use of a comparative Mean Opinion Score (CMOS) and a structured rating protocol based on paralinguistic dimensions adds rigor to the evaluation process. The results demonstrate substantial improvements over baseline models, indicating the effectiveness of the proposed methods.
The paper provides a clear description of the experimental setup, including the selection of audio prompts and the training process for the models. However, the lack of publicly available datasets or code may hinder full reproducibility. The authors could enhance reproducibility by providing access to their training data and model implementations.
One limitation is the reliance on human-curated audio prompts, which may introduce subjectivity and variability in the results. Additionally, while the proposed methods show improvements, the scalability of the approach in real-world applications and its performance across diverse languages and accents remain to be fully explored.
The advancements in expressive TTS have significant implications for various applications, including virtual assistants, audiobooks, and interactive entertainment. By enabling more natural and expressive speech synthesis, this research could enhance user experiences in conversational AI systems and contribute to the development of more engaging and human-like interactions.
Noisy speech separation systems are typically trained on fully-synthetic mixtures, limiting generalization to real-world scenarios. Though training on mixtures of in-domain (thus often noisy) speech is possible, we show that this leads to undesirable optima where mixture noise is retained in the estimates, due to the inseparability of the background noises and the loss function's symmetry. To address this, we propose ring mixing, a batch strategy of using each source in two mixtures, alongside a new Signal-to-Consistency-Error Ratio (SCER) auxiliary loss penalizing inconsistent estimates of the same source from different mixtures, breaking symmetry and incentivizing denoising. On a WHAM!-based benchmark, our method can reduce residual noise by upwards of half, effectively learning to denoise from only noisy recordings. This opens the door to training more generalizable systems using in-the-wild data, which we demonstrate via systems trained using naturally-noisy speech from VoxCeleb.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Carnegie Mellon University
The paper presents a significant advancement in the field of unsupervised speech separation by introducing innovative methodologies that effectively address the challenges posed by noisy training data. The combination of ring mixing and SCER loss represents a promising direction for future research, with the potential to improve the generalization of speech separation systems in real-world applications.
The paper introduces a novel batch construction strategy called "ring mixing" and an auxiliary loss function termed Signal-to-Consistency-Error Ratio (SCER). The methodology effectively addresses the limitations of conventional supervised training in noisy speech separation tasks by breaking the symmetry in the loss function that leads to undesirable optima. The use of multiple mixtures for the same source in training helps in reducing residual noise and improving the generalization of the model to real-world scenarios. The approach is well-justified, with a clear explanation of the problems with existing methods and a logical progression to the proposed solutions.
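The batch construction itself is easy to make concrete: with B sources, ring mixing forms mixture m_i = s_i + s_{(i+1) mod B}, so every source appears in exactly two mixtures and yields two estimates whose disagreement can be penalized. The sketch below pairs that with one plausible form of a signal-to-consistency-error ratio in dB; the exact SCER definition in the paper may differ:

```python
import numpy as np

def ring_mix(sources):
    """Ring mixing: mixture i = source i + source (i+1) mod B, so each
    source contributes to exactly two mixtures in the batch."""
    B = len(sources)
    return [sources[i] + sources[(i + 1) % B] for i in range(B)]

def scer_db(est_a, est_b):
    """Consistency ratio (dB) between two estimates of the same source:
    mean-estimate energy over disagreement energy. The precise form of
    the paper's SCER loss is assumed here, not quoted."""
    ref = (est_a + est_b) / 2
    err = est_a - est_b
    return 10 * np.log10(np.sum(ref**2) / (np.sum(err**2) + 1e-12))

rng = np.random.default_rng(0)
srcs = [rng.normal(size=100) for _ in range(4)]
mixes = ring_mix(srcs)
assert len(mixes) == 4
assert np.allclose(mixes[0], srcs[0] + srcs[1])
# Identical estimates of a source -> very large (consistent) SCER.
assert scer_db(srcs[0], srcs[0].copy()) > 100
```

The symmetry-breaking intuition follows directly: retained mixture noise differs between the two mixtures containing the same source, so noisy estimates disagree and score a low SCER, while a denoised estimate of the shared source is the only way to make both estimates agree.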
The experiments conducted on the WHAM! dataset demonstrate significant improvements in denoising capabilities, with results indicating a reduction in residual noise by upwards of half. The evaluation metrics, including SI-SDR and occupancy metrics, provide a comprehensive assessment of the model's performance. The results show that the proposed SCER loss contributes positively to the denoising task while maintaining separation quality, which is a critical aspect of the research.
The paper provides sufficient details regarding the datasets, model architecture, and training configurations, which are essential for reproducibility. However, the lack of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results. The hyperparameter settings, particularly for the SCER loss, are mentioned but not extensively tuned, which could affect reproducibility in varying contexts.
One notable limitation is the observed degradation in performance when evaluating on noiseless conditions, suggesting that the model may not generalize well to all scenarios. Additionally, the reliance on specific datasets may limit the applicability of the findings to other types of noisy speech environments. The authors also mention that the SCER loss can lead to local minima, which may hinder optimal performance.
The proposed methods have significant implications for real-world applications in speech separation and denoising, particularly in environments where overlapping speech and background noise are prevalent. The ability to train models using naturally noisy recordings could enhance the robustness of speech processing systems in various applications, including telecommunications, hearing aids, and voice recognition systems. This work opens avenues for further research into unsupervised learning techniques in audio processing.
Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose TASU2, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, AISpeech Ltd, Nanjing University
The main contribution of this paper is the introduction of TASU2, a controllable CTC simulation framework that significantly improves the alignment and adaptation of speech LLMs in low-resource settings. The methodology and results presented demonstrate a meaningful advancement in the efficiency and effectiveness of speech recognition systems, particularly in the context of limited data availability.
The methodology proposed in TASU2 is innovative, focusing on controllable CTC simulation to improve the alignment between text and speech representations. The use of a WER-conditioned approach allows for more precise control over the generated posteriors, which is a significant advancement over previous methods like TASU. The authors effectively integrate a lightweight Transformer architecture to achieve this, which is appropriate for the task. The algorithm is well-structured, and the training signal is designed to closely mimic real acoustic behavior, enhancing the fidelity of the simulation.
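The "specified WER range" can be made concrete with a much simpler stand-in than TASU2's learned simulator: inject token errors into a transcript at a target rate before building pseudo-posteriors. The sketch below handles substitutions only (TASU2's WER conditioning also covers deletions/insertions and operates on full posterior distributions, which is assumed away here):

```python
import random

def corrupt_transcript(tokens, target_wer, vocab, seed=0):
    """Toy stand-in for WER-conditioned simulation: substitute each
    token with probability target_wer, drawing a different token from
    the vocabulary. Substitutions only; a deliberate simplification."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if rng.random() < target_wer:
            out.append(rng.choice([v for v in vocab if v != tok]))
        else:
            out.append(tok)
    return out

vocab = ["a", "b", "c", "d"]
clean = ["a", "b", "c", "d"] * 250          # 1000 tokens
noisy = corrupt_transcript(clean, 0.2, vocab)
wer = sum(x != y for x, y in zip(clean, noisy)) / len(clean)
assert abs(wer - 0.2) < 0.1                 # realized rate near the requested one
```

Sweeping `target_wer` from low to high is what makes a principled curriculum possible: supervision difficulty varies smoothly without synthesizing any audio.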
The experiments are comprehensive, evaluating TASU2 across various datasets and settings, including low-resource adaptation scenarios. The results demonstrate consistent improvements over the baseline methods, particularly in terms of WER reduction and domain generalization. The paper provides a thorough analysis of the results, including ablation studies that validate the importance of the WER conditioning. However, specific quantitative results (e.g., exact WER scores) were not detailed in the provided text, which could enhance clarity.
The paper outlines the training and evaluation setup, including the architecture of the simulator and the datasets used. However, the absence of a public code repository or detailed implementation instructions limits the reproducibility of the results. Providing a GitHub link or similar would significantly enhance this aspect.
One limitation is the reliance on a teacher ASR system for generating posteriors, which may introduce biases depending on the quality of the ASR model used. Additionally, while the method shows promise in low-resource settings, its performance in extremely low-resource scenarios remains to be fully explored. The paper could also benefit from a discussion on the scalability of the approach to larger datasets or more complex domains.
The proposed TASU2 framework has significant implications for the field of speech recognition, particularly in scenarios where paired audio-text data is scarce. By enabling effective low-resource adaptation, it opens avenues for deploying speech LLMs in diverse languages and dialects, thereby enhancing accessibility and usability in various applications. This could lead to advancements in real-time translation, voice assistants, and other speech-driven technologies.
Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchmarks, it remains challenging to balance recognition quality with latency and overhead, while hallucinations further limit real-world deployment. In this study, we revisit LLM-based ASR from an entropy allocation perspective and introduce three metrics to characterize how training paradigms allocate entropy reduction between the speech encoder and the LLM. To remedy entropy-allocation inefficiencies in prevailing approaches, we propose a principled multi-stage training strategy grounded in capability-boundary awareness, optimizing parameter efficiency and hallucination robustness. Specifically, we redesign the pretraining strategy to alleviate the speech-text modality gap, and further introduce an iterative asynchronous SFT stage between alignment and joint SFT to preserve functional decoupling and constrain encoder representation drift. Experiments on Mandarin and English benchmarks show that our method achieves competitive performance with state-of-the-art models using only 2.3B parameters, while also effectively mitigating hallucinations through our decoupling-oriented design.
Primary: NIO
All Institutions: NIO
The paper presents a novel approach to entropy allocation in LLM-based ASR systems, significantly contributing to the understanding and improvement of model performance. The methodology is well-structured, and the experimental results validate the proposed framework, marking a meaningful advancement in the field of audio processing and machine learning.
The paper introduces an innovative perspective on entropy allocation in LLM-based ASR systems, proposing new metrics (NSE, PAI, CSAI) to analyze the dynamics between speech encoders and LLMs. The multi-stage training strategy, particularly the iterative asynchronous SFT (IA-SFT) stage, is a significant methodological advancement that aims to preserve functional decoupling and mitigate hallucinations. The approach is well-grounded in theoretical considerations and is supported by empirical evidence, making it a robust contribution to the field.
The experiments conducted on Mandarin and English benchmarks demonstrate the effectiveness of the proposed methods, achieving competitive performance with significantly fewer parameters than state-of-the-art models. The paper provides a comprehensive comparison with existing models, showcasing improvements in both recognition accuracy and hallucination rates. The use of diverse datasets strengthens the validity of the results.
The paper includes detailed descriptions of the training procedures, data statistics, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository or demo URL limits the practical reproducibility of the findings.
While the proposed method shows promise, the paper does not address potential scalability issues when applied to larger datasets or more complex ASR tasks. Additionally, the reliance on specific metrics for evaluation may not capture all aspects of model performance, particularly in real-world scenarios.
The research has significant implications for the deployment of LLM-based ASR systems in real-world applications, particularly in enhancing recognition accuracy while reducing hallucinations. The findings could influence future research directions in ASR and multimodal systems, promoting more efficient and robust architectures.
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
Primary: Unknown
All Institutions: Unknown
The paper presents a novel Teacher-Guided Dual-Path framework for audio-visual representation learning, significantly improving state-of-the-art performance in zero-shot retrieval tasks. The comprehensive methodology and experimental validation highlight its potential impact on the field, addressing critical challenges in cross-modal alignment and semantic noise reduction.
The proposed TG-DP framework effectively decouples the objectives of masked reconstruction and contrastive learning into separate optimization paths. This dual-path approach allows for tailored visibility patterns that enhance cross-modal alignment while mitigating semantic noise and optimization interference. The introduction of a teacher-student mechanism further enriches the training process by providing structured guidance, which is a noteworthy advancement in the field. The methodology is well-structured and addresses existing challenges in audio-visual representation learning.
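The core idea of the dual-path design, sampling a separate visibility pattern for each branch rather than sharing one mask, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the mask ratios, and the use of independent uniform-random masks for both branches are assumptions (TG-DP's contrastive-branch pattern is additionally organized by the teacher, which is not shown here).

```python
import numpy as np

def dual_path_masks(num_tokens, recon_mask_ratio=0.75, contrast_mask_ratio=0.25, rng=None):
    """Sample two independent visibility patterns over the same token sequence.

    The reconstruction branch sees few tokens (high mask ratio, MAE-style),
    while the contrastive branch keeps most tokens so the alignment objective
    operates on a semantically richer view. Returns two boolean arrays where
    True marks a visible token.
    """
    rng = rng or np.random.default_rng()

    def visible(mask_ratio):
        keep = num_tokens - int(num_tokens * mask_ratio)  # tokens left visible
        idx = rng.permutation(num_tokens)[:keep]          # random visible subset
        mask = np.zeros(num_tokens, dtype=bool)
        mask[idx] = True
        return mask

    return visible(recon_mask_ratio), visible(contrast_mask_ratio)
```

Because the two patterns are drawn independently, the contrastive pathway is no longer constrained to the sparse, reconstruction-oriented view, which is the decoupling the paper argues for.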
The experiments are comprehensive, utilizing large-scale datasets such as AudioSet-2M and VGGSound. The results demonstrate significant improvements in zero-shot retrieval performance, achieving state-of-the-art results across various metrics. The ablation studies provide valuable insights into the effectiveness of the proposed components, such as the dual-path structure and teacher-guided masking strategy. However, the paper could benefit from more detailed comparisons with additional baselines to further validate the claims.
The paper provides a clear description of the methodology and experimental setup, including hyperparameters and data preprocessing steps. The availability of code on GitHub enhances reproducibility. However, the lack of detailed information on the training environment and specific configurations may pose challenges for complete replication.
The primary limitation is the unknown primary institution and the lack of citation context, which may hinder the paper's visibility and impact in the academic community. Additionally, the performance improvements, while significant, may still be context-dependent and require further validation across diverse tasks and datasets.
The advancements in audio-visual representation learning have the potential to enhance various applications, including multimedia retrieval, content-based recommendation systems, and interactive AI systems. The proposed framework could lead to more robust models that understand and integrate audio-visual information, paving the way for future research and applications in multimodal AI.
Word error rate (WER) is the dominant metric for automatic speech recognition, yet it cannot detect a systematic failure mode: models that produce fluent output in the wrong writing system. We define Script Fidelity Rate (SFR), the fraction of hypothesis characters in the target script block, computable without reference transcriptions, and report the first systematic measurement of script collapse across six languages spanning four writing systems (Pashto, Urdu, Hindi, Bengali, Malayalam, Somali) and nine ASR models on FLEURS test sets. Across 53 evaluated model-language pairs, 18 (34%; 95% Wilson CI: 23-47%) exhibit script collapse (SFR < 10%); MMS-1B and SeamlessM4T-v2 maintain SFR above 99% on every language evaluated, confirming that SFR correctly identifies high fidelity where it is present. We identify three distinct collapse patterns: Latin phonetic substitution (smaller Whisper on Indic languages), Arabic substitution for Somali's Latin-script orthography, and Devanagari substitution where larger Whisper models treat all Indic audio as Hindi, a failure present even in Whisper large-v3.
Primary: Independent Researcher
All Institutions: Independent Researcher
The paper presents a novel metric, Script Fidelity Rate (SFR), that effectively measures the fidelity of ASR outputs in multilingual contexts, addressing a critical gap in existing evaluation methodologies. The comprehensive analysis of technical contributions, methodology, and significance to the field underscores the potential for SFR to enhance the reliability of ASR systems across diverse languages and scripts.
The paper introduces a novel metric, Script Fidelity Rate (SFR), which addresses a critical gap in the evaluation of automatic speech recognition (ASR) systems, particularly for multilingual contexts. The methodology is well-defined, relying on Unicode block membership to assess the fidelity of output scripts without requiring reference transcriptions. This approach allows for a continuous evaluation in production settings, which is a significant advancement over traditional metrics like WER that do not account for script fidelity. The empirical taxonomy of collapse patterns adds depth to the analysis, providing insights into specific failure modes across different models.
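Because SFR needs no reference transcription, it reduces to counting which hypothesis characters fall in the target Unicode block. A minimal sketch follows; the block ranges shown are an illustrative subset (the paper's exact block definitions are not reproduced here), and restricting the count to alphabetic characters, so whitespace and punctuation are ignored, is an assumption.

```python
# Approximate Unicode block ranges for a few target scripts (illustrative subset).
SCRIPT_BLOCKS = {
    "devanagari": (0x0900, 0x097F),
    "bengali":    (0x0980, 0x09FF),
    "malayalam":  (0x0D00, 0x0D7F),
    "arabic":     (0x0600, 0x06FF),
    "latin":      (0x0041, 0x024F),
}

def script_fidelity_rate(hypothesis: str, script: str) -> float:
    """Fraction of alphabetic characters in `hypothesis` that lie in the
    target script's Unicode block. No reference transcription is needed,
    so this can run continuously on production output."""
    lo, hi = SCRIPT_BLOCKS[script]
    letters = [c for c in hypothesis if c.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(1 for c in letters if lo <= ord(c) <= hi)
    return in_block / len(letters)
```

Under this sketch, a fluent Hindi hypothesis scores near 1.0 for `"devanagari"`, while a Whisper-style Latin phonetic substitution scores near 0.0, which is the collapse signature (SFR < 10%) the paper reports.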
The experiments conducted are robust, evaluating 53 model-language pairs across six languages and nine ASR models. The use of the FLEURS test sets is appropriate, and the systematic measurement of SFR across various models highlights the effectiveness of the proposed metric. The results clearly demonstrate the prevalence of script collapse in certain models, particularly the Whisper family, and the paper effectively uses statistical confidence intervals to support its findings.
The paper provides a clear description of the experimental setup, including datasets and models used, and offers access to the code and results via Hugging Face. However, the lack of a peer-reviewed venue may raise concerns about the rigor of the validation process, although the author mentions validation against known positives and negatives.
The primary limitation is that SFR does not differentiate between high-quality target-script text and random characters from the correct script, which could lead to misleading interpretations. Additionally, the Unicode block specifications may be approximate, potentially affecting the accuracy of SFR for languages that use characters from multiple blocks.
The introduction of SFR has significant implications for the deployment of ASR systems in multilingual environments, particularly for low-resource languages. By enabling continuous monitoring of script fidelity, it can help developers identify and rectify issues before they affect end-users. This metric could foster improvements in ASR technologies, making them more reliable for diverse linguistic contexts.
Large Audio-Language Models (LALMs) have set new benchmarks in speech processing, yet their deployment is hindered by the memory footprint of the Key-Value (KV) cache during long-context inference. While general KV cache compression techniques excel in LLMs, they often fail in the audio domain by overlooking the intrinsic temporal continuity of acoustic signals. To bridge this gap, we propose AudioKV, a novel framework that robustly prioritizes audio-critical attention heads through a hardware-friendly semantic-acoustic alignment mechanism. Specifically, we identify these modality-specialized heads by analyzing attention scores in ASR tasks and dynamically allocate KV cache budgets preferentially to them. Furthermore, we introduce Spectral Score Smoothing (SSS), an FFT-based global filtering strategy designed to suppress high-frequency noise and recover smooth global trends from importance scores, ensuring more balanced token selection with unprecedented precision. Extensive evaluations across multiple LALMs, including Qwen and Gemma series, demonstrate that AudioKV significantly outperforms baselines while enhancing computational efficiency. Notably, at a 40% compression ratio, AudioKV maintains near-full accuracy on Qwen3-Omni-30B with only a 0.45% drop, whereas traditional methods suffer from catastrophic performance degradation and repetition. Our code will be released after acceptance.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Shanghai Jiao Tong University, HKUST (GZ), Xidian University
The main contribution of this paper is the introduction of AudioKV, a novel framework for efficient KV cache management in audio-language models, which significantly enhances performance while reducing memory usage. This work addresses a critical bottleneck in deploying LALMs and offers a robust solution that combines innovative methodologies with thorough experimental validation, marking a meaningful advancement in the field of machine learning for audio processing.
The methodology presented in the paper is innovative, focusing on the unique challenges of Key-Value (KV) cache management in Large Audio-Language Models (LALMs). The authors propose a dual approach that combines audio-aware head allocation with Spectral Score Smoothing (SSS) to enhance the efficiency of KV cache usage. The identification of audio-critical attention heads through attention score analysis is a significant contribution, as it allows for a more nuanced allocation of resources compared to traditional uniform methods. The SSS technique, which employs FFT-based filtering to stabilize importance scores, is particularly noteworthy for its potential to improve performance in dynamic audio contexts.
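The FFT-based filtering behind Spectral Score Smoothing can be illustrated as a simple low-pass filter over the 1-D sequence of token-importance scores. This is a sketch under assumptions: the function name, the `keep_ratio` cutoff, and the use of NumPy's real FFT are choices made here, not the paper's exact filter design.

```python
import numpy as np

def spectral_score_smoothing(scores: np.ndarray, keep_ratio: float = 0.1) -> np.ndarray:
    """Suppress high-frequency noise in a 1-D importance-score sequence.

    Transforms the scores with a real FFT, zeroes all but the lowest
    `keep_ratio` fraction of frequency bins, and inverts the transform,
    recovering the smooth global trend used for token selection.
    """
    spectrum = np.fft.rfft(scores)
    cutoff = max(1, int(len(spectrum) * keep_ratio))
    spectrum[cutoff:] = 0.0  # discard high-frequency components
    return np.fft.irfft(spectrum, n=len(scores))
```

After smoothing, a top-k selection over the filtered scores picks tokens by global trend rather than by spiky per-token noise, which is the "more balanced token selection" the abstract describes.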
The experiments are comprehensive and demonstrate the effectiveness of AudioKV across multiple benchmarks, including Automatic Speech Recognition (ASR) and Speech Translation (ST). The results show that AudioKV outperforms existing methods significantly, especially at high compression ratios where other methods fail. The use of diverse datasets and models strengthens the validity of the findings, and the detailed performance metrics provide a clear picture of the advantages of the proposed method.
The paper mentions that the code will be released after acceptance, which is a positive step towards reproducibility. However, the absence of a public demo or project URL limits immediate access to the implementation details. The methodology is described in sufficient detail to allow for replication, but the lack of a publicly available codebase at this time is a drawback.
One limitation noted in the paper is the potential for repetition and degeneration in output under high KV cache compression ratios, which could affect the quality of generated text. Additionally, while the method shows promise, its applicability to other modalities beyond audio is not explored, which may limit its generalizability.
The implications of this work are significant for the deployment of LALMs in real-world applications, particularly in resource-constrained environments where efficient memory usage is critical. The techniques developed could lead to advancements in speech recognition and multimodal interactions, potentially enhancing user experiences in various applications such as virtual assistants, transcription services, and interactive audio systems.
Target Speaker Extraction (TSE) aims to isolate a specific speaker's voice from a mixture, guided by a pre-recorded enrollment. While TSE bypasses the global permutation ambiguity of blind source separation, it remains vulnerable to speaker confusion, where models mistakenly extract the interfering speaker. Furthermore, conventional TSE relies on a static inference pipeline, where performance is limited by the quality of the fixed enrollment. To overcome these limitations, we propose EvoTSE, an evolving TSE framework in which the enrollment is continuously updated through reliability-filtered retrieval over high-confidence historical estimates. This mechanism reduces speaker confusion and relaxes the quality requirements for pre-recorded enrollment without relying on additional annotated data. Experiments across multiple benchmarks demonstrate that EvoTSE achieves consistent improvements, especially when evaluated on out-of-domain (OOD) scenarios. Our code and checkpoints are available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Nanjing University, Huawei Technologies Co., Ltd.
The main contribution of this paper is the introduction of EvoTSE, a novel framework for Target Speaker Extraction that dynamically updates speaker enrollments to mitigate speaker confusion and improve performance in challenging audio environments. This work significantly advances the state of the art in TSE, particularly in handling out-of-domain scenarios, and provides a solid foundation for future research in audio processing and speaker identification.
The proposed EvoTSE framework innovatively addresses the limitations of static enrollment in Target Speaker Extraction (TSE) by introducing a dynamic, evolving enrollment mechanism that utilizes historical context to adaptively update speaker cues. The methodology integrates a contextual retriever, backbone extractor, reliability classifier, and memory curator, which collectively enhance the robustness of speaker extraction in long-duration audio scenarios. The approach is well-structured and leverages existing concepts like Retrieval-Augmented Generation (RAG) while extending them into the audio domain, showcasing a thoughtful adaptation of techniques to solve a specific problem in TSE.
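The reliability-filtered update of the enrollment can be sketched as a small memory module. This is a minimal sketch, not EvoTSE's implementation: the class name, the fixed threshold, the capacity, and mean-pooling of embeddings are all assumptions, and the paper's contextual retriever and memory curator are more elaborate than a thresholded append.

```python
import numpy as np

class EvolvingEnrollment:
    """Maintain a memory of high-confidence speaker embeddings and pool
    them into the current enrollment cue."""

    def __init__(self, init_embedding, threshold=0.8, capacity=16):
        self.memory = [np.asarray(init_embedding, dtype=float)]
        self.threshold = threshold   # minimum reliability to enter memory
        self.capacity = capacity     # keep only the most recent estimates

    def update(self, estimate_embedding, reliability):
        """Admit an extracted-speech embedding only if the reliability
        classifier trusts it, so low-quality estimates never pollute the cue."""
        if reliability >= self.threshold:
            self.memory.append(np.asarray(estimate_embedding, dtype=float))
            self.memory = self.memory[-self.capacity:]

    @property
    def enrollment(self):
        # Current cue: mean of all retained high-confidence embeddings.
        return np.mean(self.memory, axis=0)
```

Even if the initial pre-recorded enrollment is weak, the cue drifts toward the target speaker as trusted historical estimates accumulate, which is how the framework relaxes enrollment-quality requirements.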
The experimental setup is comprehensive, utilizing multiple datasets including WSJ0-2mix, Libri2mix-clean, and a newly constructed Emotional Speech Database (ESD) to evaluate the model's performance across various conditions. The results demonstrate consistent improvements in extraction quality, particularly in out-of-domain scenarios, which is a significant contribution to the field. The use of multiple evaluation metrics, including SI-SDRi and NSR, provides a robust framework for assessing the model's effectiveness.
The paper provides sufficient implementation details, including model configurations and training strategies, which enhance reproducibility. However, the absence of a clear mention of the specific venue or publication may hinder broader accessibility to the research community. The availability of code and checkpoints on GitHub is a positive aspect that supports reproducibility.
One limitation is the reliance on the quality of historical estimates, which may introduce noise if the initial enrollment is poor. Additionally, while the framework shows promise in OOD scenarios, the paper does not extensively discuss the computational complexity or real-time applicability of the EvoTSE framework in practical applications.
The EvoTSE framework has significant implications for real-world applications such as voice assistants, automated transcription services, and any system requiring speaker identification in noisy environments. By improving the robustness of TSE, this work could enhance user experiences in various audio processing applications, particularly in dynamic and emotionally varied contexts.
Cross-lingual Speech Emotion Recognition (CLSER) aims to identify emotional states in unseen languages. However, existing methods heavily rely on the semantic synchrony of complete labels and static feature stability, hindering low-resource languages from reaching high-resource performance. To address this, we propose a semi-supervised framework based on Semantic-Emotional Resonance Embedding (SERE), a cross-lingual dynamic feature paradigm that requires neither target language labels nor translation alignment. Specifically, SERE constructs an emotion-semantic structure using a small number of labeled samples. It learns human emotional experiences through an Instantaneous Resonance Field (IRF), enabling unlabeled samples to self-organize into this structure. This achieves semi-supervised semantic guidance and structural discovery. Additionally, we design a Triple-Resonance Interaction Chain (TRIC) loss to enable the model to reinforce the interaction and embedding capabilities between labeled and unlabeled samples during emotional highlights. Extensive experiments across multiple languages demonstrate the effectiveness of our method, requiring only 5-shot labeling in the source language.
Primary: Xinjiang University
All Institutions: Xinjiang University, Pengcheng Laboratory Xinjiang Network Node, Xinjiang Multimodal Intelligent Processing and Information Security Engineering Technology Research Center, Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing
The paper presents a semi-supervised framework for cross-lingual speech emotion recognition that effectively utilizes limited labeled data to improve performance across multiple languages. The technical contributions, particularly the novel use of dynamic feature extraction and interaction mechanisms, position this work as a meaningful advancement in the field of machine learning and emotion recognition.
The proposed methodology introduces a novel semi-supervised framework, Semantic-Emotional Resonance Embedding (SERE), which effectively addresses the challenges of cross-lingual speech emotion recognition (CLSER) by leveraging a small number of labeled samples to construct an emotion-semantic structure. The use of the Instantaneous Resonance Field (IRF) and the Triple-Resonance Interaction Chain (TRIC) loss is innovative, allowing for dynamic feature extraction and interaction between labeled and unlabeled data, which enhances the model's ability to generalize across languages.
The experiments are extensive, covering multiple languages and demonstrating the effectiveness of the proposed method with only 5-shot labeling. The results show significant improvements over existing methods, indicating the robustness of the approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and additional metrics to strengthen the evaluation.
While the methodology is described in detail, the lack of a publicly available code repository limits reproducibility. Including implementation details, hyperparameters, and data preprocessing steps would enhance reproducibility.
The paper acknowledges the challenge of emotional pronunciation differences across languages, which can lead to misclassification. Additionally, the reliance on a small number of labeled samples may limit the applicability of the method in more complex scenarios.
The proposed framework has significant implications for low-resource languages in emotional recognition tasks, potentially enhancing multilingual communication technologies and applications in areas such as mental health monitoring, customer service, and human-computer interaction.
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, the two communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established real-time-capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
Primary: University of California San Diego
All Institutions: University of California San Diego
The paper presents a comprehensive framework for real-time human-AI musical co-performance, utilizing latent diffusion models for generating instrumental accompaniment. The methodology effectively addresses the challenges of latency in generative models, and the results indicate strong potential for practical applications in live music settings.
The paper presents a novel framework for real-time human-AI musical co-performance utilizing latent diffusion models (LDMs) for generating instrumental accompaniment. The methodology is well-structured, combining a MAX/MSP front-end with a Python inference server, which is a significant step in bridging the gap between real-time audio processing and advanced AI models. The sliding-window look-ahead protocol is a clever approach to managing the inherent latency of generative models, allowing for continuous audio generation. The introduction of consistency distillation to reduce sampling time while maintaining audio quality is particularly innovative. However, the paper could benefit from a more detailed exploration of the implications of the look-ahead depth on musical coherence and generation quality.
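The sliding-window look-ahead protocol reduces to a simple loop: buffer the most recent context audio, and each time a look-ahead's worth of new input arrives, hand the window to the model, which must return the next chunk of accompaniment before that window elapses. The sketch below illustrates only this buffering logic; the function name, window sizes, and the bare-callable `model` interface are assumptions, and the OSC/UDP transport to the MAX/MSP front-end is not shown.

```python
import numpy as np

def sliding_window_accompaniment(stream, model, sr=16000, context_s=4.0):
    """Drive a generative model from a live audio stream.

    `stream` is assumed to yield fixed-size chunks of mono audio (one
    look-ahead hop each); `model` is any callable mapping the current
    context window to the next chunk of accompaniment. Both are
    placeholders standing in for the paper's diffusion model.
    """
    ctx_len = int(context_s * sr)
    context = np.zeros(ctx_len, dtype=np.float32)  # warm-start with silence
    for chunk in stream:
        # Slide the window: append new input, keep only the last context_s.
        context = np.concatenate([context, chunk])[-ctx_len:]
        # Real-time constraint: the model must return before the next hop
        # of input arrives, or front-end playback will underrun.
        yield model(context)
```

The trade-off the paper identifies lives in this loop: a deeper look-ahead gives the model more time to sample but forces it to predict further past the available context.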
The experimental setup is robust, utilizing the Slakh2100 dataset and a clear methodology for evaluating musical coherence, beat alignment, and audio quality. The results demonstrate strong performance across various configurations, showcasing the effectiveness of the proposed models in both retrospective and look-ahead regimes. The use of objective metrics such as COCOLA and Beat F1 scores provides a solid foundation for assessing the models' performance. However, the paper lacks a detailed comparison of subjective evaluations alongside the objective metrics, which would enhance the understanding of the models' performance from a listener's perspective.
The authors have made significant efforts to ensure reproducibility by providing access to the model code, pre-trained checkpoints, and detailed descriptions of the experimental setup. The inclusion of GitHub repositories and a demo page further aids in this regard. However, the paper could improve by providing clearer instructions on the setup process for users who may not be familiar with the technologies used, such as MAX/MSP and the specific configurations for the Python inference server.
One limitation of the study is the reliance on a specific dataset (Slakh2100), which may not fully represent the diversity of musical styles and contexts that the system could encounter in real-world applications. Additionally, while the look-ahead mechanism is innovative, it introduces a trade-off between latency and generation quality that may not be fully addressed in the current framework. The paper also does not explore the potential for user customization or adaptation of the system for different musical genres or performance contexts.
The proposed framework has significant implications for the field of music technology and AI, as it opens up new avenues for real-time collaboration between human musicians and AI systems. This could lead to enhanced creative possibilities in live performance settings, potentially transforming how music is created and experienced. The integration of AI into live performance also raises questions about authorship and the role of technology in artistic expression, which could spark further research and discussion in the field.
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where masking and predictive modeling are performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. The predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss function, forcing the representations to encode both time and frequency traits. Experiments are performed on a variety of speech and audio tasks, where we illustrate that the ULTRAS framework achieves improved performance over other established baselines.
Primary: Indian Institute of Science
All Institutions: Indian Institute of Science
The main contribution of this paper is the introduction of the ULTRAS framework, which effectively integrates self-supervised learning techniques for joint modeling of audio and speech signals, showcasing significant improvements in performance across diverse tasks. This work represents a meaningful advancement in the field, addressing existing limitations in audio representation learning and providing a foundation for future research.
The proposed ULTRAS framework introduces a novel approach to self-supervised learning by integrating long-context masking and joint predictive modeling of both spectral and temporal targets. This methodology is a significant advancement over existing models, which typically focus on either temporal or spectral features separately. The use of transformer architecture to encode log-mel spectrograms, combined with a unique loss function that balances spectral and temporal predictions, showcases a well-thought-out design that addresses the limitations of previous models. The masking strategy, which operates over longer audio segments, is particularly innovative and is likely to enhance the model's ability to capture contextual information effectively.
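A combined loss over spectral and temporal targets can be sketched as two MSE terms on the masked frames. This is an illustrative sketch only: the function name, the mean-over-frequency pooling used for the temporal term, and the `alpha` weighting are assumptions, not ULTRAS's exact formulation.

```python
import numpy as np

def combined_masked_loss(pred, target, mask, alpha=0.5):
    """Joint spectral/temporal objective over masked log-mel patches.

    pred, target: (time, freq) arrays of patch features.
    mask: boolean (time,) array marking the masked frames to predict.
    The spectral term compares full frequency profiles at masked frames;
    the temporal term compares the per-frame energy trajectory (mean over
    frequency), so the representation must encode both traits.
    """
    spectral = np.mean((pred[mask] - target[mask]) ** 2)
    temporal = np.mean((pred[mask].mean(axis=-1) - target[mask].mean(axis=-1)) ** 2)
    return alpha * spectral + (1 - alpha) * temporal
```

With `alpha` between 0 and 1, neither objective can be satisfied by ignoring the other, which is the balancing behavior the review attributes to the combined loss.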
The experiments conducted across a diverse set of speech and audio tasks demonstrate the robustness of the ULTRAS framework. The paper provides comprehensive evaluations using multiple datasets, including LibriSpeech and AudioSet, and compares the performance against established baselines. The results indicate that ULTRAS consistently outperforms these baselines, particularly in scenarios where both speech and audio tasks are involved. The inclusion of ablation studies further strengthens the findings by illustrating the contribution of each component of the proposed method.
The paper outlines the implementation details, including the pre-training and evaluation protocols, which are crucial for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to replicate the results independently. Clearer documentation or a supplementary repository would enhance reproducibility.
One limitation of the study is the reliance on a relatively small dataset for some experiments (200 hours), which may affect the generalizability of the results. Additionally, while the model shows improved performance, it is not clear how it scales with larger datasets or more complex tasks. The paper could also benefit from a more thorough discussion of potential biases in the datasets used.
The ULTRAS framework has the potential to significantly impact the fields of audio and speech processing by providing a unified approach that can be applied across various tasks. Its ability to learn robust representations from both speech and general audio signals could lead to advancements in applications such as automatic speech recognition, emotion recognition, and environmental sound classification. The implications of this work extend to improving the efficiency of training models in low-resource settings, thereby democratizing access to advanced audio processing technologies.
The human auditory system has the ability to selectively focus on key speech elements in an audio stream while giving secondary attention to less relevant areas such as background noise or distortion, dynamically adjusting its attention over time. Inspired by the recent success of attention models, this study introduces a dual-path attention module in the bottleneck layer of a concurrent speech enhancement network. Our study proposes an attention-based dual-path RNN (DAT-RNN), which, when combined with the modified complex-valued frequency transformation network (CFTNet), forms the DAT-CFTNet. This attention mechanism allows for precise differentiation between speech and noise in time-frequency (T-F) regions of spectrograms, optimizing both local and global context information processing in the CFTNet. Our experiments suggest that DAT-CFTNet consistently improves over existing models, including CFTNet and DCCRN, in terms of speech intelligibility and quality. Moreover, the proposed model exhibits superior performance in enhancing speech intelligibility for cochlear implant (CI) recipients, who are known to have severely limited T-F hearing restoration (e.g., >10%). CI listener studies in noisy settings show that the proposed solution is capable of suppressing non-stationary noise while avoiding the musical artifacts often seen in traditional speech enhancement methods. The implementation of the proposed model will be publicly available.
Primary: Chittagong University of Engineering and Technology
All Institutions: Chittagong University of Engineering and Technology
The main contribution of this research is the introduction of the DAT-CFTNet, which effectively enhances speech intelligibility for cochlear implant users through an innovative dual-path attention mechanism. This work represents a significant step forward in speech enhancement technologies, particularly in challenging acoustic environments.
The proposed methodology introduces a novel dual-path attention mechanism integrated into a complex-valued frequency transformation network (CFTNet), which is a significant advancement in the field of speech enhancement, particularly for cochlear implant users. The combination of intra-chunk and inter-chunk RNNs with attention modules allows for enhanced modeling of speech and noise dynamics in time-frequency representations. The detailed architecture and the rationale behind the design choices are well articulated, showcasing a thoughtful approach to addressing the limitations of existing models.
The experiments are robust, employing a comprehensive dataset that includes various noise conditions and SNR levels. The evaluation metrics used (STOI, PESQ, SISDR) are appropriate for assessing speech intelligibility and quality. The results demonstrate significant improvements over baseline models, indicating the effectiveness of the proposed approach. However, the paper could benefit from more detailed comparisons with state-of-the-art methods and a discussion on the statistical significance of the results.
The paper lacks sufficient implementation details that would facilitate reproducibility. While it mentions the use of a specific dataset and the architecture of the model, there are no code repositories or links to a demo that would allow other researchers to replicate the findings. Providing access to the model and training scripts would greatly enhance reproducibility.
One limitation is the reliance on objective metrics without a thorough subjective evaluation involving human listeners. While objective scores are important, subjective assessments are crucial for applications in speech enhancement, especially for cochlear implant users. Additionally, the model's complexity may limit its applicability in real-time scenarios, which is a critical factor for practical implementations.
The proposed DAT-CFTNet has the potential to significantly improve the quality of life for cochlear implant recipients by enhancing speech intelligibility in noisy environments. This advancement could lead to better communication and social interactions for individuals with hearing impairments. The public availability of the model also encourages further research and development in the field. The main contribution of this research is the introduction of the DAT-CFTNet, which effectively enhances speech intelligibility for cochlear implant users through an innovative dual-path attention mechanism. This work represents a significant step forward in speech enhancement technologies, particularly in challenging acoustic environments.
Recent diffusion-based text-to-speech (TTS) models achieve high naturalness and expressiveness, yet often suffer from speaker drift, a subtle, gradual shift in perceived speaker identity within a single utterance. This underexplored phenomenon undermines the coherence of synthetic speech, especially in long-form or interactive settings. We introduce the first automatic framework for detecting speaker drift by formulating it as a binary classification task over utterance-level speaker consistency. Our method computes cosine similarity across overlapping segments of synthesized speech and prompts large language models (LLMs) with structured representations to assess drift. We provide theoretical guarantees for cosine-based drift detection and demonstrate that speaker embeddings exhibit meaningful geometric clustering on the unit sphere. To support evaluation, we construct a high-quality synthetic benchmark with human-validated speaker drift annotations. Experiments with multiple state-of-the-art LLMs confirm the viability of this embedding-to-reasoning pipeline. Our work establishes speaker drift as a standalone research problem and bridges geometric signal analysis with LLM-based perceptual reasoning in modern TTS.
Primary: University of Amsterdam
All Institutions: University of Amsterdam, Georgia Institute of Technology, Halmstad University
This paper introduces a pioneering framework for automatic speaker drift detection in synthesized speech, leveraging cosine similarity and LLMs to enhance the coherence of TTS systems. The methodology is innovative, and the experimental results demonstrate substantial technical impact, making it a valuable contribution to the field of machine learning and speech synthesis.
The proposed methodology effectively addresses the issue of speaker drift in synthesized speech by formulating it as a binary classification task. The use of cosine similarity to assess speaker consistency is both innovative and theoretically justified, providing a solid foundation for the proposed framework. The integration of large language models (LLMs) for reasoning based on structured representations of similarity scores is a novel approach that bridges low-level acoustic features with high-level cognitive evaluation. The construction of a synthetic benchmark dataset with human-validated annotations is a significant contribution, allowing for systematic evaluation of the proposed method.
The experiments conducted are robust, with a clear evaluation strategy involving multiple state-of-the-art LLMs. The results demonstrate the effectiveness of the proposed framework in detecting speaker drift, outperforming baseline methods. The use of F1 scores and accuracy as evaluation metrics is appropriate, and the ablation studies provide insights into the impact of different design choices on performance. However, the dataset size could be larger to enhance the generalizability of the findings.
The paper provides sufficient details regarding the methodology and experimental setup, allowing for reproducibility. However, the lack of a publicly accessible dataset or code repository limits the ease with which other researchers can replicate the results. Providing a demo or project URL would enhance reproducibility.
One limitation is the reliance on synthetic data, which may not fully capture the complexities of real-world speaker drift scenarios. Additionally, while the theoretical guarantees for cosine similarity are compelling, the practical implications of these guarantees in diverse acoustic environments remain to be explored. The dataset's size and diversity may also restrict the generalization of the findings.
The implications of this research are significant for applications in TTS systems, particularly in enhancing user experience in interactive and long-form speech applications. By addressing speaker drift, the framework can improve the coherence and naturalness of synthesized speech, which is crucial for virtual assistants, audiobooks, and other multimedia applications. The work also opens avenues for further research in speaker consistency and the integration of LLMs in audio processing tasks.
This paper presents the submission of the S4 team to the Singing Voice Conversion Challenge 2025 (SVCC2025)-a novel singing style conversion system that advances fine-grained style conversion and control within in-domain settings. To address the critical challenges of style leakage, dynamic rendering, and high-fidelity generation with limited data, we introduce three key innovations: a boundary-aware Whisper bottleneck that pools phoneme-span representations to suppress residual source style while preserving linguistic content; an explicit frame-level technique matrix, enhanced by targeted F0 processing during inference, for stable and distinct dynamic style rendering; and a perceptually motivated high-frequency band completion strategy that leverages an auxiliary standard 48kHz SVC model to augment the high-frequency spectrum, thereby overcoming data scarcity without overfitting. In the official SVCC2025 subjective evaluation, our system achieves the best naturalness performance among all submissions while maintaining competitive results in speaker similarity and technique control, despite using significantly less extra singing data than other top-performing systems. Audio samples are available online.
Primary: Xi'an Jiaotong University
All Institutions: Xi'an Jiaotong University, Fudan University, Wheatland Culture and Media Ltd.
The main contribution of this paper is the introduction of a controllable singing style conversion system that effectively mitigates style leakage and enhances dynamic rendering through innovative methodologies. This work significantly advances the state of the art in singing voice conversion, demonstrating high fidelity and naturalness even with limited training data, and sets a strong foundation for future research in this domain.
The paper introduces a novel approach to singing style conversion that effectively addresses style leakage and dynamic rendering issues through a boundary-aware semantic bottleneck and an explicit technique matrix. The methodology is well-structured, leveraging phoneme-level pooling to enhance control over the conversion process. The use of auxiliary models for high-frequency band completion is particularly innovative, allowing the authors to achieve high fidelity despite data limitations. The integration of targeted pitch processing during inference further enhances the system's performance, demonstrating a comprehensive understanding of the challenges in singing voice conversion.
The experimental setup is robust, with a clear description of the training and evaluation processes. The authors conducted subjective evaluations in the SVCC2025 challenge, achieving the best naturalness score among all submissions, which underscores the effectiveness of their approach. The ablation studies provide valuable insights into the contributions of various components, validating the importance of the boundary-aware pooling and technique matrix in reducing style leakage and improving controllability.
The paper provides sufficient details regarding the methodology and experimental setup, including the training stages and the use of specific models and datasets. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to replicate the experiments and build upon the findings.
While the paper presents significant advancements, it does not address the potential challenges of generalizing the model to out-of-domain singing styles or the limitations of the dataset used for training. Additionally, the reliance on specific phoneme boundaries and technique annotations may limit the model's applicability in more diverse or less structured datasets.
The advancements in controllable singing style conversion have implications for various applications, including music production, voice synthesis for entertainment, and personalized audio experiences. The techniques developed could also be adapted for other audio processing tasks, contributing to the broader field of generative audio systems.
In biometric systems, it is a common practice to associate each sample or template with a specific individual. Nevertheless, recent studies have demonstrated the feasibility of generating "morphed" biometric samples capable of matching multiple identities. These morph attacks have been recognized as potential security risks for biometric systems. However, most research on morph attacks has focused on biometric modalities that operate within the image domain, such as the face, fingerprints, and iris. In this work, we introduce Time-domain Voice Identity Morphing (TD-VIM), a novel approach for voice-based biometric morphing. This method enables the blending of voice characteristics from two distinct identities at the signal level, creating morphed samples that present a high vulnerability for speaker verification systems. Leveraging the Multilingual Audio-Visual Smartphone database, our study created four distinct morphed signals based on morphing factors and evaluated their effectiveness using a comprehensive vulnerability analysis. To assess the security impact of TD-VIM, we benchmarked our approach using the Generalized Morphing Attack Potential (G-MAP) metric, measuring attack success across two deep-learning-based Speaker Verification Systems (SVS) and one commercial system, Verispeak. Our findings indicate that the morphed voice samples achieved a high attack success rate, with G-MAP values reaching 99.40% on iPhone-11 and 99.74% on Samsung S8 in text-dependent scenarios, at a false match rate of 0.1%.
Primary: Indian Institute of Technology Kharagpur
All Institutions: Indian Institute of Technology Kharagpur, Norwegian University of Science and Technology (NTNU)
This paper introduces TD-VIM, a novel voice morphing technique that significantly enhances the vulnerability of speaker verification systems, thereby emphasizing the urgent need for improved security protocols in biometric applications. The comprehensive methodology and rigorous experimental validation contribute valuable insights to the field of biometric security and machine learning.
The proposed Time-Domain Voice Identity Morphing (TD-VIM) method innovatively operates at the signal level, allowing for morphing without reliance on feature embeddings or reference text, which addresses limitations found in previous methods. The methodology is well-structured, with clear steps for speaker selection, signal processing, and morphing, making it accessible for replication.
The experiments utilize a robust dataset (MAVS) and benchmark the TD-VIM against multiple speaker verification systems, demonstrating high attack success rates. The use of the Generalized Morphing Attack Potential (G-MAP) metric is a significant contribution, providing a comprehensive measure of vulnerability across different devices and languages.
The authors provide a GitHub repository for the source code and state that the morphed files and original dataset can be obtained upon request, promoting transparency and reproducibility.
The study does not address the potential ethical implications of morphing techniques in biometric security, nor does it explore the long-term effectiveness of the proposed method against evolving SVS technologies. Additionally, the reliance on specific datasets may limit generalizability.
The findings highlight significant vulnerabilities in voice biometric systems, particularly in sensitive applications such as banking and finance, raising awareness about the need for enhanced security measures in biometric verification systems.
Generating long sequences with structural coherence remains a fundamental challenge for autoregressive models across sequential generation tasks. In symbolic music generation, this challenge is particularly pronounced, as existing methods are constrained by the severe error accumulation inherent to autoregressive models, leading to poor performance in music quality and structural integrity. In this paper, we propose the Anchored Cyclic Generation (ACG) paradigm, which relies on anchor features from already-generated music to guide subsequent generation during the autoregressive process, effectively mitigating error accumulation in autoregressive methods. Based on the ACG paradigm, we further propose the Hierarchical Anchored Cyclic Generation (Hi-ACG) framework, which employs a systematic global-to-local generation strategy and is highly compatible with our specifically designed piano token, an efficient musical representation. The experimental results demonstrate that, compared to traditional autoregressive models, the ACG paradigm reduces the cosine distance between predicted feature vectors and ground-truth semantic vectors by an average of 34.7%. In long-sequence symbolic music generation tasks, the Hi-ACG framework significantly outperforms existing mainstream methods in both subjective and objective evaluations. Furthermore, the framework exhibits excellent task generalization capabilities, achieving superior performance in related tasks such as music completion.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to long-sequence symbolic music generation through the Anchored Cyclic Generation paradigm, demonstrating significant improvements in quality and structural integrity. The methodology is innovative and well-supported by experimental results, marking a meaningful contribution to the field of machine learning in music generation.
The paper introduces the Anchored Cyclic Generation (ACG) paradigm, which effectively addresses the error accumulation problem in autoregressive models for long-sequence symbolic music generation. The methodology is well-structured, employing a hierarchical approach through the Hi-ACG framework that combines global and local generation strategies. The use of a novel piano token representation enhances efficiency and interpretability. The proposed methods are theoretically sound, supported by mathematical analysis, and demonstrate a clear innovation in the field of music generation.
The experimental evaluation is robust, utilizing both objective and subjective metrics to assess the performance of the proposed models against established baselines. The datasets used (MuseScore and POP909) are appropriate for the task, and the results indicate significant improvements in generation quality, as evidenced by a 34.7% reduction in cosine distance between predicted and ground-truth features. The comprehensive evaluation strategy enhances the credibility of the findings.
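For reference, the cosine distance underlying the reported reduction is the standard quantity below; the 34.7% figure is an average relative reduction in this distance between predicted feature vectors and ground-truth semantic vectors:

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between a predicted feature vector u and a
    ground-truth semantic vector v: 1 minus their cosine similarity.
    Zero means perfectly aligned directions; larger values mean the
    prediction has drifted from the ground-truth semantics.
    """
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```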
The paper provides sufficient details regarding the experimental setup, including model architecture, training procedures, and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing these resources to facilitate validation of results.
The paper acknowledges limitations in fine-grained control during generation and the potential loss of subtle timing nuances in the piano token representation. Additionally, the focus on piano music may restrict the applicability of the framework to other musical contexts. Future research should address these limitations by integrating more expressive tokens and extending the framework to multi-track music generation.
The proposed ACG paradigm has the potential to significantly advance the field of symbolic music generation, offering new avenues for creating high-quality, structurally coherent music. Its principles could be adapted to other long-sequence generation tasks beyond music, such as text generation and structured content synthesis, thereby broadening its impact across various domains.