We introduce Nemotron 3 Nano Omni, the latest model in the Nemotron multimodal series and the first to natively support audio inputs alongside text, images, and video. Nemotron 3 Nano Omni delivers consistent accuracy improvements over its predecessor, Nemotron Nano V2 VL, across all modalities, enabled by advances in architecture, training data and recipes. In particular, Nemotron 3 delivers leading results in real-world document understanding, long audio-video comprehension, and agentic computer use. Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni further incorporates innovative multimodal token-reduction techniques to deliver substantially lower inference latency and higher throughput than other models of similar size. We are releasing model checkpoints in BF16, FP8, and FP4 formats, along with portions of the training data and codebase to facilitate further research and development.
Primary: NVIDIA
All Institutions: NVIDIA
The main contribution of this paper is the introduction of Nemotron 3 Nano Omni, an efficient multimodal model that significantly enhances audio, visual, and textual reasoning capabilities. This work represents a substantial step forward in the integration of multimodal inputs, addressing key challenges in the field while providing a solid foundation for future research and applications.
The paper presents a comprehensive multimodal model, Nemotron 3 Nano Omni, which integrates audio, text, images, and video inputs. The methodology is robust, employing a mixture-of-experts (MoE) architecture that enhances processing efficiency for long multimodal sequences. The introduction of dynamic image resolution and Conv3D-based temporal video compression represents significant advancements in handling multimodal data. The staged training approach is particularly noteworthy, as it addresses the challenges of modality alignment and catastrophic forgetting, ensuring stable cross-modal integration. The innovative multimodal token-reduction techniques further enhance inference efficiency, which is crucial for real-time applications.
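As a concrete illustration of the Conv3D-based temporal video compression mentioned above, the sketch below (a minimal PyTorch example with illustrative dimensions and stride, not the released Nemotron implementation) shows how a strided 3D convolution merges groups of adjacent frames so that fewer visual tokens reach the language backbone.

```python
import torch
import torch.nn as nn

class TemporalVideoCompressor(nn.Module):
    """Illustrative sketch: collapse every `temporal_stride` frames into one."""
    def __init__(self, dim: int = 1024, temporal_stride: int = 4):
        super().__init__()
        # Stride only along the time axis; spatial layout is untouched.
        self.conv = nn.Conv3d(dim, dim,
                              kernel_size=(temporal_stride, 1, 1),
                              stride=(temporal_stride, 1, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, height, width, dim) features from a vision encoder
        x = frames.permute(0, 4, 1, 2, 3)        # -> (B, C, T, H, W) for Conv3d
        x = self.conv(x)                         # T shrinks by temporal_stride
        return x.permute(0, 2, 3, 4, 1)          # back to (B, T', H, W, C)

video_feats = torch.randn(1, 16, 14, 14, 1024)   # 16 frames of patch features
compressed = TemporalVideoCompressor()(video_feats)
print(compressed.shape)                          # torch.Size([1, 4, 14, 14, 1024])
```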
The experimental evaluation is thorough, with extensive benchmarking across various tasks, including document understanding, audio-visual reasoning, and voice interaction. The model demonstrates leading performance on multiple leaderboards, indicating its effectiveness in real-world applications. The results are quantitatively supported by comparisons with previous models, showcasing significant improvements in accuracy and efficiency. However, the paper could benefit from more qualitative assessments of the model's performance in practical scenarios.
The authors provide model checkpoints in multiple formats (BF16, FP8, FP4) and share portions of the training data and codebase on Hugging Face and GitHub. This transparency enhances reproducibility, allowing other researchers to replicate the experiments and build upon the work. However, the detailed training recipes and hyperparameters could be more explicitly documented to facilitate easier reproduction of the results.
While the model shows impressive results, it may still face challenges with certain edge cases in multimodal reasoning, particularly in noisy or ambiguous inputs. The reliance on large-scale training data and computational resources may limit accessibility for smaller research groups. Additionally, the paper does not address potential biases in the training data, which could affect the model's generalizability across diverse applications.
The advancements presented in this paper have significant implications for various fields, including human-computer interaction, automated content generation, and assistive technologies. The model's ability to process and reason across multiple modalities can enhance user experiences in applications such as virtual assistants, educational tools, and multimedia content analysis. The open release of the model and its components encourages further research and development in multimodal AI, fostering innovation in the field.
To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving strong utility while maintaining competitive privacy across controllable operating points.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Center for Language and Speech Processing, Human Language Technology Center of Excellence (COE)
The main contribution of this paper is the introduction of DiffAnon, a diffusion-based voice anonymization framework that enables explicit and continuous control over prosody preservation, significantly advancing the field of privacy-preserving speech technologies. This work represents a meaningful step forward in balancing the utility-privacy trade-off in voice applications, showcasing the potential for structured prosody control in enhancing both privacy and expressiveness in anonymized speech.
The proposed methodology, DiffAnon, leverages a novel diffusion-based framework with classifier-free guidance to provide continuous control over prosody preservation in voice anonymization. This approach is innovative as it allows for the modulation of the utility-privacy trade-off in a structured manner, which is a significant advancement over existing methods that operate at fixed points. The integration of semantic embeddings from an RVQ codec with a diffusion model is particularly noteworthy, as it combines strengths from both domains to enhance the quality of anonymized speech.
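To make the continuous inference-time control concrete, here is a minimal sketch of standard classifier-free guidance with a prosody-conditioning weight; `denoiser`, `semantic`, and `prosody` are illustrative placeholders, not the DiffAnon API.

```python
def guided_noise_estimate(denoiser, x_t, t, semantic, prosody, w_prosody: float):
    """Interpolate between prosody-conditioned and unconditioned predictions.

    w_prosody = 0 drops prosody entirely (stronger anonymization);
    larger w_prosody preserves more prosodic detail from the source,
    giving a single model a continuum of operating points.
    """
    eps_uncond = denoiser(x_t, t, semantic, prosody=None)
    eps_cond = denoiser(x_t, t, semantic, prosody=prosody)
    return eps_uncond + w_prosody * (eps_cond - eps_uncond)
```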
The experiments are robust, utilizing the VoicePrivacy Challenge 2024 protocol, which provides a standardized framework for evaluating privacy and utility. The results demonstrate that DiffAnon achieves competitive performance across various metrics, including EER for privacy and WER for content preservation, while also showing a clear trade-off between privacy and prosodic fidelity. The systematic evaluation across different prosody guidance weights adds depth to the findings.
The authors have made their code and pretrained models publicly available, which is a strong point for reproducibility. The detailed training and inference setup, including hyperparameters and datasets used, further supports replicability of the results.
While the paper presents a significant advancement, it does not explore the potential impact of varying speaker characteristics on the performance of the model. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other languages or dialects. The paper also does not address the computational costs associated with training and deploying the model in real-world applications.
The ability to anonymize voice while preserving prosody has significant implications for privacy in various applications, including telecommunication, virtual assistants, and voice-based interactions. This work could enhance user trust in voice technologies by providing a means to protect identity while maintaining communicative effectiveness. The structured control over prosody could also lead to advancements in emotional speech synthesis and human-computer interaction.
Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and class imbalance, especially in recognizing complex chords such as non-triads, which are underrepresented in existing datasets. To address these challenges, we reformulate ACR as a segment-level sequence-to-sequence prediction task, where chord sequences are predicted auto-regressively rather than frame by frame. This design mitigates excessive segmentation by detecting chord changes only at segment boundaries. We further introduce two types of token representations and an encoder pre-training method, both specifically designed for time-aligned chord modeling. Experimental results show that our model improves performance in both chord recognition and segmentation, with notable gains for complex and infrequent chord types. These findings demonstrate the effectiveness of segment-level sequence modeling, structured tokenization, and representation learning for advancing chord recognition systems.
Primary: Seoul National University
All Institutions: Seoul National University
This paper presents a significant advancement in automatic chord recognition through a novel segment-level sequence modeling approach, effectively addressing oversegmentation and data imbalance challenges. The methodology is well-structured, and the experimental results demonstrate substantial improvements, marking a meaningful contribution to the field of music information retrieval.
The paper introduces a novel segment-level sequence-to-sequence approach for automatic chord recognition (ACR), effectively addressing oversegmentation and data imbalance issues prevalent in traditional frame-level methods. The use of a Transformer encoder-decoder architecture is well-justified, and the introduction of two token representations (MERGE and SPLIT) demonstrates a thoughtful approach to chord modeling. The encoder pre-training method based on chord similarity is innovative and enhances the model's ability to generalize, particularly for complex chord types.
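The two token schemes can be illustrated as follows, under the assumption that each chord segment carries a root, a quality, and a duration; the paper's exact vocabularies and token semantics may differ.

```python
segments = [("C", "maj", 4), ("A", "min7", 2), ("G", "7", 2)]

# MERGE-style: one composite token per segment.
merge_tokens = [f"{root}:{quality}/{dur}" for root, quality, dur in segments]
# ['C:maj/4', 'A:min7/2', 'G:7/2']

# SPLIT-style: separate root / quality / duration tokens decoded
# auto-regressively, so rare chords share sub-tokens with common ones
# and suffer less from data imbalance.
split_tokens = [tok for root, quality, dur in segments
                for tok in (f"root_{root}", f"qual_{quality}", f"dur_{dur}")]
# ['root_C', 'qual_maj', 'dur_4', 'root_A', ...]
```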
The experiments are comprehensive, utilizing a well-defined dataset of 471 pop songs with manual annotations. The use of 5-fold cross-validation strengthens the reliability of the results. The reported improvements in both chord recognition and segmentation metrics, particularly for complex chords, are significant and demonstrate the effectiveness of the proposed methods. The ablation studies provide clear insights into the contributions of each component of the model.
The paper includes sufficient implementation details, such as data preprocessing, model architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The availability of the code repository enhances this aspect, allowing other researchers to replicate the results and build upon this work.
While the paper addresses several critical challenges in ACR, it does not discuss the potential limitations of the proposed methods, such as the reliance on the quality of the training dataset or the challenges in generalizing to genres or styles not represented in the dataset. Additionally, the model's performance on real-world recordings versus studio recordings could be explored further.
The advancements in chord recognition could have significant implications for music information retrieval, music education, and automated music composition systems. By improving the recognition of complex chords, this work could enhance tools for musicians and composers, making music analysis more accessible and efficient.
In this paper, we propose MelShield, a robust, in-generation, keyed audio watermarking framework that embeds identifiable signals into AI-generated audio for copyright protection and reliable attribution. Specifically, MelShield operates in the Mel-spectrogram domain during the generation process, targeting intermediate acoustic representations in Mel-conditioned pipelines for text-to-speech (TTS) generation. The core idea is to treat the intermediate Mel-spectrogram as the host signal and embed a short binary payload via low-energy, keyed spread-spectrum perturbations distributed across carefully selected time-frequency regions prior to waveform synthesis. By performing watermarking before vocoder inference, MelShield remains plug-and-play for Mel-conditioned TTS architectures and does not require modifying or retraining the underlying vocoder, such as DiffWave or HiFi-GAN. Moreover, the multi-user keyed construction enables scalable user-specific attribution, while the keyed verification mechanism limits unauthorized decoding, thereby reducing the risk of large-scale extractor probing and adversarial analysis. Extensive experiments on DiffWave and HiFi-GAN demonstrate that MelShield achieves reliable watermark extraction, approaching 100% bit accuracy, even under signal distortions such as compression and additive noise, while preserving high perceptual audio quality.
Primary: Queen's University
All Institutions: Queen's University, University of Waterloo
MelShield presents a novel in-generation audio watermarking framework that effectively integrates into TTS systems, enhancing copyright protection and attribution mechanisms. The comprehensive evaluation and innovative methodology position this work as a significant contribution to the field of audio processing and machine learning.
The methodology presented in MelShield is innovative, leveraging a keyed spread-spectrum approach for watermarking directly in the Mel-spectrogram domain of TTS systems. This is a significant advancement over traditional post-hoc watermarking methods, as it integrates watermarking seamlessly into the audio generation pipeline without requiring modifications to existing vocoders. The use of low-energy perturbations and adaptive masking to maintain audio quality while embedding watermarks is particularly noteworthy. The authors provide a clear and systematic approach to embedding and extracting watermarks, which is well-justified and theoretically sound.
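A minimal sketch of the keyed spread-spectrum idea follows, assuming a key-seeded pseudo-random chip per payload bit spread uniformly over the whole spectrogram; MelShield's time-frequency region selection and adaptive masking are more elaborate than this.

```python
import numpy as np

def keyed_chips(key: int, n_bits: int, shape) -> np.ndarray:
    """Key-seeded +/-1 pseudo-noise carriers, one per payload bit."""
    rng = np.random.default_rng(key)
    return rng.choice([-1.0, 1.0], size=(n_bits, *shape))

def embed(mel: np.ndarray, bits, key: int, alpha: float = 0.05) -> np.ndarray:
    chips = keyed_chips(key, len(bits), mel.shape)
    signs = np.where(np.asarray(bits) == 1, 1.0, -1.0)
    # Low-energy perturbation: each bit is spread across the full T-F plane.
    return mel + alpha * np.tensordot(signs, chips, axes=1)

def extract(marked: np.ndarray, key: int, n_bits: int) -> np.ndarray:
    # Correlate against the same keyed chips; the host acts as noise, so
    # reliability grows with spectrogram size relative to alpha.
    chips = keyed_chips(key, n_bits, marked.shape)
    corr = (chips * marked).reshape(n_bits, -1).sum(axis=1)
    return (corr > 0).astype(int)

mel = np.random.randn(80, 200)            # host Mel-spectrogram (F, T)
wm = embed(mel, [1, 0, 1, 1], key=42)
print(extract(wm, key=42, n_bits=4))      # [1 0 1 1]
```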
The experimental evaluation is comprehensive, utilizing two prominent TTS vocoders (DiffWave and HiFi-GAN) and a robust dataset (LJSpeech 1.1). The results demonstrate high bit accuracy for watermark recovery under various conditions, including common signal distortions. The paper effectively compares MelShield against existing watermarking methods, showcasing its superior performance in terms of robustness and fidelity. The use of multiple evaluation metrics (PESQ, STOI, DNSMOS) adds credibility to the results, although the paper could benefit from more extensive user studies to assess perceptual quality in real-world scenarios.
The paper provides a detailed description of the experimental setup, including the datasets, vocoder configurations, and watermark embedding parameters. However, it lacks a publicly accessible code repository or demo URL, which would enhance reproducibility and allow other researchers to validate the findings. Clearer documentation of the implementation would also aid in replicating the experiments.
One limitation is the reliance on specific vocoders, which may not generalize to all TTS systems. While the authors claim model-agnostic deployment, the performance may vary with different architectures not tested in the study. Additionally, the paper does not address potential vulnerabilities to advanced adversarial attacks that could target the watermarking system. The scalability of the approach in high-demand real-world applications remains to be fully explored.
The implications of this work are significant, particularly in the context of copyright protection and attribution for AI-generated audio. As deepfake technologies become more prevalent, robust watermarking solutions like MelShield can help mitigate risks associated with misinformation and unauthorized content distribution. The framework could be applied across various domains, including media production, digital rights management, and content verification systems.
Generating expressive conducting gestures from music is a challenging cross-modal motion synthesis problem: the output must follow long-range musical structure, preserve beat-level synchronization, and remain plausible as a fine-grained 3D human performance. Existing conducting-motion studies are often limited by sparse pose representations, small-scale data, or evaluation protocols that do not directly measure whether music and gesture are mutually aligned. This paper presents TransConductor, a Transformer-based framework for music-driven conducting gesture generation. We introduce ConductorMotion, a SMPL-parameter data construction pipeline that recovers detailed body motion from conducting videos and forms a dataset targeted at professional conducting gestures. Given acoustic descriptors extracted from audio and an initial pose, TransConductor uses a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder to autoregressively predict SMPL pose parameters. To better assess artistic correspondence, we further build a retrieval-based evaluation model that embeds music and gestures into a shared space and yields FID, modality distance, multi-modality distance, and diversity metrics. Experiments show that TransConductor outperforms dance-generation and conducting-generation baselines, while ablations verify the benefits of the Transformer backbone and the proposed alignment loss.
Primary: Beijing Jiaotong University
All Institutions: Beijing Jiaotong University, Malou Tech Inc, South-Central Minzu University, Fudan University, Renmin University of China
This paper presents a significant advancement in the field of music-driven motion synthesis through the introduction of a Transformer-based framework for generating conducting gestures. The methodology effectively combines detailed pose representation with a novel evaluation approach, setting a new standard for future research in this area.
The proposed methodology introduces a novel Transformer-based framework, TransConductor, which effectively addresses the challenge of generating conducting gestures from music. The use of SMPL parameters for detailed pose representation is a significant advancement over traditional sparse keypoint methods, allowing for a more nuanced and expressive depiction of conducting motions. The dual encoder-decoder architecture, comprising a Trans-Temporal Music Encoder and a Trans-Temporal Conducting Gesture Decoder, is well-conceived, leveraging the strengths of self-attention mechanisms to capture long-range dependencies in both music and gesture. The introduction of a retrieval-based evaluation model further enhances the methodology by providing a more meaningful assessment of the artistic correspondence between music and gestures, which is often overlooked in traditional metrics.
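The autoregressive prediction described above reduces to a loop of the following shape; a GRU cell stands in for the Trans-Temporal decoder here, and all dimensions are illustrative assumptions (72-D SMPL body pose, 128-D acoustic descriptors), not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class GestureDecoderLoop(nn.Module):
    def __init__(self, music_dim=128, pose_dim=72, hidden=256):
        super().__init__()
        self.rnn = nn.GRUCell(music_dim + pose_dim, hidden)  # stand-in for the
        self.to_pose = nn.Linear(hidden, pose_dim)           # Transformer decoder

    def forward(self, music_feats: torch.Tensor, init_pose: torch.Tensor):
        # music_feats: (T, music_dim); init_pose: (pose_dim,)
        h = torch.zeros(1, self.rnn.hidden_size)
        pose, poses = init_pose.unsqueeze(0), []
        for t in range(music_feats.shape[0]):
            inp = torch.cat([music_feats[t].unsqueeze(0), pose], dim=-1)
            h = self.rnn(inp, h)
            pose = self.to_pose(h)           # next SMPL pose, fed back in
            poses.append(pose)
        return torch.cat(poses)              # (T, pose_dim)

out = GestureDecoderLoop()(torch.randn(120, 128), torch.zeros(72))
print(out.shape)                             # torch.Size([120, 72])
```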
The experimental evaluation is robust, comparing the proposed model against established baselines in dance and conducting generation. The reported metrics (FID, M-Dist, MM-Dist, and diversity) indicate significant improvements in the quality and alignment of generated gestures with the corresponding music. The ablation studies convincingly demonstrate the contributions of the Transformer architecture and the alignment loss, supporting the claims of enhanced performance. The diversity in the dataset, covering various conducting styles and musical emotions, strengthens the validity of the results and showcases the model's adaptability.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as code availability or dataset access, which are crucial for reproducibility. The absence of a demo or project URL further limits the ability of other researchers to validate and build upon this work.
The paper acknowledges certain limitations, including the reliance on monocular reconstruction, which may not capture all nuances of conducting gestures, particularly baton motion and finger articulation. Additionally, the model struggles with very large gestures in energetic music and may lag during fast transitions. These limitations suggest areas for future research, such as incorporating hand-aware reconstruction techniques and exploring longer musical contexts.
The implications of this work extend beyond academic interest; it has potential applications in music education, virtual performances, and intelligent tutoring systems. By automating the generation of conducting gestures, this research could enhance interactive music learning environments and provide valuable tools for musicians and educators. The framework could also inspire further exploration of cross-modal motion synthesis in other artistic domains, promoting a deeper understanding of the interplay between music and movement.
Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users' instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.
Primary: South China University of Technology
All Institutions: South China University of Technology
MindMelody presents a novel approach to EEG-driven personalized music intervention, demonstrating a sophisticated integration of machine learning techniques that enhance the adaptability and effectiveness of music therapy. The paper's contributions to the field of affective computing and music generation are substantial, offering a promising direction for future research and applications in mental health.
The methodology presented in MindMelody is innovative, integrating a hybrid Transformer-GNN architecture for EEG decoding with a Retrieval-Augmented Generation (RAG) mechanism to formulate structured intervention plans. The use of a Hierarchical EEG Controller to modulate a pretrained music generation backbone is particularly noteworthy, as it allows for fine-grained control over the music output based on real-time EEG data. The closed-loop feedback mechanism that continuously adapts to user feedback enhances the system's responsiveness and personalization, which is a significant advancement over static music generation systems.
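The closed-loop behavior can be sketched as below; `decode_affect` and `generate_music` are hypothetical stand-ins for the Transformer-GNN decoder and the controlled music backbone, and the proportional parameter update is an assumption for illustration only.

```python
import itertools
import random

def closed_loop_session(eeg_stream, decode_affect, generate_music, n_cycles=10):
    """Each cycle: read EEG, decode valence-arousal, steer generation params."""
    params = {"tempo": 90, "valence_target": 0.0}
    for _ in range(n_cycles):
        window = next(eeg_stream)                 # latest EEG window
        valence, arousal = decode_affect(window)  # global V-A state
        # Illustrative proportional update: arousal drives tempo,
        # valence drives the tonal target of the next clip.
        params["tempo"] = int(60 + 60 * max(0.0, min(1.0, arousal)))
        params["valence_target"] = valence
        yield generate_music(params)

fake_eeg = iter(lambda: [random.random() for _ in range(32)], None)
decode = lambda w: (sum(w) / len(w) - 0.5, random.random())
make_music = lambda p: f"clip@{p['tempo']}bpm"
for clip in itertools.islice(closed_loop_session(fake_eeg, decode, make_music), 3):
    print(clip)
```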
The experiments conducted are robust, utilizing established datasets like DEAP for EEG affect modeling and MusicCaps for controllable music generation. The paper provides comprehensive quantitative metrics, including FAD and various subjective evaluations (Nat.-MOS, Emo.-MOS, Help.), which demonstrate the system's effectiveness in emotional alignment and perceived helpfulness. The pilot user study adds valuable qualitative insights into user experience, although it is limited in scope.
The paper includes detailed descriptions of the experimental setup, including hyperparameters and training procedures, which aids in reproducibility. However, the lack of publicly available code or a demo limits the ability for others to replicate the findings fully.
One limitation is the reliance on a relatively small dataset for training, which may affect the generalizability of the model across diverse populations. Additionally, while the pilot study shows promising results, it is not a clinical validation, and further research is needed to establish long-term efficacy and safety in real-world applications.
The potential applications of MindMelody are significant, particularly in mental health interventions, where personalized music therapy could provide non-invasive and cost-effective support for individuals experiencing emotional distress. The integration of EEG data with music generation could pave the way for more adaptive therapeutic tools in the field of affective computing.
Speech technologies are deployed in high-stakes settings, yet fairness concerns remain fragmented across tasks and disciplines. Existing surveys either adopt a general machine-learning perspective that overlooks speech-specific properties or focus on a single task, missing failure patterns shared across the speech domain. Synthesizing over 400 studies spanning generation and perception tasks and emerging speech-language models, this survey presents a unified framework that links formal fairness definitions to evaluation, diagnosis, and mitigation. We formalize seven fairness definitions adapted to the speech modality and organize the field's conceptual evolution through three paradigms: Robustness, Representation, and Governance. We then ground evaluation metrics in the mathematical cores of these definitions and offer a decision tree for metric selection. We diagnose bias sources along the speech processing pipeline, surfacing speech-specific mechanisms such as channel bias as a demographic proxy and annotation subjectivity in emotion labels. We systematize mitigation strategies across four intervention stages, mapping each to the diagnosed sources. Finally, we identify open challenges and propose directions for future research.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Southern California, NTU Artificial Intelligence Center of Research Excellence
This paper serves as a foundational survey that systematically addresses bias and fairness in speech AI, providing a comprehensive framework that can guide future research and development in this critical area. The authors' approach to synthesizing existing literature and formalizing fairness definitions is a significant contribution to the field, setting the stage for more equitable speech technologies.
The paper presents a comprehensive survey that synthesizes over 400 studies related to bias and fairness in speech AI, establishing a unified framework that links formal fairness definitions to evaluation metrics, bias diagnosis, and mitigation strategies. The authors formalize seven fairness definitions specifically adapted to the speech modality and provide a decision tree for metric selection, which is a novel contribution to the field. The methodology is robust, drawing on a wide range of literature and systematically addressing the unique challenges posed by the speech domain.
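As one concrete instantiation of the performance-parity family of definitions the survey formalizes, the sketch below computes the worst-case word-error-rate gap across demographic groups; the field names and aggregation are illustrative, not the survey's notation.

```python
def wer_parity_gap(results: list) -> float:
    """results: [{'group': str, 'errors': int, 'ref_words': int}, ...]

    Returns max group WER minus min group WER; 0.0 means parity.
    """
    totals = {}
    for r in results:
        t = totals.setdefault(r["group"], [0, 0])
        t[0] += r["errors"]
        t[1] += r["ref_words"]
    wers = {g: e / n for g, (e, n) in totals.items() if n > 0}
    return max(wers.values()) - min(wers.values())

gap = wer_parity_gap([
    {"group": "A", "errors": 52, "ref_words": 1000},
    {"group": "B", "errors": 87, "ref_words": 1000},
])
print(f"max-min WER gap: {gap:.3f}")   # 0.035
```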
While the paper is primarily a survey and does not include original experimental results, it effectively reviews existing literature and identifies gaps in current methodologies. It categorizes bias sources along the speech processing pipeline and systematizes mitigation strategies, which could serve as a foundation for future empirical studies. The depth of analysis into bias mechanisms and fairness paradigms is commendable, although the lack of original experimental validation limits the immediate applicability of the findings.
The survey does not present original experiments, thus reproducibility in the traditional sense does not apply. However, the clear organization of existing literature and the proposed frameworks allow for future researchers to build upon this work in a reproducible manner. The decision tree for metric selection is particularly useful for guiding future empirical studies.
One limitation of the paper is its reliance on existing literature without presenting new empirical data or case studies to validate the proposed frameworks. Additionally, while the survey covers a wide range of topics, it may not address all nuances of bias and fairness in speech technologies, particularly in emerging areas of research. The authors also acknowledge the complexity of navigating fairness in sociotechnical contexts, which may not be fully captured in their framework.
The implications of this work are significant, as it addresses critical issues of bias and fairness in speech technologies that are increasingly deployed in high-stakes environments. By highlighting the need for fairness as a core requirement rather than an afterthought, the paper encourages researchers and practitioners to consider the ethical implications of their technologies. This survey could influence future research directions and policy-making in the field of AI and speech technology.
Audio-visual quality assessment (AVQA) is essential for streaming, teleconferencing, and immersive media. In realistic streaming scenarios, distortions are often asymmetric, where one modality may be severely degraded while the other remains clean. Still, most contemporary AVQA metrics treat audio and video as equally reliable, causing confidence-unaware fusion to emphasize unreliable signals. This paper proposes MCM-AVQA, a multimodal confidence-aware AVQA framework that explicitly estimates modality-specific confidence and injects it into a dedicated audio-visual mixer for cross-modal attention. The Audio-Visual Mixer utilizes frame-level, confidence-guided channel attention to gate fusion, modulating feature interaction between modalities so that high-confidence streams dominate while unreliable inputs are suppressed, preserving temporal degradation patterns. A multi-head visual confidence estimator turns frame-level artifact probabilities into temporally smoothed, clip-level visual confidence scores, while an audio confidence module derives confidence from speech-quality cues without requiring a clean reference. Experiments on multiple AVQA benchmarks show that MCM-AVQA, and specifically its confidence-guided Audio-Visual Mixer, improve correlation with human mean opinion scores and yield more interpretable behavior under real-world asymmetric audio-visual distortions.
Primary: Texas State University
All Institutions: Texas State University
The paper presents MCM-AVQA, a confidence-aware audio-visual quality assessment framework that improves the robustness of quality evaluation under asymmetric distortions. This work significantly advances the state of the art in AVQA by integrating modality-specific confidence into the fusion process, leading to more accurate and interpretable quality assessments.
The proposed MCM-AVQA framework introduces a novel approach to audio-visual quality assessment by explicitly modeling modality-specific confidence and integrating it into a dedicated Audio-Visual Mixer. This methodology allows for dynamic feature gating based on confidence levels, which is a significant advancement over traditional methods that treat audio and video as equally reliable. The use of a multi-head visual confidence estimator and an audio confidence module enhances the robustness of the model under asymmetric distortions, which is a common scenario in real-world applications. The architecture is well-structured, leveraging state-of-the-art transformer models and attention mechanisms, making it a strong contribution to the field.
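A minimal sketch of confidence-guided gating follows, reduced from the paper's frame-level, confidence-guided channel attention to a single clip-level gate per modality for clarity; module sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceGatedFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio, video, conf_a, conf_v):
        # audio/video: (B, dim) clip features; conf_a/conf_v: (B, 1) in [0, 1]
        w = torch.softmax(torch.cat([conf_a, conf_v], dim=-1), dim=-1)  # (B, 2)
        gated_a = w[:, :1] * audio   # high-confidence stream dominates,
        gated_v = w[:, 1:] * video   # the unreliable one is suppressed
        return self.proj(torch.cat([gated_a, gated_v], dim=-1))

fuse = ConfidenceGatedFusion()
out = fuse(torch.randn(2, 256), torch.randn(2, 256),
           torch.tensor([[0.9], [0.1]]), torch.tensor([[0.2], [0.8]]))
print(out.shape)   # torch.Size([2, 256])
```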
The experiments conducted across multiple AVQA benchmarks (LIVE-SJTU, UnB-AV, UnB-AVQ) demonstrate the effectiveness of MCM-AVQA in improving correlation with human mean opinion scores. The results indicate that the model outperforms existing state-of-the-art methods, particularly in scenarios with asymmetric distortions. The ablation studies provide valuable insights into the contributions of each component of the model, reinforcing the importance of confidence-aware fusion. The use of statistical tests to validate performance improvements adds rigor to the evaluation.
The paper provides sufficient details regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the results directly. Including a project URL or demo would significantly enhance reproducibility.
One limitation of the study is the lack of a comprehensive comparison with more recent AVQA methods that may not have been included in the evaluation. Additionally, while the model shows robustness under asymmetric distortions, its performance in extreme distortion scenarios or with novel types of distortions remains untested. The reliance on subjective mean opinion scores for evaluation, while standard, could also introduce variability based on human judgment.
The MCM-AVQA framework has significant implications for real-world applications in streaming, teleconferencing, and immersive media, where audio-visual quality is critical. By improving the accuracy of quality assessments in asymmetric distortion scenarios, this work can enhance user experiences in various multimedia applications. The approach could also be extended to other multimodal quality assessment tasks, potentially influencing future research directions in the field.
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling incur high latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to 8.5x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.
Primary: Amazon AGI
All Institutions: Amazon AGI, National Taiwan University
The paper presents a significant advancement in efficient generative media by introducing a one-step sampling framework that achieves substantially faster inference while maintaining high audio fidelity and semantic relevance. The innovative combination of energy-distance training and representation distillation represents a meaningful contribution to the field of machine learning, particularly in audio generation.
The proposed methodology introduces a novel one-step sampling framework for text-to-audio generation that integrates an energy-distance training objective with representation-level distillation. This approach effectively reduces inference latency while maintaining audio quality, addressing a significant limitation in existing autoregressive models that rely on multi-step sampling. The use of energy-scoring to map Gaussian noise directly to audio latents is innovative and demonstrates a clear departure from traditional diffusion-based methods. The incorporation of distillation from a masked autoregressive model further enhances the model's performance, showcasing a thoughtful combination of techniques to achieve rapid and high-quality audio synthesis.
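For reference, the generic energy-distance statistic between generated and reference latent batches can be written as below; the paper's conditional training objective may differ in detail from this unconditional form.

```python
import torch

def energy_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x: generated latents (n, d); y: reference latents (m, d).

    D^2(X, Y) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, which is zero
    iff the two distributions coincide. The zero diagonals of the
    within-batch terms slightly bias this simple estimator.
    """
    d_xy = torch.cdist(x, y).mean()   # E||X - Y||
    d_xx = torch.cdist(x, x).mean()   # E||X - X'||
    d_yy = torch.cdist(y, y).mean()   # E||Y - Y'||
    return 2 * d_xy - d_xx - d_yy

gen = torch.randn(8, 64, requires_grad=True)
ref = torch.randn(8, 64)
loss = energy_distance(gen, ref)
loss.backward()                       # gradients flow to the generator
```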
The experimental evaluation is comprehensive, utilizing the AudioCaps benchmark for both objective and subjective assessments. The paper reports consistent improvements over existing one-step baselines, with significant gains in fidelity and semantic relevance as measured by various metrics (FD, FAD, KL, IS, CLAP). The results demonstrate not only superior performance compared to prior models but also a substantial reduction in inference time, achieving up to 8.5 times faster batch inference than the state-of-the-art AR diffusion system, IMPACT. The thoroughness of the experiments, including ablation studies on representation distillation and classifier-free guidance, adds credibility to the findings.
The paper provides detailed descriptions of the experimental setup, including datasets, model configurations, and evaluation metrics, which contribute to reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability of other researchers to replicate the results directly. Clear documentation of hyperparameters and training procedures is essential for future work in this area.
While the proposed method shows promising results, it still falls short of the audio quality achieved by multi-step diffusion models, indicating that there may be inherent trade-offs between speed and fidelity. The reliance on a single sampling step may also limit the model's flexibility in generating more complex audio sequences. Additionally, the paper does not address potential biases in the training datasets, which could affect the generalizability of the model.
The advancements in low-latency text-to-audio generation have significant implications for real-time applications in multimedia content creation, interactive media, and personalized audio experiences. The ability to generate high-quality audio quickly opens up new avenues for user engagement and creative expression. Furthermore, the integration of energy-distance training and representation distillation could inspire future research in other generative tasks across different modalities.
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuChoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
Primary: Fudan University
All Institutions: Fudan University, ByteDance
The main contribution of this paper is the introduction of GaMMA, a large multimodal model that effectively integrates temporal and non-temporal music understanding, alongside the establishment of MusicBench as a comprehensive evaluation benchmark. This work represents a significant advancement in the field of music AI, addressing critical gaps in existing models and providing a robust framework for future research.
The methodology presented in GaMMA is robust, utilizing a dual-encoder architecture that effectively captures both temporal and non-temporal aspects of music understanding. The mixture-of-experts approach, combined with a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning), is innovative and addresses existing gaps in music LMMs. The introduction of MusicBench as a comprehensive benchmark for evaluating music understanding adds significant value to the methodology, allowing for a nuanced assessment of model capabilities.
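A minimal sketch of mixing audio-encoder experts with learned gates follows; the specific encoders, expert count, and dimensions are assumptions, not GaMMA's exact configuration.

```python
import torch
import torch.nn as nn

class AudioExpertMixer(nn.Module):
    """Combine features from multiple frozen audio encoders via soft gates."""
    def __init__(self, n_experts: int = 2, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)

    def forward(self, expert_feats: list) -> torch.Tensor:
        # expert_feats: n_experts tensors of shape (B, T, dim), e.g. one
        # time-series-oriented encoder and one tag/semantics-oriented one.
        stacked = torch.stack(expert_feats, dim=-2)                    # (B, T, E, dim)
        weights = torch.softmax(self.gate(stacked.mean(-2)), dim=-1)   # (B, T, E)
        return (weights.unsqueeze(-1) * stacked).sum(-2)               # (B, T, dim)

mixer = AudioExpertMixer()
fused = mixer([torch.randn(1, 50, 768), torch.randn(1, 50, 768)])
print(fused.shape)   # torch.Size([1, 50, 768])
```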
The experiments conducted demonstrate the effectiveness of GaMMA, achieving state-of-the-art results on multiple benchmarks, including MusicBench and MuChoMusic. The extensive evaluation across various dimensions of music understanding, including temporal reasoning and global attributes, showcases the model's capabilities. The use of human-curated questions in MusicBench enhances the credibility of the results, though the paper could benefit from more extensive comparisons with a wider range of existing models.
The paper provides detailed implementation specifics, including training strategies, hyperparameters, and data curation processes, which are essential for reproducibility. However, the absence of publicly available code or datasets limits the ability for independent verification of results.
One limitation is the reliance on curated datasets, which may introduce biases or limit the generalizability of the model. Additionally, while the dual-encoder approach is innovative, it may require significant computational resources, which could hinder accessibility for broader research applications.
GaMMA has the potential to significantly impact the field of music understanding and multimodal AI by providing a framework that can be adapted for various applications, such as music recommendation systems, educational tools, and interactive music assistants. Its ability to understand and reason about music in a nuanced manner could lead to advancements in how machines interact with human creativity and cultural expressions.
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Δ = 0.013 Western, Δ = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows that the GRL objective improves either backbone, but the choice of WavLM contributes as well. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents LASE, a novel approach to cross-script identity preservation in multilingual voice cloning, demonstrating significant advancements in disentangling language from speaker identity and providing valuable resources for future research. The methodology and results contribute meaningfully to the field of audio processing and speaker recognition, particularly in the context of Indic languages.
The paper introduces a novel approach using a Language-Adversarial Speaker Encoder (LASE) that effectively disentangles language from speaker identity in multilingual voice cloning tasks. The methodology employs a gradient-reversal layer and a supervised contrastive loss to create a speaker embedding that is invariant to language, which is a significant advancement in the field. The architecture is well-defined, consisting of a frozen WavLM-base-plus backbone and a trainable projection head, which allows for efficient training and effective performance on cross-script tasks.
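The gradient-reversal construction is the standard DANN-style layer and can be sketched as follows: the forward pass is the identity, while the backward pass negates (and scales) the gradient, so the language classifier trains normally while the embedding is pushed to be language-uninformative. The 4-language head follows the abstract; embedding sizes are illustrative.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)          # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # reversed, scaled gradient

embed = torch.randn(4, 256, requires_grad=True)   # speaker embeddings
lang_clf = nn.Linear(256, 4)                      # 4-language classifier
logits = lang_clf(GradReverse.apply(embed, 1.0))
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1, 2, 3]))
loss.backward()   # embed.grad now pushes AWAY from language separability
```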
The experiments are robust, utilizing two distinct corpora to evaluate the performance of LASE against established baselines (WavLM-base-plus-sv and ECAPA-TDNN). The results demonstrate a significant reduction in the identity gap across scripts, with LASE achieving a gap of 0.013 compared to 0.082 and 0.105 for the baselines. The paper also includes a thorough analysis of the training dynamics and presents a synthetic multi-speaker diarisation benchmark, showing that LASE can match ECAPA-TDNN's performance with significantly less training data.
The authors provide a comprehensive set of resources, including the model weights, training corpus, and evaluation scripts, which enhances reproducibility. The detailed description of the training process, loss functions, and hyperparameters further supports the ability of other researchers to replicate the results.
The study relies solely on synthetic data generated by ElevenLabs, which may not fully capture the complexities of natural human speech. Additionally, the held-out set shares voices with the training data, limiting the generalization assessment. The paper also acknowledges that the model's performance on real-world data and new voices remains to be evaluated.
The implications of this work are significant for applications in multilingual voice cloning, speaker verification, and diarisation systems, particularly in contexts involving Indian languages. The ability to maintain speaker identity across different scripts can enhance user experience in customer support, content creation, and accessibility technologies.
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach is to apply standard sound event detection to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware framework that builds on a foundational audio generation backbone to jointly generate audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables more interpretable and practical video-to-audio synthesis.
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The paper presents MMAudio-LABEL, a novel framework for joint audio generation and event labeling from silent videos, demonstrating significant improvements over existing methods. The technical contributions and methodology are well-articulated, showcasing the potential for broader applications in multimedia content creation and multimodal learning.
The proposed MMAudio-LABEL framework innovatively combines audio generation with event labeling in a unified architecture, addressing the limitations of traditional post-hoc sound event detection methods. By leveraging a multimodal transformer and exploring two distinct architectures (Parallel Heads and Joint Heads), the authors demonstrate a thoughtful approach to integrating visual and auditory information. The methodology is well-structured, with clear explanations of the model architecture and training objectives, although further details on the training data preprocessing and augmentation strategies could enhance clarity.
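The Parallel Heads variant can be sketched as separate prediction heads over a shared multimodal trunk; the trunk itself and all dimensions are illustrative assumptions, with only the 17-class material set taken from the abstract.

```python
import torch
import torch.nn as nn

class ParallelHeads(nn.Module):
    """One trunk, three frame-aligned outputs: audio latents, onsets, materials."""
    def __init__(self, dim=512, latent_dim=64, n_materials=17):
        super().__init__()
        self.audio_head = nn.Linear(dim, latent_dim)       # audio latent per frame
        self.onset_head = nn.Linear(dim, 1)                # onset probability per frame
        self.material_head = nn.Linear(dim, n_materials)   # material logits per frame

    def forward(self, trunk_feats: torch.Tensor):
        # trunk_feats: (B, T, dim) from the shared multimodal transformer
        return (self.audio_head(trunk_feats),
                torch.sigmoid(self.onset_head(trunk_feats)),
                self.material_head(trunk_feats))

latents, onsets, materials = ParallelHeads()(torch.randn(1, 100, 512))
print(latents.shape, onsets.shape, materials.shape)
```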
The experiments are robust, utilizing the Greatest Hits dataset to evaluate both onset detection and material classification. The reported improvements in accuracy metrics (from 46.7% to 75.0% for onset detection and from 40.6% to 61.0% for material classification) provide compelling evidence of the framework's effectiveness. However, the paper could benefit from additional comparative analyses against a wider range of baseline models to contextualize the performance gains further.
The implementation details are adequately described, including model architecture, training parameters, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the trained models or code would significantly enhance the paper's impact and usability for the research community.
One notable limitation is the reliance on a specific dataset (Greatest Hits), which may not fully represent the diversity of audio events in real-world scenarios. Additionally, the model's performance on less distinctive materials indicates potential challenges in generalization. The paper could also discuss the computational complexity and resource requirements of the proposed framework.
The MMAudio-LABEL framework has significant implications for content creation, immersive media, and human-computer interaction, as it enables more intuitive sound event labeling from silent videos. This could streamline workflows in various industries, including film production and gaming, where accurate audio representation is crucial. The integration of audio generation and event labeling also opens avenues for future research in multimodal learning and generative models.
We present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. Medical audio data is difficult to collect due to privacy regulations and the high annotation costs arising from required domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address these challenges, MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs spanning multiple-choice, sequential multi-turn, and open-ended formats, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model like Gemini-2.5-Pro achieves only approximately 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models.
Primary: University of Maryland, College Park, MD, USA
All Institutions: Centific Global Solutions Inc., University of Maryland, College Park, MD, USA
The paper presents MedMosaic, a large-scale medical audio question-answering benchmark designed to evaluate audio reasoning models under realistic clinical constraints. This work is significant as it addresses a critical gap in the evaluation of multimodal reasoning in the medical domain, providing a structured framework for future research and development in audio understanding and reasoning.
The methodology presented in this paper is robust, featuring a comprehensive pipeline for generating question-answer pairs from diverse medical audio sources. The authors effectively address the challenges of collecting and annotating medical audio data by leveraging synthetic audio generation techniques. The structured approach to creating varied question types (e.g., sound-only, speech-only, multi-turn) is commendable, as it allows for a nuanced evaluation of audio reasoning capabilities. The use of subject matter experts for validation adds credibility to the dataset's clinical relevance. However, the reliance on synthetic data raises questions about the authenticity of the generated audio and its implications for real-world applications.
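For concreteness, the sketch below shows one plausible record layout for such a question-answer item, covering the question categories described above; all field names are illustrative assumptions rather than the released schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MedAudioQA:
    """Illustrative record layout; field names are assumptions,
    not the released dataset schema."""
    audio_path: str
    audio_type: str     # e.g. "physiological", "synthetic_voice", "conversation"
    category: str       # "multiple_choice" | "multi_turn" | "open_ended"
    question: str
    answer: str
    choices: Optional[List[str]] = None   # populated for multiple-choice items
    turn_index: int = 0                   # position within a multi-turn chain
```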
The experimental evaluation is thorough, benchmarking 13 different audio and multimodal reasoning models against the MedMosaic dataset. The results demonstrate significant performance challenges across all models, highlighting the dataset's difficulty and the need for further advancements in medical audio reasoning. The detailed breakdown of model performance across various question types provides valuable insights into the strengths and weaknesses of current systems. However, the paper could benefit from more extensive comparisons with existing benchmarks to contextualize the results further.
The paper provides a detailed description of the dataset generation process and the evaluation framework, which aids in reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the findings. The authors should consider releasing the dataset and the generation pipeline to enhance reproducibility and facilitate further research in this area.
The primary limitation of this work lies in the reliance on synthetic audio, which may not fully capture the complexities of real-world medical audio scenarios. Additionally, while the dataset is extensive, the performance of state-of-the-art models remains relatively low, indicating that the benchmark may still be too challenging for current systems. The authors acknowledge the need for further validation before clinical deployment, which is a critical consideration for any application in healthcare.
The development of MedMosaic has the potential to significantly advance the field of medical audio processing and reasoning. By providing a challenging benchmark, it encourages the development of more sophisticated models capable of understanding and reasoning over complex medical audio. This could ultimately lead to improved clinical decision-making and patient outcomes. However, the authors emphasize the importance of extensive validation before any real-world application, highlighting the need for caution in deploying AI systems in healthcare settings. The paper presents MedMosaic, a large-scale medical audio question-answering benchmark designed to evaluate audio reasoning models under realistic clinical constraints. This work is significant as it addresses a critical gap in the evaluation of multimodal reasoning in the medical domain, providing a structured framework for future research and development in audio understanding and reasoning.
Current federated multimodal continual learning over mixture-of-experts low-rank adaptation (MoE-LoRA) is built on the unverified assumption that routing isolates task-specific knowledge into disjoint experts. We argue that routing operates per sample, whereas forgetting accumulates across the task sequence, and that gradient conflict persists within each expert even when routing is maximally polarized. Moreover, activation-subspace protection can also fail: under parameter-efficient fine-tuning it entangles tasks due to a dimension-counting bound, and federated averaging (FedAvg) disrupts client-side orthogonality. To address this, we propose PRISM (Per-expert Routing-projection Interference-informed Subspace Method), which maintains a per-expert gradient subspace basis whose orthogonality is preserved under FedAvg and reinterprets MoE routing as a capacity allocator. Our results show that, on LLaVA-1.5-7B, LLaVA-1.5-13B, and Qwen2.5-VL-7B across CoIN-6 and CoIN-Long-10, PRISM outperforms sixteen state-of-the-art baselines in average accuracy. Compared to the best federated multimodal baseline, the performance margin increases from +3.23 pp on CoIN-6 to +6.06 pp on CoIN-Long-10.
Primary: South Dakota State University
All Institutions: South Dakota State University
The main contribution of this paper is the introduction of PRISM, a novel approach that effectively resolves issues of spurious isolation in federated multimodal continual learning by maintaining orthogonality in gradient subspaces and reinterpreting routing mechanisms. The comprehensive analysis of the methodology, experimental results, and potential applications underscores its significance in advancing the field of federated learning.
The paper introduces PRISM, a novel method addressing the limitations of existing federated multimodal continual learning approaches. The methodology is well-structured, focusing on the preservation of orthogonality in gradient subspaces and reinterpreting MoE routing as a capacity allocator. The proposed mechanisms, including the Per-Expert Federated Orthogonal Subspace Union (PE-FOSU) and interference-informed scheduling, are innovative and effectively tackle the identified issues of spurious isolation and entangled activation subspaces. The authors provide a clear theoretical foundation for their approach, which is crucial for understanding the underlying principles of their method.
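The core subspace mechanics can be made concrete with a short sketch. The code below is a generic gradient-projection routine in the spirit of the per-expert basis described above, assuming gradients are flattened vectors and the stored basis has orthonormal columns; function names are illustrative, and the actual PE-FOSU procedure differs in detail.

```python
import torch

def project_orthogonal(grad: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    # grad:  (d,) flattened gradient of one expert's adapter
    # basis: (d, k) orthonormal columns spanning directions important to
    # earlier tasks for this expert; the update keeps only the component
    # outside that subspace, so it cannot overwrite protected directions.
    return grad - basis @ (basis.T @ grad)

def extend_basis(basis: torch.Tensor, task_grads: torch.Tensor,
                 rank: int) -> torch.Tensor:
    # After a task finishes, fold its dominant new gradient directions into
    # the basis: take the residual outside the current subspace and keep its
    # top singular vectors (exactly orthogonal to the existing columns).
    residual = task_grads - basis @ (basis.T @ task_grads)
    u, _, _ = torch.linalg.svd(residual, full_matrices=False)
    return torch.cat([basis, u[:, :rank]], dim=1)

# Usage: start from an empty per-expert basis, e.g. torch.empty(d, 0).
```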
The experimental setup is robust, evaluating PRISM against sixteen state-of-the-art baselines across two multimodal benchmarks (CoIN-6 and CoIN-Long-10). The results demonstrate significant improvements in average accuracy and backward transfer, with detailed comparisons that highlight the advantages of the proposed method. The paper includes comprehensive analyses of the results, showcasing the effectiveness of PRISM in various scenarios.
The paper provides sufficient implementation details, including the architecture, training protocols, and evaluation metrics. However, the absence of a public code repository or demo URL limits the reproducibility of the results. Future work should consider making the code available to facilitate validation by the research community.
While the proposed method shows promise, the paper does not address the computational overhead associated with maintaining per-expert gradient subspaces, which could be a concern in large-scale applications. Additionally, the evaluation is limited to specific multimodal benchmarks, and further testing on diverse datasets would strengthen the findings.
The implications of this research extend to various applications in federated learning, particularly in scenarios where data privacy is paramount. By enhancing the performance of multimodal continual learning systems, PRISM could contribute to advancements in areas such as personalized AI, healthcare, and collaborative learning environments. The main contribution of this paper is the introduction of PRISM, a novel approach that effectively resolves issues of spurious isolation in federated multimodal continual learning by maintaining orthogonality in gradient subspaces and reinterpreting routing mechanisms. The comprehensive analysis of the methodology, experimental results, and potential applications underscores its significance in advancing the field of federated learning.
Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for users to search with choreographic intent, but it remains underexplored because dance requires simultaneous reasoning over linguistic semantics, musical rhythm, and full-body motion dynamics. We introduce TD-Data, a large-scale open dataset for text-dance retrieval, containing about 4,000 12-second dance clips, 14.6 hours of motion, 22 genres, and annotations from professional dance experts. On top of this dataset, we propose CustomDancer, a multimodal retrieval framework that aligns text with dance through a CLIP-based text encoder, music and motion encoders, and a music-motion blending module. CustomDancer achieves state-of-the-art performance on TD-Data, reaching 10.23% Recall@1 and improving retrieval quality in both quantitative benchmarks and user preference studies.
Primary: South-Central Minzu University
All Institutions: South-Central Minzu University
The main contribution of this paper is the introduction of CustomDancer, a multimodal framework for text-dance retrieval, and the TD-Data dataset, which together advance the state-of-the-art in dance content discovery. The comprehensive methodology, rigorous experimental evaluation, and acknowledgment of limitations underscore the significance of this work in the intersection of machine learning and the performing arts.
The methodology is robust, introducing a novel multimodal retrieval framework (CustomDancer) that effectively combines text, music, and motion through a well-structured architecture. The use of a CLIP-based text encoder alongside dedicated music and motion encoders is innovative, allowing for a more nuanced understanding of dance retrieval. The music-motion blending module is particularly noteworthy as it captures the interaction between music and motion, which is crucial for dance. The construction of the TD-Data dataset with expert annotations adds significant value, providing a solid foundation for training and evaluation.
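A minimal sketch of the blending-and-retrieval idea follows, assuming precomputed text, music, and motion embeddings of equal dimension; the gating and projection layers are illustrative stand-ins, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlendAndScore(nn.Module):
    """Blends music and motion embeddings into one dance embedding and
    scores it against text embeddings; layers are illustrative stand-ins."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_emb, music_emb, motion_emb):
        g = self.gate(torch.cat([music_emb, motion_emb], dim=-1))
        dance = self.proj(g * music_emb + (1 - g) * motion_emb)
        text = F.normalize(text_emb, dim=-1)
        dance = F.normalize(dance, dim=-1)
        return text @ dance.T        # (n_text, n_dance) cosine similarities

def retrieval_loss(sim: torch.Tensor, temperature: float = 0.07):
    # Symmetric InfoNCE: matched text-dance pairs sit on the diagonal.
    labels = torch.arange(sim.size(0), device=sim.device)
    logits = sim / temperature
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```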
The experiments are comprehensive, utilizing multiple evaluation metrics (Recall@K, Median Rank, Mean Rank) that are appropriate for the task. The comparison with strong baselines demonstrates the effectiveness of CustomDancer, and the user study adds a qualitative dimension to the evaluation, confirming that the model aligns well with human judgments. The ablation studies provide insights into the contributions of different components of the model, reinforcing the importance of temporal modeling and feature fusion.
The paper provides detailed implementation details, including the architecture of the encoders and the training objectives. However, the lack of a publicly available code repository or dataset could hinder reproducibility. Future work should consider releasing the code and dataset to facilitate further research in this area.
The paper acknowledges several limitations, including challenges with specialized terminology, conflicts between visual motion and musical affect, and potential performer bias. These factors can impact retrieval accuracy and user satisfaction. Additionally, the dataset's focus on 3D motion and music may overlook important visual elements like costumes and facial expressions.
The work has the potential to significantly impact the fields of dance education, choreography, and creative recommendation systems. By making dance retrieval more accessible, it can facilitate learning and exploration of diverse dance styles. However, the authors emphasize the need for cultural sensitivity in dataset construction and application, highlighting the importance of preserving the context and community significance of dance styles. The main contribution of this paper is the introduction of CustomDancer, a multimodal framework for text-dance retrieval, and the TD-Data dataset, which together advance the state-of-the-art in dance content discovery. The comprehensive methodology, rigorous experimental evaluation, and acknowledgment of limitations underscore the significance of this work in the intersection of machine learning and the performing arts.
To address the limitations of existing Generative Fixed-Filter Active Noise Control (GFANC) methods, which rely on filter decomposition and recombination and require supervised learning with labeled data, this paper proposes a Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework. Unlike previous approaches that predict combination weights of sub control filters, the proposed method directly generates control filters in an unsupervised manner by integrating the co-processor and real-time controller into a fully differentiable ANC system, where the accumulated error signal is used as the training objective. By abandoning the decomposition--reconstruction process, the proposed design simplifies the control pipeline and avoids error accumulation, while the Transformer architecture effectively captures global and dynamic noise characteristics through its attention mechanism. Numerical simulations on real-recorded noises demonstrate that the proposed method achieves improved noise reduction performance and adaptability to different types of noises compared with the original GFANC framework.
Primary: unknown
All Institutions: unknown
The paper presents a novel Transformer-based framework for active noise control that simplifies the filter generation process and improves adaptability to real-world noise conditions. This work is significant as it combines advanced neural architectures with practical applications in noise cancellation, potentially leading to enhanced performance in diverse acoustic environments.
The proposed Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework represents a significant methodological advancement in active noise control (ANC) by integrating a Transformer architecture for direct control-filter generation. This approach eliminates the need for sub-filter decomposition and recombination, which simplifies the control pipeline and enhances adaptability to varying noise conditions. The unsupervised training paradigm, which relies on minimizing the accumulated residual error, is innovative as it reduces the dependency on labeled data, a common limitation in many machine learning applications. The use of a differentiable ANC system allows for end-to-end training, which is a notable strength of the methodology.
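The differentiable training objective can be sketched as follows, assuming a single-channel setup in which the generated control filter is applied to the reference signal and the residual energy at the error microphone is minimized; the secondary acoustic path is taken as identity here for brevity, whereas a real system would include a fixed, measured path.

```python
import torch
import torch.nn.functional as F

def anc_residual_loss(control_filter: torch.Tensor,
                      reference: torch.Tensor,
                      disturbance: torch.Tensor) -> torch.Tensor:
    """control_filter: (taps,) FIR filter emitted by the generator network
    reference:      (T,) noise at the reference microphone
    disturbance:    (T,) noise at the error microphone
    Secondary path assumed identity for brevity; a real system would
    convolve the anti-noise with a fixed, measured path."""
    x = reference.view(1, 1, -1)
    w = control_filter.flip(0).view(1, 1, -1)   # flip: conv1d is correlation
    anti_noise = F.conv1d(x, w, padding=w.shape[-1] - 1)[..., : x.shape[-1]]
    residual = disturbance.view(1, 1, -1) - anti_noise
    return residual.pow(2).mean()               # accumulated error energy
```

Because every step above is differentiable, gradients flow from the residual energy back into the Transformer that emitted `control_filter`, which is what enables the unsupervised, label-free training the paper describes.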
The experimental setup is robust, utilizing a large synthetic dataset of 83,977 noise samples and evaluating the model's performance on both unseen real-world and synthetic noises. The results indicate that the proposed method outperforms the existing GFANC framework in most real-noise scenarios, demonstrating its practical applicability. However, the performance on synthetic noises is mixed, suggesting that while the model excels in real-world conditions, it may not universally outperform all existing methods across all noise types. The evaluation metrics used, particularly the noise reduction (NR) levels, are appropriate for assessing ANC performance.
The paper provides sufficient detail regarding the model architecture, training parameters, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results. Future work could benefit from sharing the implementation details and datasets used for training and testing.
One significant limitation is the reliance on a fixed acoustic path during training and evaluation, which may not generalize well to different acoustic environments without retraining the model. Additionally, the increased complexity of the Transformer-based model, while beneficial for performance, raises concerns about computational efficiency and resource requirements, which could limit its deployment in real-time applications.
The proposed framework has the potential to significantly improve active noise control systems in various applications, including consumer electronics, automotive, and industrial environments. By enhancing adaptability to dynamic noise conditions, this research could lead to more effective noise cancellation solutions, improving user experience and comfort in noisy environments. The implications for real-time processing and deployment in practical scenarios are promising, although further work is needed to address the identified limitations. The paper presents a novel Transformer-based framework for active noise control that simplifies the filter generation process and improves adaptability to real-world noise conditions. This work is significant as it combines advanced neural architectures with practical applications in noise cancellation, potentially leading to enhanced performance in diverse acoustic environments.
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on 5 different tasks across 56 benchmark datasets and find that Alethia significantly outperforms state-of-the-art SFMs, with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
Primary: Reality Defender Inc.
All Institutions: Reality Defender Inc., INRS
The main contribution of this paper is the introduction of Alethia, a foundational encoder for voice deepfakes that significantly enhances detection and localization capabilities through an innovative pretraining methodology. This work addresses critical gaps in existing models and sets a new standard for future research in the domain of audio deepfake detection.
The paper introduces a novel pretraining framework for voice deepfake detection, Alethia, which innovatively combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. This dual-branch approach allows the model to learn robust representations that capture generative artifacts in voice deepfakes, addressing limitations in existing speech foundation models (SFMs) that primarily focus on downstream finetuning. The methodology is well-structured, with a clear explanation of the model architecture, pretraining objectives, and the rationale behind the design choices, such as the use of continuous embeddings instead of discrete tokens.
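A minimal sketch of the two pretraining objectives is shown below, assuming a teacher that provides continuous target embeddings and a user-supplied velocity network for the flow-matching branch; this illustrates the general recipe, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def masked_embedding_loss(pred: torch.Tensor, teacher_emb: torch.Tensor,
                          mask: torch.Tensor) -> torch.Tensor:
    # Continuous regression to teacher embeddings at masked frames only;
    # pred, teacher_emb: (B, T, D), mask: (B, T) boolean.
    return F.mse_loss(pred[mask], teacher_emb[mask])

def flow_matching_loss(velocity_net, spec: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    # Conditional flow matching toward the clean spectrogram `spec` (B, F, T)
    # along a linear noise-to-data path; `velocity_net` is a user-supplied
    # network taking (x_t, t, cond) and predicting the velocity field.
    noise = torch.randn_like(spec)
    t = torch.rand(spec.size(0), 1, 1, device=spec.device)
    x_t = (1 - t) * noise + t * spec
    target_v = spec - noise            # constant velocity along the path
    return F.mse_loss(velocity_net(x_t, t.view(-1), cond), target_v)
```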
The experimental evaluation is comprehensive, covering five different tasks across 56 benchmark datasets, which is a significant contribution to the field. The results demonstrate that Alethia outperforms existing SFMs in various metrics, including equal error rate (EER) and accuracy, particularly in challenging scenarios. The zero-shot generalization capability to unseen domains, such as singing deepfakes, is a notable strength of the model. However, the paper could benefit from more detailed ablation studies to further validate the contributions of each component in the proposed framework.
The paper provides a thorough description of the experimental setup, including data preprocessing, model architecture, and training procedures. However, the lack of publicly available code or datasets limits reproducibility. Providing a GitHub repository or links to the datasets used would enhance the ability of other researchers to replicate the findings.
One limitation of the study is the reliance on self-curated datasets for pretraining, which may introduce biases or artifacts not present in real-world data. Additionally, while the model shows promising results, its performance on edge cases or highly diverse datasets remains to be fully explored. The paper also does not address potential ethical implications of deepfake technology, which is crucial given the sensitive nature of the application.
The research has significant implications for the field of audio processing and deepfake detection, contributing to the development of more robust systems that can help mitigate the risks associated with the misuse of deepfake technology. As deepfakes become more prevalent, the ability to detect and localize them effectively is crucial for maintaining trust in digital communications. The main contribution of this paper is the introduction of Alethia, a foundational encoder for voice deepfakes that significantly enhances detection and localization capabilities through an innovative pretraining methodology. This work addresses critical gaps in existing models and sets a new standard for future research in the domain of audio deepfake detection.
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accent scenarios. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation itself is a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, National Center for Supercomputing Applications
The main contribution of this paper is the development of a few-shot accent synthesis pipeline that leverages LLM-guided phoneme editing to improve ASR performance in low-resource settings. This innovative approach not only addresses the challenge of accent adaptation but also demonstrates the effectiveness of combining TTS and ASR technologies to enhance speech recognition across diverse accents.
The proposed methodology effectively combines few-shot learning with LLM-guided phoneme editing to address the challenge of accent adaptation in ASR systems. The approach is innovative in its use of a phoneme-conditioned TTS model and the integration of LLMs for phoneme editing, which allows for accent-specific pronunciation adjustments while maintaining prosodic alignment. The system's architecture is well-defined, and the use of a matched-rate random phoneme baseline provides a strong comparative framework to evaluate the effectiveness of the LLM-guided edits.
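The matched-rate random baseline in particular is easy to make concrete. The sketch below perturbs a phoneme sequence at a fixed rate with uniformly random substitutions, using a toy inventory; the LLM-guided variant would replace the random choice with accent-conditioned edits.

```python
import random

TOY_INVENTORY = ["AA", "AE", "AH", "B", "D", "EH", "IY", "K", "P", "S", "T", "Z"]

def random_phoneme_edit(phonemes, edit_rate=0.1, seed=0):
    """Matched-rate baseline: substitute each phoneme independently with
    probability `edit_rate`, matching the edit rate of the LLM-guided
    pipeline but without any accent-conditioned structure."""
    rng = random.Random(seed)
    edited = []
    for ph in phonemes:
        if rng.random() < edit_rate:
            edited.append(rng.choice([p for p in TOY_INVENTORY if p != ph]))
        else:
            edited.append(ph)
    return edited

print(random_phoneme_edit(["B", "AE", "T", "IY", "Z"], edit_rate=0.4))
```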
The experiments are comprehensive, evaluating the proposed method across multiple accents (Indian and Korean English) and demonstrating significant improvements in WER through synthetic data generation. The paper provides a clear experimental setup, including detailed descriptions of the datasets, evaluation metrics, and results. The findings indicate that the proposed method not only enhances ASR performance in low-resource scenarios but also shows potential for cross-speaker generalization, which is a critical aspect of practical ASR applications.
The paper includes sufficient implementation details, including training configurations, feature extraction methods, and evaluation protocols, which support reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors should consider releasing their code and models to enhance reproducibility.
One notable limitation is that the system inherits prosody from the source speech rather than modeling accent-specific prosodic variations, which may restrict the fidelity of the synthesized speech. Additionally, the adaptation is limited to a single reference speaker, which could affect the generalizability of the results across different speakers and accents. Future work should address these limitations by exploring multi-speaker accent generation and explicit prosody modeling.
The research has significant implications for improving ASR systems in diverse linguistic contexts, particularly for underrepresented accents. By enabling effective accent adaptation with minimal data, this work can contribute to more inclusive speech technologies that better serve global populations. The potential applications extend to various domains, including voice assistants, transcription services, and accessibility tools, enhancing communication for speakers of different accents. The main contribution of this paper is the development of a few-shot accent synthesis pipeline that leverages LLM-guided phoneme editing to improve ASR performance in low-resource settings. This innovative approach not only addresses the challenge of accent adaptation but also demonstrates the effectiveness of combining TTS and ASR technologies to enhance speech recognition across diverse accents.
Research on robocall surveillance is hindered by limited access to public datasets, largely owing to privacy concerns. In this work, we first curate Robo-SAr, a synthetic robocall dataset designed for robocall surveillance research. Robo-SAr comprises ~200 unwanted and ~1200 legitimate synthetic robocall samples across three realistic adversarial axes: psycholinguistics-manipulated transcripts, emotion-eliciting speech, and cloned voices. We further propose RoboKA, a Kolmogorov-Arnold Network (KAN)-based multimodal fusion framework designed to model structured nonlinear interactions between the acoustic and linguistic cues that characterize diverse adversarial robocall strategies. RoboKA first leverages cross-modal contrastive learning to align latent modality representations and feeds the resulting embeddings to a KAN-projection head for final classification. We benchmark RoboKA against strong unimodal and multimodal baselines in both in-domain and out-of-domain setups, finding that RoboKA surpasses all baselines in terms of recall and F1-score.
Primary: Indraprastha Institute of Information Technology Delhi
All Institutions: Indraprastha Institute of Information Technology Delhi, George Mason University
The main contribution of this paper is the introduction of Robo-SAr, a novel adversarial dataset for robocall surveillance, and the development of RoboKA, a KAN-informed multimodal framework that significantly improves the detection of unwanted calls. This work addresses critical gaps in the field by providing a comprehensive approach to modeling the complex interactions between audio and linguistic cues, thereby advancing the state of the art in robocall detection.
The methodology presented in this paper is robust and innovative, leveraging a novel dataset (Robo-SAr) that addresses the limitations of existing datasets in robocall research. The use of Kolmogorov-Arnold Networks (KAN) for multimodal fusion is a significant advancement, as it allows for the modeling of complex nonlinear interactions between audio and text modalities. The cross-modal contrastive learning approach enhances the alignment of representations, which is crucial for effective robocall detection. The authors also provide a clear explanation of their methods and the rationale behind their choices, making the methodology both sound and well-justified.
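Since full KAN layers with learnable splines are involved to reproduce, the sketch below uses a fixed sinusoidal basis with learnable per-edge coefficients as a lightweight stand-in for the KAN-projection head; it illustrates only the structured per-edge nonlinearity idea and is not the paper's implementation.

```python
import torch
import torch.nn as nn

class SineBasisHead(nn.Module):
    """Stand-in for a KAN-projection head: each edge applies a learnable
    combination of fixed sinusoidal basis functions instead of the
    learnable splines of a true KAN. Illustrative only."""
    def __init__(self, in_dim: int, out_dim: int, n_basis: int = 4):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_basis + 1).float())
        self.coeff = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))
        self.skip = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                  # x: (B, in_dim)
        basis = torch.sin(x.unsqueeze(-1) * self.freqs)    # (B, in_dim, n_basis)
        edge = torch.einsum("bik,oik->bo", basis, self.coeff)
        return edge + self.skip(x)

# e.g. classify fused (contrastively aligned) audio+text embeddings of a call:
# head = SineBasisHead(in_dim=2 * 256, out_dim=2)
# logits = head(torch.cat([audio_emb, text_emb], dim=-1))
```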
The experimental evaluation is comprehensive, benchmarking RoboKA against various unimodal and multimodal baselines under different conditions, including in-domain and out-of-domain setups. The results demonstrate a clear performance advantage for RoboKA, particularly in challenging scenarios, which underscores the effectiveness of the proposed approach. The use of human validation for the dataset adds credibility to the findings, although the paper could benefit from more detailed statistical analysis of the results.
The paper commits to releasing the dataset and code upon review, which is a positive step towards ensuring reproducibility. However, the lack of explicit URLs for accessing the dataset and code is a drawback. The methodology is described in sufficient detail to allow for replication, but the absence of a demo or project URL limits immediate accessibility for other researchers.
The paper acknowledges several limitations, including the focus on English language robocalls, which restricts the applicability of the findings to multilingual contexts. Additionally, the reliance on synthetic data raises questions about the generalizability of the results to real-world scenarios. The authors also note that the dataset may not fully capture the complexities of real-world robocalls, which could impact the robustness of the model in practical applications.
The implications of this research are significant, particularly in the context of increasing robocall threats. By providing a robust framework for detecting deceptive robocalls, this work has the potential to enhance consumer protection and inform regulatory efforts. The methodology could also be adapted for other domains where multimodal deception detection is relevant, such as phishing or online scams. The main contribution of this paper is the introduction of Robo-SAr, a novel adversarial dataset for robocall surveillance, and the development of RoboKA, a KAN-informed multimodal framework that significantly improves the detection of unwanted calls. This work addresses critical gaps in the field by providing a comprehensive approach to modeling the complex interactions between audio and linguistic cues, thereby advancing the state of the art in robocall detection.
We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2% of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10% of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
Primary: University of Stellenbosch
All Institutions: University of Stellenbosch
The paper presents a pioneering evaluation of elephant call classification using pretrained acoustic embeddings, achieving significant performance without fine-tuning. This work not only advances the field of bioacoustics but also sets a precedent for leveraging existing models in low-data scenarios, thereby enhancing conservation efforts through automated analysis of wildlife vocalizations.
The paper introduces a novel approach to elephant call classification using pretrained acoustic embeddings without fine-tuning, which is significant given the scarcity of annotated bioacoustic data. The methodology is well-structured, employing a variety of embedding models from different domains and evaluating their performance with lightweight classifiers. The choice to analyze intermediate layers of transformer models for their efficacy in classification is particularly innovative, providing insights into the model's internal representations. The segmentation and classification processes are clearly defined, ensuring a robust experimental design.
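The evaluation protocol reduces to a simple recipe: freeze the embedding network, pool its features, and cross-validate a lightweight classifier. A minimal sketch with placeholder features is shown below; in practice the embeddings would come from, for instance, an intermediate wav2vec2.0 or HuBERT layer, mean-pooled over time.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder stand-ins: in practice, `embeddings` would be time-pooled
# outputs of a frozen pretrained encoder and `labels` the call annotations.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))
labels = rng.integers(0, 2, size=200)

clf = LogisticRegression(max_iter=1000)   # lightweight downstream classifier
aucs = cross_val_score(clf, embeddings, labels, cv=5, scoring="roc_auc")
print(f"cross-validated AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```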
The experiments are comprehensive, utilizing two distinct datasets for evaluation, which enhances the validity of the results. The performance metrics, including AUC and MAP, are appropriate for the classification task and allow for a nuanced understanding of model effectiveness. The results demonstrate that the best-performing embedding model, Perch 2.0, achieves competitive performance compared to end-to-end supervised models, highlighting the potential of using out-of-domain embeddings in low-resource settings.
The paper provides sufficient detail regarding the experimental setup, including data segmentation, model configurations, and hyperparameter tuning, which supports reproducibility. However, the lack of publicly available code or datasets limits the ease with which other researchers can replicate the study.
One notable limitation is the reliance on pretrained models that may not be strictly out-of-species, particularly with Perch 2.0, which raises questions about the generalizability of the findings. Additionally, the paper does not address potential biases in the datasets or the implications of using embeddings from models trained on other species.
The implications of this research extend beyond elephant call classification, as it demonstrates the utility of pretrained embeddings in bioacoustics, potentially influencing conservation strategies and wildlife management. The approach could be adapted for other endangered species, promoting the use of machine learning in ecological research and conservation efforts. The paper presents a pioneering evaluation of elephant call classification using pretrained acoustic embeddings, achieving significant performance without fine-tuning. This work not only advances the field of bioacoustics but also sets a precedent for leveraging existing models in low-data scenarios, thereby enhancing conservation efforts through automated analysis of wildlife vocalizations.
Audio-based stuttering systems to date have been trained for detection -- what disfluency is present now -- leaving prediction, the capability needed for closed-loop intervention, unstudied at deployable scale. We train a 616K-parameter CNN on SEP-28k (Apple, 20,131 three-second clips) to predict whether the next contiguous clip contains any disfluency. (1) Severity-selective precursor signal: on the episode-grouped test set, aggregate preblock AUC is modest (0.581 [0.542, 0.619]), but stratifying by upcoming event type reveals concentration on clinically severe events -- blocks 0.601 [0.554, 0.651] and sound repetitions 0.617 [0.567, 0.667] both exclude chance, while fillers (0.45) and word repetitions (0.49) are at chance. The aggregate objective converges to a severity-selective predictor because severe events carry prosodic precursors; fillers do not. (2) Cross-population transfer: without fine-tuning, the same checkpoint applied to 1,024 pediatric Children-Who-Stutter utterances (FluencyBank Teaching) attains AUC 0.674 for detection and 0.655 for prediction; DisfluencySpeech and LibriStutter reach 0.58-0.60 AUC. (3) Deployable on-device: lossless export to CoreML (1.19 MB), ONNX (40 KB), and TFLite. Neural-Engine latency per 3 s window ranges from 0.25 ms (iPhone 17 Pro Max, A19 Pro) to 0.55 ms (iPhone SE 3rd-gen and M1 Max). A 4 Hz streaming simulation uses 0.54% of the real-time budget. Platt calibration reduces test ECE from 0.177 raw to 0.010. Five negative ablations -- output-level Future-Guided Learning, multi-clip GRU, time-axis concatenation, asymmetric focal loss, and direct block-targeted training -- all failed to improve over the vanilla baseline.
Primary: Kozak Technologies Inc
All Institutions: Kozak Technologies Inc
The main contribution of this paper is the development of a predictive model for stuttering events using audio data, demonstrating that a relatively simple CNN can effectively identify clinically severe disfluencies based on prosodic precursors. This work not only advances the understanding of stuttering prediction but also paves the way for practical applications in speech therapy and real-time intervention systems.
The paper employs a convolutional neural network (CNN) architecture specifically designed for predicting stuttering events based on audio input. The methodology is robust, utilizing a well-defined dataset (SEP-28k) and employing a clear training objective that focuses on predicting upcoming disfluencies. The stratification of results by severity of disfluency types is a significant methodological strength, allowing for a nuanced understanding of the model's predictive capabilities. The inclusion of negative ablation studies further strengthens the methodology by demonstrating a thorough exploration of potential improvements that did not yield better results.
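The prediction target itself is a one-line transformation, sketched below: features of clip t are paired with the disfluency label of clip t+1, under the assumption that clips are contiguous within an episode.

```python
import torch

def next_clip_pairs(clip_feats: torch.Tensor, has_disfluency: torch.Tensor):
    """clip_feats:     (N, ...) features of N contiguous 3 s clips, in order
    has_disfluency: (N,) binary 'clip contains any disfluency' annotation
    Pairs clip t with the label of clip t+1, i.e. the next-contiguous-clip
    prediction target; pairs must be built within a single episode so that
    clip t+1 is genuinely contiguous with clip t."""
    return clip_feats[:-1], has_disfluency[1:]
```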
The experiments are well-structured, with a clear focus on both detection and prediction tasks. The use of multiple datasets, including cross-population transfer evaluations, enhances the credibility of the findings. The reported AUC scores provide a quantitative measure of performance, and the stratified analysis reveals important insights into the model's strengths and weaknesses. The deployment metrics, including on-device latency and model size, are particularly relevant for practical applications, showcasing the model's readiness for real-world use.
The paper emphasizes reproducibility by providing access to the training code, label-generation scripts, and the trained model weights. The detailed description of the training process, including hyperparameters and data preprocessing steps, further supports reproducibility. The inclusion of a catalog of negative results is a commendable practice that aids future research by preventing redundant efforts.
The paper acknowledges several limitations, including the single-clip context that may restrict the model's performance and the potential for variability across different speakers and datasets. The lack of fine-tuning on external datasets raises questions about the generalizability of the model's predictions. Additionally, the reliance on a coarse label for upcoming events could be improved with more precise annotations.
The research has significant implications for the field of speech therapy and assistive technologies for individuals who stutter. By enabling predictive capabilities in real-time, the model could facilitate closed-loop interventions that provide timely feedback to users. The deployment of such technology on consumer devices could enhance accessibility and usability for a broader audience, potentially improving the quality of life for many individuals. The main contribution of this paper is the development of a predictive model for stuttering events using audio data, demonstrating that a relatively simple CNN can effectively identify clinically severe disfluencies based on prosodic precursors. This work not only advances the understanding of stuttering prediction but also paves the way for practical applications in speech therapy and real-time intervention systems.
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large portion of overlapping speech, where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily overlapped parallel conversations, followed by clustering of the participants into conversational groups. In this work, we present the BUT system, based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model. To cluster participants into conversation groups, we employ the Qwen3.5-122B LLM to estimate transcript topic similarity, followed by hierarchical agglomerative clustering. On the MCoRec development set, the proposed system achieves 33.7% WER and a clustering F1 score of 0.97, improving over the official baseline by 16.2% WER and 0.15 F1 absolute. On the eval set, our team ranked second, trailing the best system by 0.16% WER and 0.5% F1.
Primary: Brno University of Technology
All Institutions: Brno University of Technology
This paper presents a novel approach to multi-talker ASR by integrating audio-visual cues and leveraging LLMs for clustering, achieving significant improvements over existing methods. The methodology is well-structured, and the results indicate a meaningful contribution to the field, although attention to limitations and reproducibility could enhance its impact further.
The proposed methodology integrates audio-visual cues into a long-context ASR system, leveraging pre-trained models (NVIDIA Parakeet-v2 and AV-HuBERT) effectively. The use of a gated mechanism for fusing audio and visual features is a notable innovation, allowing the model to dynamically adjust its reliance on each modality. The clustering approach, which employs a large language model (LLM) for semantic topic similarity, represents a significant departure from traditional heuristic methods. This combination of techniques is well-justified and demonstrates a thoughtful approach to addressing the challenges of multi-talker ASR.
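The gated fusion idea can be sketched as follows, assuming frame-aligned audio and visual feature streams; layer names and dimensions are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class GatedAVFusion(nn.Module):
    """Per-frame sigmoid gate deciding how much projected visual evidence
    to add to the audio stream; names and dimensions are illustrative."""
    def __init__(self, audio_dim: int, visual_dim: int):
        super().__init__()
        self.v_proj = nn.Linear(visual_dim, audio_dim)
        self.gate = nn.Sequential(nn.Linear(2 * audio_dim, audio_dim), nn.Sigmoid())

    def forward(self, audio, visual):
        # audio: (B, T, audio_dim), visual: (B, T, visual_dim), frame-aligned
        v = self.v_proj(visual)
        g = self.gate(torch.cat([audio, v], dim=-1))
        return audio + g * v     # overlapped frames can lean harder on video
```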
The experimental setup is robust, with clear metrics for both transcription (WER) and clustering (F1 score). The authors provide a thorough analysis of their results, showing substantial improvements over the baseline. However, the reliance on synthetic data for training raises questions about the generalizability of the results to real-world scenarios. The evaluation on both development and eval sets, along with comparisons to baseline systems, adds credibility to their findings.
The paper includes sufficient implementation details, including the training regimen, data preprocessing, and the use of specific frameworks (NeMo and DSPy). The availability of the code on GitHub enhances reproducibility, although the authors could provide more detailed instructions for replicating the experiments.
One limitation is the potential domain mismatch between the synthetic training data and the real-world MCoRec dataset, which could affect the model's performance in practical applications. Additionally, while the clustering approach shows promise, its reliance on LLMs may introduce variability based on the model's performance and the quality of the transcripts.
The advancements in multi-talker ASR have significant implications for applications in various fields, including telecommunications, accessibility for the hearing impaired, and human-computer interaction. The integration of visual cues into ASR systems could lead to more robust and accurate transcription services, enhancing communication in noisy environments. This paper presents a novel approach to multi-talker ASR by integrating audio-visual cues and leveraging LLMs for clustering, achieving significant improvements over existing methods. The methodology is well-structured, and the results indicate a meaningful contribution to the field, although attention to limitations and reproducibility could enhance its impact further.
We introduce a toolkit for uncovering spurious correlations between recording characteristics and the target class in speech datasets. Spurious correlations may arise from heterogeneous recording conditions, a common scenario for health-related datasets. When present in both the training and test data, these correlations result in an overestimation of system performance -- a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on detecting the target class using only the non-speech regions of the audio. Better-than-chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
Primary: Instituto de Investigación en Ciencias de la Computación
All Institutions: Instituto de Investigación en Ciencias de la Computación, Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Facultad de Medicina, Centro de Neurociencias Cognitivas, Universidad de Chile, Universidad de San Andrés
The paper introduces a novel toolkit for detecting spurious correlations in speech datasets, addressing a critical issue in machine learning applications. The technical contributions and methodology are well-articulated, providing valuable insights into the reliability of speech-based models, particularly in high-stakes scenarios.
The methodology presented in the paper is robust and well-structured, focusing on the detection of spurious correlations in speech datasets. The authors introduce a systematic approach that leverages non-speech regions of audio to diagnose potential biases in datasets, which is a significant advancement in ensuring the reliability of machine learning models in high-stakes applications. The toolkit's design, which includes careful selection of voice-activity detection systems and feature extraction methods, demonstrates a thorough understanding of the challenges posed by spurious correlations.
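The diagnostic reduces to a simple recipe: mask out speech with a VAD, extract features from what remains, and check whether a classifier beats chance. A crude sketch with hand-rolled features is shown below, assuming each recording retains some non-speech samples; the released toolkit's feature extractors and VAD options are more elaborate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def nonspeech_features(audio: np.ndarray, speech_mask: np.ndarray) -> np.ndarray:
    # Keep only samples a VAD marked as non-speech (True = speech) and
    # summarize them with a few crude spectral/energy statistics.
    nonspeech = audio[~speech_mask]
    spectrum = np.abs(np.fft.rfft(nonspeech))
    return np.array([nonspeech.std(), spectrum.mean(), np.median(spectrum)])

def leakage_auc(recordings, speech_masks, targets) -> float:
    # Above-chance AUC flags target-class information leaking through
    # recording conditions rather than through speech content.
    X = np.stack([nonspeech_features(a, m)
                  for a, m in zip(recordings, speech_masks)])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, targets, cv=5, scoring="roc_auc").mean()
```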
The experiments conducted on two Alzheimer's disease speech datasets are comprehensive and well-executed. The authors provide a detailed analysis of the performance of their method against various configurations, including different feature extraction techniques and VAD systems. The use of statistical significance testing adds rigor to their findings, although the reliance on specific datasets may limit generalizability.
The paper offers a clear description of the experimental setup and the toolkit's implementation, which is publicly available on GitHub. This enhances reproducibility, as other researchers can apply the same methods to their datasets. However, the paper could benefit from more detailed instructions on how to utilize the toolkit effectively.
One limitation of the study is the potential overfitting to the specific datasets used for evaluation, which may not represent the broader spectrum of speech datasets. Additionally, while the toolkit addresses spurious correlations, it does not provide solutions for all possible biases that may arise in speech data collection.
The implications of this research are significant, particularly in the context of health-related machine learning applications where spurious correlations can lead to harmful consequences. The toolkit can serve as a critical resource for researchers and practitioners in the field, promoting more reliable and ethical use of speech datasets in machine learning. The paper introduces a novel toolkit for detecting spurious correlations in speech datasets, addressing a critical issue in machine learning applications. The technical contributions and methodology are well-articulated, providing valuable insights into the reliability of speech-based models, particularly in high-stakes scenarios.
Cross-lingual speaker verification suffers from severe language-speaker entanglement. This causes systematic degradation in the hardest scenario: correctly accepting utterances from the same speaker across different languages while rejecting those from different speakers sharing the same language. Standard adversarial disentanglement degrades speaker discriminability; blind discriminators inadvertently penalize speaker-discriminative traits that merely correlate with language. To address this, we propose Dual-LoRA, injecting trainable task-factorized LoRA adapters into a frozen pre-trained backbone. Our core innovation is a Language-Anchored Adversary: by grounding the discriminator with an explicit language branch, adversarial gradients target true linguistic cues rather than arbitrary correlations, preserving essential speaker characteristics. Evaluated on the TidyVoice benchmark, our system achieves a 0.91% validation EER and achieves 3rd place in the official challenge.
Primary: Nanjing University
All Institutions: Nanjing University, AISpeech Co, Jiangsu Key Lab of Language Computing, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Soul AI Lab
The paper presents Dual-LoRA, an innovative framework for cross-lingual speaker verification that effectively disentangles language and speaker identity, achieving notable performance improvements on benchmark evaluations. The comprehensive methodology and rigorous experimental validation contribute significantly to the field, addressing a critical challenge in speaker verification systems.
The methodology presented in the paper is innovative, particularly in its use of Dual-LoRA, which introduces a parameter-efficient approach to disentangle language and speaker identity in cross-lingual speaker verification. The architecture's design, which incorporates two parallel LoRA streams and a Language-Anchored Adversary, is well-justified and addresses key challenges in the field. The decision to keep the backbone frozen while adapting only the LoRA modules is a strategic choice that enhances the model's efficiency and effectiveness.
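The adapter arrangement can be sketched as a frozen linear layer with two parallel low-rank branches, as below; which branch ends up carrying speaker versus language information is shaped by the training losses (including the language-anchored adversary), not by the module itself.

```python
import torch
import torch.nn as nn

class DualLoRALinear(nn.Module):
    """Frozen linear layer with two parallel rank-r adapters. The speaker/
    language split is imposed by the training objectives; the branch names
    here are purely descriptive."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # backbone stays frozen
        d_in, d_out = base.in_features, base.out_features
        self.spk_down = nn.Linear(d_in, rank, bias=False)
        self.spk_up = nn.Linear(rank, d_out, bias=False)
        self.lang_down = nn.Linear(d_in, rank, bias=False)
        self.lang_up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.spk_up.weight)          # adapters start as no-ops
        nn.init.zeros_(self.lang_up.weight)

    def forward(self, x):
        return (self.base(x)
                + self.spk_up(self.spk_down(x))
                + self.lang_up(self.lang_down(x)))
```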
The experiments conducted on the TidyVoice benchmark are robust, with a clear focus on evaluating the proposed framework against established baselines. The use of multiple backbones and the systematic analysis of different configurations provide strong evidence for the effectiveness of the Dual-LoRA approach. The reported results, including the significant reduction in EER, particularly in challenging scenarios, underscore the practical impact of the proposed method.
The paper provides sufficient implementation details, including the architecture, training procedures, and hyperparameters, which facilitate reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for others to replicate the results independently.
One notable limitation is the reliance on a single benchmark dataset (TidyVoice) for evaluation, which may not fully capture the generalizability of the proposed method across diverse real-world scenarios. Additionally, while the paper addresses the issue of language-speaker entanglement, it does not explore potential biases that may arise from the training data or the implications of using specific languages.
The proposed Dual-LoRA framework has the potential to significantly enhance cross-lingual speaker verification systems, making them more effective for applications in voice authentication and personalization across different languages. This advancement could lead to broader adoption of voice-based technologies in multilingual contexts, improving accessibility and user experience. The paper presents Dual-LoRA, an innovative framework for cross-lingual speaker verification that effectively disentangles language and speaker identity, achieving notable performance improvements on benchmark evaluations. The comprehensive methodology and rigorous experimental validation contribute significantly to the field, addressing a critical challenge in speaker verification systems.
Conventional neural speech codecs suffer from severe intelligibility degradation at ultra-low bitrates, where the bottleneck transitions from acoustic distortion to semantic loss. To address this issue, this paper conducts a systematic investigation into the role and fundamental limits of integrating frozen semantic priors -- specifically HuBERT and Whisper -- into neural speech coding. We introduce and quantitatively validate a novel Semantic Retirement phenomenon: while semantic constraints reduce the Word Error Rate (WER) by up to ~10% relative at 1.5 kbps, their benefits rapidly diminish beyond 6 kbps, indicating a practical capacity boundary. We further uncover a clear trade-off between different prior types: acoustic-rich priors (HuBERT) better preserve prosodic and timbral details, whereas high-level linguistic priors (Whisper) effectively suppress phonetic hallucinations in noisy environments (reducing hallucination rates by 26%) and substantially narrow the generalization gap for unseen speakers. Building on these findings, we propose a bitrate-aware regulation strategy that dynamically adjusts prior strength to optimize the trade-off between semantic consistency and perceptual naturalness. Extensive experimental evaluations confirm that our approach achieves competitive intelligibility and noise robustness compared to existing baselines, offering a principled pathway toward ultra-low-bitrate generative speech coding.
Primary: Tsinghua Shenzhen International Graduate School, Tsinghua University
All Institutions: Tsinghua Shenzhen International Graduate School, Tsinghua University, Tencent
This paper presents a comprehensive analysis of the role of semantic priors in neural speech coding, introducing a novel framework that enhances intelligibility and robustness at ultra-low bitrates. The innovative methodology and thorough experimental evaluation contribute significantly to the field of audio processing, addressing a critical challenge in speech codec design.
The methodology presented in this paper is robust and well-structured. The authors propose a novel framework that integrates frozen semantic priors (HuBERT and Whisper) into a neural speech codec, addressing the challenges of intelligibility degradation at ultra-low bitrates. The introduction of the "Semantic Retirement" phenomenon is a significant contribution, as it quantitatively defines the limits of semantic guidance in speech coding. The bitrate-aware regulation strategy is particularly innovative, allowing the model to dynamically adjust the strength of semantic constraints based on the bitrate, which is a practical approach to optimize performance across varying conditions.
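The regulation strategy amounts to a bitrate-dependent weight on the semantic-prior loss term. A toy schedule consistent with the reported numbers (full strength at 1.5 kbps, retired by roughly 6 kbps) is sketched below; the functional form is an assumption, not the paper's.

```python
def semantic_weight(bitrate_kbps: float, w_max: float = 1.0,
                    knee: float = 1.5, retire: float = 6.0) -> float:
    """Toy schedule for the semantic-prior loss weight: full strength at
    ultra-low bitrates, linearly decaying to zero by the ~6 kbps point
    where the paper reports the priors' benefit vanishing."""
    if bitrate_kbps <= knee:
        return w_max
    if bitrate_kbps >= retire:
        return 0.0
    return w_max * (retire - bitrate_kbps) / (retire - knee)

# The total training loss would then take the form:
#   loss = recon_loss + semantic_weight(bitrate) * semantic_consistency_loss
```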
The experimental evaluation is extensive and well-executed, utilizing the LibriSpeech dataset to validate the proposed framework. The authors provide a thorough analysis of the performance metrics, including Word Error Rate (WER), Perceptual Evaluation of Speech Quality (PESQ), and robustness against noise. The results convincingly demonstrate the effectiveness of the proposed method in improving intelligibility and reducing hallucination rates, particularly in low-bitrate scenarios. The ablation studies further strengthen the findings by isolating the effects of different semantic priors and the regulation strategy.
The paper includes sufficient implementation details, such as the architecture of the neural codec, the configuration of the Residual Vector Quantization, and the training setup. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Providing access to the models and datasets used would enhance the ability of other researchers to replicate and build upon this work.
One limitation is the reliance on frozen semantic priors, which may not capture the full range of acoustic nuances needed for optimal performance in all scenarios. Additionally, the paper primarily focuses on two specific priors (HuBERT and Whisper), which may limit the generalizability of the findings to other types of semantic guidance. The authors also acknowledge the potential for over-smoothing at higher bitrates, which could affect the naturalness of the output.
The findings of this research have significant implications for the development of efficient speech coding systems, particularly in applications where bandwidth is severely limited, such as mobile communications and low-bitrate streaming services. The insights gained from the "Semantic Retirement" phenomenon could inform future research on codec design and the integration of semantic information into other audio processing tasks. The approach could also pave the way for advancements in speech synthesis and recognition systems that require high intelligibility in challenging acoustic environments. This paper presents a comprehensive analysis of the role of semantic priors in neural speech coding, introducing a novel framework that enhances intelligibility and robustness at ultra-low bitrates. The innovative methodology and thorough experimental evaluation contribute significantly to the field of audio processing, addressing a critical challenge in speech codec design.
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing challenging hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
Primary: Southern University of Science and Technology
All Institutions: Southern University of Science and Technology, Tencent Youtu Lab
The main contribution of this paper is the development of a robust framework for Audio Deepfake Detection that leverages hard sample classification and diffusion-based reconstruction to enhance generalization against unseen attacks. This work represents a meaningful advancement in the field of audio deepfake detection, addressing critical challenges posed by evolving generative models.
The paper proposes a novel framework for Audio Deepfake Detection (ADD) that emphasizes hard sample classification and utilizes diffusion-based reconstruction methods. The integration of multi-layer feature aggregation and the introduction of Regularization-Assisted Contrastive Learning (RACL) are significant contributions that enhance the model's generalization capabilities. The methodology is well-structured, with clear explanations of the reconstruction paradigms and loss functions employed. However, while the approach is innovative, it builds on existing concepts in contrastive learning and reconstruction, which slightly limits its novelty.
The experiments are comprehensive, evaluating the proposed methods across multiple datasets, including ASVspoof and CodecFake. The results demonstrate a significant reduction in the average Equal Error Rate (EER) compared to baseline models, showcasing the effectiveness of the proposed framework. The ablation studies provide insights into the contributions of different components of the methodology, reinforcing the validity of the findings. However, the paper could benefit from a more detailed analysis of potential edge cases or scenarios where the model may underperform.
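For readers unfamiliar with the headline metric, the Equal Error Rate is the operating point where the false-accept and false-reject rates coincide. A standard way of computing it from detector scores (independent of this paper's models) looks like:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER: the threshold at which false-accept and false-reject rates meet.

    `labels` are 1 for bona fide and 0 for spoofed audio; `scores` are
    higher-is-more-bonafide detector outputs.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

# Toy usage: a detector that separates the two classes perfectly.
scores = np.array([0.9, 0.8, 0.75, 0.3, 0.4, 0.2])
labels = np.array([1, 1, 1, 0, 0, 0])
print(equal_error_rate(scores, labels))  # 0.0 on this separable toy data
```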
The implementation details are sufficiently thorough, covering data preprocessing, model architecture, and training parameters, which enhances reproducibility. However, the absence of a publicly available code repository or demo limits other researchers' ability to replicate the results directly.
One limitation is the reliance on specific reconstruction methods, which may not generalize well across all types of audio deepfakes. Additionally, the performance on certain datasets showed minor degradation, suggesting that the model may prioritize generalization over specific artifacts. The paper could also discuss potential biases in the datasets used for training and evaluation.
The implications of this research are significant, particularly in the context of security and misinformation, as robust audio deepfake detection systems are crucial for maintaining trust in audio communications. The proposed framework could be applied in various domains, including cybersecurity, media verification, and social media platforms, where audio authenticity is paramount.
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces prove unsuitable for zero-shot similarity evaluation. Representational limitations allow linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals that the metric rewards acoustic mimicry over genuine emotional synthesis.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Southern California
The paper critically examines the limitations of the emotion similarity metric EMO-SIM in evaluating emotional expressiveness in speech generation, revealing its misalignment with human perception and robustness issues. This comprehensive analysis challenges existing methodologies and underscores the need for improved evaluation frameworks in the field.
The paper employs a systematic approach to evaluate the limitations of the widely adopted EMO-SIM metric for emotional expressiveness in speech generation. It rigorously tests the metric against three criteria: categorical emotion robustness, dimensional emotion sensitivity, and human perception alignment. The methodology includes adversarial sampling, calibration of latent spaces, and a comprehensive evaluation against human judgments, which is a significant strength. However, the lack of a clear new metric or framework to replace EMO-SIM is a notable gap.
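The metric under scrutiny is simple to state: cosine similarity between emotion embeddings of the reference and generated utterances. A sketch, assuming an embedding extractor such as emotion2vec is available elsewhere:

```python
import numpy as np

def emo_sim(ref_emb, gen_emb):
    """Cosine similarity between the emotion embeddings of reference and
    generated speech -- the EMO-SIM-style score the paper critiques."""
    ref = ref_emb / np.linalg.norm(ref_emb)
    gen = gen_emb / np.linalg.norm(gen_emb)
    return float(ref @ gen)
```

The paper's central finding is precisely that this single scalar conflates speaker and linguistic similarity with affect.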
The experiments are well-designed, utilizing diverse datasets and multiple evaluation scenarios to assess the performance of EMO-SIM. The results consistently demonstrate the metric's inadequacy in capturing genuine emotional expressiveness, particularly under various acoustic and linguistic distractors. The statistical analyses, including Spearman's correlation and triplet accuracy, provide robust evidence of the findings. However, the paper could benefit from additional comparisons with existing metrics to contextualize its claims further.
The paper provides sufficient detail on the experimental setup, including dataset preparation and evaluation criteria, which aids reproducibility. However, the absence of publicly available code or datasets limits other researchers' ability to replicate the findings fully.
The primary limitation is the lack of a proposed alternative metric to EMO-SIM, which leaves a gap in practical applicability. Additionally, the focus on a single metric may overlook other potential evaluation frameworks that could be more effective. The experiments also rely heavily on subjective human evaluations, which may introduce variability.
This work has significant implications for the development of more reliable metrics in speech synthesis and emotional voice conversion, which are critical for applications in human-computer interaction, entertainment, and accessibility technologies. By highlighting the deficiencies of current evaluation methods, it encourages the community to pursue more accurate and meaningful metrics for emotional expressiveness in generated speech.
Conversational multimodal understanding aims to infer the meaning or label of the current utterance from its preceding dialogue context together with textual, acoustic, and visual signals. Existing methods mainly strengthen contextual modeling through enhanced encoding, fusion, or propagation, but rarely abstract the context-utterance dependency into an explicit cue and incorporate it into later multimodal reasoning. To address this issue, we propose CUCI-Net for conversational multimodal understanding. CUCI-Net fully preserves the structural distinction between context and utterance during encoding, effectively abstracts their dependency into an interpretation cue by combining local modality evidence with global contextual evidence, and seamlessly integrates the resulting cue into the final multimodal interaction stage for context-conditioned prediction. Extensive experiments on mainstream benchmark datasets fully demonstrate the effectiveness of the proposed method.
Primary: Zhejiang University
All Institutions: Nanjing University, Zhejiang University
The main contribution of this paper is the introduction of CUCI-Net, a novel framework for conversational multimodal understanding that effectively preserves the context-utterance structure and utilizes an interpretation cue to guide multimodal reasoning, leading to improved performance in sarcasm detection tasks. This work significantly advances the state of the art in multimodal dialogue understanding by addressing key limitations in existing methodologies.
The proposed CUCI-Net introduces a three-stage framework that emphasizes the preservation of context-utterance structure, the abstraction of context-utterance dependencies into an interpretation cue, and the integration of this cue into multimodal reasoning. This methodology is innovative as it directly addresses the limitations of existing models that often overlook the explicit context-utterance relationship in multimodal dialogue understanding. The use of dual-expert encoders and the structured approach to cue-guided interaction represent a significant advancement in the field.
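As a rough structural illustration (not the authors' architecture), cue-conditioned fusion can be reduced to a gate computed from pooled context features and applied to the fused utterance representation; all layer shapes below are invented for the example.

```python
import torch
import torch.nn as nn

class CueGatedFusion(nn.Module):
    """Toy illustration of cue-conditioned multimodal fusion.

    The real CUCI-Net design is more involved; here the "interpretation
    cue" is just a projection of pooled context features, used to gate
    the utterance's fused text/audio/visual features before prediction.
    """
    def __init__(self, dim=256, n_classes=2):
        super().__init__()
        self.cue_proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, context_pooled, utterance_fused):
        cue = torch.tanh(self.cue_proj(context_pooled))           # abstracted dependency
        g = self.gate(torch.cat([cue, utterance_fused], dim=-1))  # context-conditioned gate
        return self.classifier(g * utterance_fused)
```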
The experiments conducted on the MUStARD and MUStARD++ datasets demonstrate the effectiveness of CUCI-Net, achieving superior performance compared to various strong baselines. The results are rigorously reported, with metrics such as Precision, Recall, and F1-score, and the ablation studies provide clear insights into the contributions of each component of the model. This thorough evaluation strengthens the claims made regarding the model's effectiveness.
The paper provides detailed implementation details, including architecture specifications, optimization settings, and feature extraction methods. However, the absence of a public code repository or demo URL limits the reproducibility of the results, as others cannot easily replicate the experiments or validate the findings independently.
One notable limitation is the reliance on specific datasets (MUStARD and MUStARD++) that may not fully represent the diversity of conversational contexts in real-world applications. Additionally, while the model excels in sarcasm detection, its performance on other forms of non-literal expressions or more complex conversational dynamics remains to be thoroughly evaluated.
The advancements presented in CUCI-Net have potential applications in various domains, including conversational AI, sentiment analysis, and multimodal interaction systems. By improving context-dependent understanding in dialogue systems, this research can enhance user experiences in virtual assistants, customer service bots, and social robots, contributing to more natural and effective human-computer interactions.
Multimodal Sentiment Analysis (MSA) requires integrating language, acoustic, and visual signals without sacrificing modality-specific sentiment evidence. Existing methods mainly improve either shared-private decomposition or cross-modal interaction. Although effective, both ultimately depend on how shared and modality-specific evidence is organized before prediction. We observe that, under standard shared-private pipelines, modality heterogeneity often induces a branch-imbalance process: dominant shared patterns accumulate in the shared branch, yielding redundant and modality-biased evidence, while repeated interaction and rigid alignment gradually leak shared information into modality-specific channels and weaken discriminative private representations. As a result, the complementarity between shared and private representations is reduced, limiting robust sentiment reasoning. To address this issue, we propose the Dual-Branch Rebalancing Framework (DBR) on top of a standard multimodal decoupling stage. In the shared branch, a Temporal-Structural Factorization (TSF) module disentangles temporal evolution from structural dependencies and adaptively integrates them to reduce shared redundancy. In the private branch, an Anchor-Guided Private Routing (AGPR) module preserves discriminative modality-specific patterns while allowing controlled cross-modal borrowing. A Bidirectional Rebalancing Fusion (BRF) module then reunifies the two regularized branches in a context-aware manner for final prediction. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that DBR consistently outperforms the compared baselines. Further analyses show that these improvements come from coordinated mitigation of branch imbalance.
Primary: Fudan University
All Institutions: China University of Petroleum-Beijing at Karamay, Fudan University, Peking University, University of Southern California, University of Macau
The paper presents a comprehensive framework for addressing shared-private branch imbalance in multimodal sentiment analysis, contributing valuable insights and methodologies to the field. The innovative approach and rigorous experimental validation position this work as a significant advancement in multimodal representation learning.
The proposed Dual-Branch Rebalancing Framework (DBR) introduces a novel approach to mitigating shared-private branch imbalance in multimodal sentiment analysis. The methodology is well-structured, comprising three main components: Temporal-Structural Factorization (TSF) to disentangle shared representations, Anchor-Guided Private Routing (AGPR) to maintain modality-specific features, and Bidirectional Rebalancing Fusion (BRF) for effective integration. This coordinated design addresses the inherent challenges of modality heterogeneity and redundancy, showcasing a clear understanding of the complexities involved in multimodal representation learning.
The experimental evaluation is robust, utilizing multiple widely recognized benchmarks (CMU-MOSI, CMU-MOSEI, and MIntRec) to validate the effectiveness of DBR. The results demonstrate significant improvements over state-of-the-art baselines across various metrics, indicating the proposed framework's strong performance. The ablation studies further substantiate the contributions of each module, providing insights into their individual impacts on overall performance.
The paper provides sufficient implementation details, including the use of PyTorch, training configurations, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository or demo limits the practical reproducibility of the results.
While the framework shows promising results, the paper does not address potential limitations such as the scalability of the model to larger datasets or the computational efficiency of the proposed modules. Additionally, the reliance on specific benchmarks may not fully capture the generalizability of the approach across diverse multimodal tasks.
The findings of this research have significant implications for the field of multimodal sentiment analysis, particularly in applications involving human-centered AI systems. By improving the integration of diverse modalities, the proposed framework can enhance the robustness of sentiment prediction in real-world scenarios, potentially benefiting areas like social media analysis, customer feedback interpretation, and emotional AI.
Preserving a speaker's voice identity while generating speech in a different language remains a fundamental challenge in spoken language technology, particularly in specialized domains such as scientific communication. In this paper, we address this challenge through our system submission to the International Conference on Spoken Language Translation (IWSLT 2026), the Cross-Lingual Voice Cloning shared task. First, we evaluate several state-of-the-art voice cloning models for cross-lingual speech generation of scientific texts in Arabic, Chinese, and French. Then, we build voice cloning systems based on the OmniVoice foundation model. We employ data augmentation via multi-model ensemble distillation from the ACL 60/60 corpus. We investigate the effect of using this synthetic data for fine-tuning, demonstrating consistent improvements in intelligibility (WER and CER) across languages while preserving speaker similarity.
Primary: Shaggar Institute of Technology
All Institutions: Shaggar Institute of Technology, Trinity College Dublin
The paper presents a novel approach to cross-lingual voice cloning, demonstrating significant advancements in intelligibility and speaker similarity while addressing the challenges of data scarcity in specialized domains. The methodology and results contribute meaningfully to the field of spoken language technology, particularly in the context of scientific communication.
The paper employs a robust methodology by leveraging ensemble distillation from multiple state-of-the-art voice cloning models to generate high-fidelity synthetic datasets for fine-tuning. The use of Parameter-Efficient Fine-Tuning (LoRA) is particularly noteworthy, allowing the authors to adapt a large foundation model to specific languages while preserving speaker identity. The approach is well-structured, with clear delineation of the data processing, training configuration, and inference pipeline.
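A minimal sketch of such a LoRA setup using the Hugging Face peft library; `base_model` is a placeholder for the OmniVoice backbone loaded elsewhere, and the rank and `target_modules` names are assumptions, since the paper's exact configuration is not reproduced here.

```python
from peft import LoraConfig, get_peft_model

# `base_model` is a placeholder for the foundation TTS model loaded elsewhere.
lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension, assumed
    lora_alpha=32,                        # scaling factor, assumed
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections, assumed
)
model = get_peft_model(base_model, lora_cfg)  # wraps the frozen base
model.print_trainable_parameters()            # only the adapter weights train
```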
The experiments are comprehensive, utilizing a well-defined dataset (ACL 60/60) and a variety of evaluation metrics (WER, CER, and speaker similarity). The results demonstrate consistent improvements in intelligibility and speaker similarity across the three target languages, validating the effectiveness of the proposed methods. The comparative analysis with existing models further strengthens the findings.
The authors provide a public code repository that includes details on data preparation, training, and evaluation, enhancing the reproducibility of their work. However, the limited scale of the dataset may pose challenges for others attempting to replicate the results at a larger scale.
The study is constrained by the size of the distilled training dataset (1,404 samples), which may limit the generalizability of the findings. Additionally, the reliance on automated metrics for evaluation may not fully capture the perceptual quality of synthesized speech, and the paper acknowledges the risks associated with voice cloning technology.
This research has significant implications for enhancing accessibility in scientific communication across different languages, potentially democratizing knowledge dissemination. However, the ethical considerations surrounding voice cloning technologies, such as the potential for misuse, underscore the need for responsible deployment and robust safeguards.
Standard text-to-speech (TTS) evaluation measures intelligibility (WER, CER) and overall naturalness (MOS, UTMOS) but does not quantify accent. A synthesiser may score well on all four yet sound non-native on features that are phonemic in the target language. For Indic languages, these features include retroflex articulation, aspiration, vowel length, and the Tamil retroflex approximant (letter zha). We present PSP, the Phoneme Substitution Profile, an interpretable, per-phonological-dimension accent benchmark for Indic TTS. PSP decomposes accent into six complementary dimensions: retroflex collapse rate (RR), aspiration fidelity (AF), vowel-length fidelity (LF), Tamil-zha fidelity (ZF), Frechet Audio Distance (FAD), and prosodic signature divergence (PSD). The first four are measured via forced alignment plus native-speaker-centroid acoustic probes over Wav2Vec2-XLS-R layer-9 embeddings; the latter two are corpus-level distributional distances. In this v1 we benchmark four commercial and open-source systems (ElevenLabs v3, Cartesia Sonic-3, Sarvam Bulbul, Indic Parler-TTS) on Hindi, Telugu, and Tamil pilot sets, with a fifth system (Praxy Voice) included on all three languages, plus an R5->R6 case study on Telugu. Three findings: (i) retroflex collapse grows monotonically with phonological difficulty Hindi < Telugu < Tamil (~1%, ~40%, ~68%); (ii) PSP ordering diverges from WER ordering -- commercial WER-leaders do not uniformly lead on retroflex or prosodic fidelity; (iii) no single system is Pareto-optimal across all six dimensions. We release native reference centroids (500 clips per language), 1000-clip embeddings for FAD, 500-clip prosodic feature matrices for PSD, 300-utterance golden sets per language, scoring code under MIT, and centroids under CC-BY. Formal MOS-correlation is deferred to v2; v1 reports five internal-consistency signals plus a native-audio sanity check.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents a novel accent evaluation benchmark for Indic TTS systems, offering a detailed and interpretable framework that enhances the understanding of accent fidelity in synthesized speech. The innovative methodology and significant findings position this work as a valuable contribution to the field of machine learning and speech synthesis.
The paper proposes a novel framework, the Phoneme Substitution Profile (PSP), which quantitatively evaluates accent fidelity in Indic languages for TTS systems. The methodology is robust, utilizing a combination of acoustic probes and distributional metrics to capture phonological dimensions of accent. The use of Wav2Vec2 embeddings for forced alignment and the construction of native speaker centroids are particularly innovative, allowing for a detailed analysis of accent features that are often overlooked in traditional TTS evaluations. The six dimensions of accent fidelity (RR, AF, LF, ZF, FAD, PSD) provide a comprehensive approach to understanding TTS performance across different systems and languages.
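One plausible reading of the centroid-probe scores, sketched for the retroflex collapse rate (RR): classify each intended-retroflex phone by whichever native-speaker centroid its embedding is closer to. The distance measure and aggregation rule here are assumptions, not the released scoring code.

```python
import numpy as np

def retroflex_collapse_rate(phone_embs, retro_centroid, dental_centroid):
    """Fraction of intended-retroflex phones whose embedding lies closer
    (by cosine similarity) to the native dental centroid than to the
    retroflex centroid. Embeddings would be, e.g., Wav2Vec2 layer-9 means
    over forced-aligned spans; centroids come from native reference clips.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    collapsed = [cos(e, dental_centroid) > cos(e, retro_centroid) for e in phone_embs]
    return float(np.mean(collapsed))
```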
The experiments benchmark four commercial and open-source TTS systems across Hindi, Telugu, and Tamil, showcasing the effectiveness of the PSP framework. The findings reveal significant insights into the performance of these systems, particularly the divergence between traditional intelligibility metrics (WER) and the proposed accent fidelity metrics. The detailed analysis of results across different languages highlights the varying challenges posed by phonological complexity, making the evaluation both thorough and insightful.
The authors have made a commendable effort to ensure reproducibility by releasing the scoring code and native speaker centroids under open-source licenses. However, the reliance on specific aligners and the current limitations in the quality of these tools may affect the reproducibility of results, particularly for Telugu and Tamil. Future versions of the benchmark are expected to address these issues, enhancing the overall reproducibility.
The paper acknowledges several limitations, including the dependency on forced alignment accuracy, which varies by language, and the potential noise floor in per-phoneme scores. The authors also note that the current version of the PSP does not include formal MOS calibration, which is essential for validating the proposed metrics against human judgment. Additionally, the limited size of pilot sets may affect the statistical significance of some findings.
The PSP framework has the potential to significantly impact the development of TTS systems for Indic languages, providing a much-needed tool for developers to optimize accent fidelity. By focusing on specific phonological features, the framework can help improve the naturalness and intelligibility of synthesized speech, making it more accessible to native speakers. This work also opens avenues for further research into accent evaluation in other languages and dialects, contributing to the broader field of speech synthesis.
Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.
Primary: StepFun
All Institutions: StepFun, Nanyang Technological University, University of New South Wales, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of Step-Audio-R1.5, a novel audio reasoning model that integrates RLHF to enhance the quality of multi-turn dialogues, addressing the limitations of existing models that prioritize isolated correctness over conversational naturalness. This work represents a significant step forward in developing more empathetic and engaging audio interaction systems, setting a new standard for future research in audio language models.
The methodology is robust and innovative, introducing a new paradigm in audio language models by integrating Reinforcement Learning from Human Feedback (RLHF) to address the limitations of Reinforcement Learning with Verified Rewards (RLVR). The paper effectively outlines a structured approach that includes a mid-training stage, cold-start supervised fine-tuning, and a novel reward model that captures both explicit and implicit quality metrics. This combination is significant as it aims to enhance the naturalness and emotional engagement of audio interactions, which is a critical aspect often overlooked in traditional models.
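The reward-model component can be grounded with the standard Bradley-Terry preference objective used across RLHF pipelines. This is a generic sketch of that objective; the paper's reward model additionally folds in implicit acoustic-quality signals not modeled here.

```python
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry objective for RLHF reward-model training: push the
    scalar reward of the human-preferred response above the rejected one.
    A generic sketch, not Step-Audio-R1.5's exact reward formulation."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```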
The experimental evaluation is comprehensive, utilizing multiple benchmarks, including the newly proposed AudioMultiChallenge and Step-Caption, which are well-designed to assess various dimensions of audio reasoning and dialogue quality. The results indicate that Step-Audio-R1.5 performs competitively against leading models, demonstrating significant improvements in multi-turn dialogue scenarios. The use of diverse datasets and rigorous evaluation metrics strengthens the findings.
The paper provides a clear description of the architecture and training process, which aids in reproducibility. However, it lacks detailed implementation specifics such as hyperparameters and training duration, which are essential for fully replicating the experiments. The availability of the project URL is a positive aspect, as it may contain additional resources for implementation.
One limitation is the potential over-reliance on human feedback, which may introduce biases based on the evaluators' preferences. Additionally, while the model shows improvements in conversational quality, the paper does not extensively discuss how it handles edge cases or unexpected user inputs, which are common in real-world applications.
The proposed model has the potential to significantly advance the field of audio language processing by improving user interactions in conversational AI systems. This could lead to more engaging and emotionally aware audio applications in various domains, including virtual assistants, customer service, and entertainment.
Generating symphonic music requires simultaneously managing high-level structural form and dense, multi-track orchestration. Existing symbolic models often struggle with a "complexity-control imbalance", in which scaling bottlenecks limit long-term granular steerability. We present SymphonyGen, a 3D hierarchical framework for contemporary cinematic orchestration. SymphonyGen employs a cascading decoder architecture that decomposes the Bar, Track, and Event axes, improving computational efficiency and scalability over conventional 1D or 2D models. We introduce "short-score" conditioning via a beat-quantized multi-voice harmony skeleton, enabling outline control while preserving textural diversity. The model is further refined using Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward, aligning symbolic output with modern acoustic expectations. Additionally, we implement a dissonance-averse sampling algorithm to suppress unintended tonal clashes during inference. Objective evaluations show that both reinforcement learning and dissonance-averse sampling effectively enhance harmonic cleanliness while maintaining melodic expression. Subjective evaluations demonstrate that SymphonyGen outperforms baselines in musicality and preference for orchestral music generation. Demo page: https://symphonygen.github.io/
Primary: Central Conservatory of Music
All Institutions: Frontier Institute of Science and Technology, Central Conservatory of Music, Department of AI Music and Music Information Technology, Shenzhen University, Interdisciplinary Research Center
The main contribution of this paper is the development of SymphonyGen, a novel 3D hierarchical framework for orchestral music generation that effectively addresses the complexities of high-level structural form and dense orchestration. This work represents a substantial advancement in the field of AI music generation, combining innovative methodologies with rigorous evaluation to produce a system that aligns closely with modern acoustic expectations.
The paper introduces a 3D hierarchical architecture that effectively manages the complexities of orchestral music generation by decomposing the task into Bar, Track, and Event levels. This cascading decoder architecture enhances computational efficiency and scalability, which is a significant improvement over conventional models. The introduction of a "short-score" conditioning via a beat-quantized multi-voice harmony skeleton is innovative, allowing for greater control over the generated music while maintaining textural diversity. The use of Group Relative Policy Optimization (GRPO) with a cross-modal audio-perceptual reward is a novel approach that aligns the generated symbolic output with acoustic expectations, addressing the limitations of previous models. The dissonance-averse sampling algorithm further refines the output by suppressing unintended tonal clashes, showcasing a thoughtful integration of music theory into the generative process.
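The dissonance-averse sampler can be imagined as a logit filter applied before each decoding step. The sketch below penalises note tokens that form a minor second (1 or 11 semitones mod octave) with any currently sounding pitch; the token-to-pitch map and penalty value are illustrative, not the paper's exact rule.

```python
import torch

def dissonance_averse_sample(logits, active_pitches, note_token_pitch, penalty=4.0):
    """Bias decoding away from tonal clashes: down-weight note tokens that
    clash with any sounding pitch, then sample from the adjusted distribution.
    `note_token_pitch` maps token ids to MIDI pitch numbers (assumed)."""
    logits = logits.clone()
    for tok, pitch in note_token_pitch.items():
        if any(abs(pitch - p) % 12 in (1, 11) for p in active_pitches):
            logits[tok] -= penalty
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, 1).item()
```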
The experimental design is robust, featuring both objective and subjective evaluations. The use of a large dataset (SymphonyNet) for training and validation ensures that the model is well-tested across various orchestral styles. Objective metrics such as harmony precision, recall, and dissonance scores provide quantitative assessments of the model's performance, while subjective evaluations involving listener preferences add qualitative insights. The results indicate that SymphonyGen outperforms baseline models in terms of musicality and preference, particularly among general listeners, which is a strong endorsement of its effectiveness.
The paper provides detailed implementation information, including architecture specifications, training procedures, and evaluation metrics. However, the absence of a publicly available code repository limits reproducibility. The authors mention that implementation details will be available in their codebase, but without immediate access, it is challenging to fully assess reproducibility.
The paper acknowledges some limitations, such as the potential for "strange" harmonies or "noisy" segments in the generated music, which may stem from errors in harmony skeleton generation. Additionally, the subjective evaluations indicate that while the model performs well, it may still produce overly full orchestrations at times, suggesting room for improvement in balancing orchestration richness with clarity.
SymphonyGen has significant implications for the field of AI-assisted music composition, particularly in cinematic orchestration. By providing a controllable framework for composers, it enhances the collaborative potential between human creativity and AI-generated music. The model's ability to produce high-quality orchestral compositions could benefit various applications, including film scoring, video game music, and other multimedia projects, ultimately enriching the landscape of contemporary music creation.
The joint training of speech enhancement and speaker embedding networks for speaker recognition is widely adopted in noisy acoustic environments. While effective, this paradigm often fails to leverage the generalization and robustness benefits inherent in large-scale speech enhancement pre-training. Moreover, maintaining the speaker information in the denoised speech is not an explicit objective of the speech enhancement process. To address these limitations, we propose a scalable UNet-based Fusion framework (UF-EMA) that treats the noisy and enhanced speech as a multi-channel input, thereby enabling the speaker encoder to exploit speaker information effectively. In addition, an Exponential Moving Average strategy is applied to a speaker encoder pre-trained on clean speech to mitigate overfitting and facilitate a smooth transition from clean to noisy conditions. Experimental results on multiple noise-contaminated test sets showcase the superiority of the proposed approach.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University, Centre for Speech Technology Research, The University of Edinburgh, The University of Hong Kong
This paper presents a novel approach to speaker recognition in noisy environments by integrating a UNet-based fusion framework with an Exponential Moving Average strategy for speaker encoder adaptation. The technical contributions are well-founded and address critical challenges in the field, showcasing the potential for improved performance in practical applications.
The proposed methodology introduces a UNet-based fusion framework (UF-EMA) that effectively integrates noisy and enhanced speech signals to improve speaker recognition performance in noisy environments. The use of multi-channel input allows the speaker encoder to retain speaker-specific information, which is often lost in traditional approaches. The incorporation of an Exponential Moving Average strategy for updating the speaker encoder is a novel approach that addresses the challenges of overfitting and adaptation to varying noise conditions. The methodology is well-structured and provides a clear rationale for the design choices made, supported by a comprehensive theoretical background.
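The EMA step itself is compact; a sketch of the update, with the momentum value assumed rather than taken from the paper:

```python
import torch

@torch.no_grad()
def ema_update(ema_encoder, live_encoder, momentum=0.999):
    """Exponential moving average of speaker-encoder weights:
    theta_ema <- m * theta_ema + (1 - m) * theta_live.
    Slows drift away from the clean-speech pre-training during the
    clean-to-noisy transition (momentum value assumed)."""
    for p_ema, p_live in zip(ema_encoder.parameters(), live_encoder.parameters()):
        p_ema.mul_(momentum).add_(p_live, alpha=1 - momentum)
```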
The experimental evaluation is robust, utilizing multiple noise-contaminated test sets to validate the proposed method's effectiveness. The results demonstrate a significant improvement in performance compared to existing methods, with lower Equal Error Rates (EER) across various conditions. The ablation studies provide insights into the contributions of individual components, reinforcing the effectiveness of the proposed fusion and EMA strategies. However, the paper could benefit from additional qualitative assessments, such as subjective listening tests, to complement the quantitative metrics.
The paper provides a detailed description of the experimental setup, including the datasets used (VoxCeleb1 and Vox1-O), the training process, and the evaluation metrics (EER). However, there is a lack of publicly available code or datasets, which may hinder reproducibility. Clear instructions for replicating the experiments would enhance the paper's impact.
One limitation is the reliance on pre-trained speech enhancement models, which may not be universally applicable across all domains or languages. Additionally, while the proposed method shows improvements in noisy conditions, it may still struggle in extreme noise scenarios or with overlapping speakers. The paper does not address potential computational costs associated with the proposed methods, which could affect real-time applications.
The proposed framework has significant implications for real-world applications in speaker recognition systems, particularly in environments with background noise, such as call centers, security systems, and personal assistants. By improving the robustness of speaker recognition, this research could enhance user experience and accessibility in various audio processing applications.
Recent audio-aware large language models (ALLMs) have demonstrated strong capabilities across diverse audio understanding and reasoning tasks, but they still frequently produce hallucinated or overly confident outputs. While uncertainty estimation has been extensively studied in text-only LLMs, it remains largely unexplored for ALLMs, where audio-conditioned generation introduces additional challenges such as perceptual ambiguity and cross-modal grounding. In this work, we present the first systematic empirical study of uncertainty estimation in ALLMs. We benchmark five representative methods, including predictive entropy, length-normalized entropy, semantic entropy, discrete semantic entropy, and P(True), across multiple models and diverse evaluation settings spanning general audio understanding, reasoning, hallucination detection, and unanswerable question answering. Our results reveal two key findings. First, semantic-level and verification-based methods consistently outperform token-level baselines on general audio reasoning benchmarks. Second, on trustworthiness-oriented benchmarks, the relative effectiveness of uncertainty methods becomes notably more model- and benchmark-dependent, indicating that conclusions drawn from general reasoning settings do not straightforwardly transfer to hallucination and unanswerable-question scenarios. We further explore uncertainty-based adaptive inference as a potential downstream application. We hope this study provides a foundation for future research on reliable, uncertainty-aware audio-language systems.
Primary: National Taiwan University
All Institutions: National Taiwan University, Artificial Intelligence Center of Research Excellence (AI-CoRE)
This paper makes a significant contribution by systematically evaluating uncertainty estimation methods in audio-aware large language models, revealing critical insights that could guide future research and applications in multimodal AI systems. The comprehensive benchmarking and analysis of methods provide a valuable foundation for improving the reliability of ALLMs in practical scenarios.
The paper presents a systematic empirical study of uncertainty estimation methods tailored for audio-aware large language models (ALLMs). It benchmarks five distinct methods, including predictive entropy and semantic entropy, across various models and tasks, highlighting the unique challenges posed by audio inputs. The methodology is sound, employing a two-stage protocol for uncertainty estimation and a clear comparative analysis across multiple benchmarks. However, the reliance on existing methods from text-based LLMs without significant adaptation for audio-specific challenges could be seen as a limitation.
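Two of the benchmarked baselines are easy to state from sampled answers and their token log-probabilities. A sketch of the usual Monte-Carlo estimators, with the data layout assumed:

```python
import numpy as np

def predictive_entropy(seq_logprobs):
    """Monte-Carlo predictive entropy over N sampled answers:
    the negative mean of total sequence log-probabilities."""
    return -float(np.mean(seq_logprobs))

def length_normalized_entropy(token_logprobs_per_seq):
    """Same idea, but each sequence's log-probability is first divided
    by its token count, so longer answers are not penalised."""
    per_seq = [np.sum(lp) / len(lp) for lp in token_logprobs_per_seq]
    return -float(np.mean(per_seq))

# Toy usage with three sampled answers of different lengths.
samples = [np.array([-0.1, -0.2]), np.array([-0.3]), np.array([-0.05, -0.1, -0.2])]
print(length_normalized_entropy(samples))
```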
The experiments are comprehensive, covering a wide range of benchmarks that assess both general audio understanding and trustworthiness-oriented tasks. The results indicate that semantic-level and verification-based methods consistently outperform token-level baselines, providing valuable insights into the performance of uncertainty estimation in ALLMs. The evaluation metrics, including AUROC and AURAC, are appropriate for the tasks at hand.
While the paper provides a detailed description of the experimental setup, including the models used and the evaluation protocols, it lacks specific implementation details or code availability, which could hinder reproducibility. The absence of a project URL further complicates this aspect.
The study primarily focuses on constrained answer spaces, which may not generalize well to open-ended tasks. Additionally, the uncertainty estimation methods are largely inherited from text LLM literature, potentially limiting their effectiveness in capturing audio-specific uncertainties. The fixed threshold for adaptive inference may not be optimal across all scenarios, and the study does not explore more sophisticated routing strategies.
The findings have significant implications for the development of more reliable audio-language systems, particularly in applications requiring robust uncertainty estimation for decision-making. The work lays a foundation for future research in uncertainty-aware models, which could enhance the safety and reliability of AI systems in high-stakes environments.
To establish empathy with machines, it is essential to fully understand human emotional changes. However, research in multimodal emotion recognition often overlooks one problem: individual expressive traits vary significantly, meaning that different people may express the same emotion differently. We see this in daily life: some people express "happiness" through their facial expressions and words, while others may hide it or express it through their actions. Both are expressions of "happiness," yet such differences in emotional expression remain difficult for machines to distinguish. Current emotion recognition stays at a "static" level, using a single recognition model for all emotional styles, and this simplification often degrades recognition results, especially in multi-turn dialogues. To address this problem, this paper introduces a novel Multi-Level Speaker Adaptive Network (ML-SAN), which effectively addresses the challenge of speaker identity information confusion. ML-SAN does not simply assign a speaker's ID after recognition; instead, it employs a three-stage adaptive process. First, Input-level Calibration uses Feature-wise Linear Modulation (FiLM) to project the raw audio and visual features into a speaker-neutral space. Then, Interaction-level Gating re-adjusts the trust placed in each modality (e.g., voice or facial features) based on the speaker's identity information. Finally, Output-level Regularization maintains the consistency of speaker features in the latent space. Tests on the MELD and IEMOCAP datasets show that ML-SAN achieves better results, performs especially well on challenging tail sentiment categories, and better handles the diversity of speakers in real-world scenarios.
Primary: Xinjiang University
All Institutions: Joint Research Laboratory for Embodied Intelligence, Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, School of Computer Science and Technology, Xinjiang University
The main contribution of this paper is the introduction of the Multi-Level Speaker-Adaptive Network (ML-SAN), which effectively addresses speaker heterogeneity in multimodal emotion recognition through a novel three-stage adaptive process. This work represents a significant advancement in the field of emotion recognition by integrating speaker identity into the modeling process, thereby improving the accuracy and robustness of emotion detection in conversations.
The proposed ML-SAN framework introduces a three-stage adaptive process that effectively addresses the challenges of speaker identity confusion in emotion recognition. The use of Feature-wise Linear Modulation (FiLM) for input calibration, dynamic gating for interaction-level adjustments, and output regularization to maintain speaker identity showcases a thoughtful and innovative approach to handling multimodal data. This hierarchical adaptation strategy is a significant advancement over traditional speaker-agnostic methods, as it actively incorporates speaker characteristics into the model's decision-making process.
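The input-level calibration is a standard FiLM layer conditioned on a speaker embedding; a sketch with placeholder dimensions:

```python
import torch
import torch.nn as nn

class SpeakerFiLM(nn.Module):
    """Feature-wise Linear Modulation conditioned on a speaker embedding:
    y = gamma(s) * x + beta(s). In ML-SAN this corresponds to the
    input-level calibration step; the dimensions here are placeholders."""
    def __init__(self, feat_dim=128, spk_dim=64):
        super().__init__()
        self.to_gamma = nn.Linear(spk_dim, feat_dim)
        self.to_beta = nn.Linear(spk_dim, feat_dim)

    def forward(self, x, spk_emb):
        # x: (batch, time, feat_dim); spk_emb: (batch, spk_dim)
        gamma = self.to_gamma(spk_emb).unsqueeze(1)  # broadcast over time
        beta = self.to_beta(spk_emb).unsqueeze(1)
        return gamma * x + beta
```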
The experiments conducted on the MELD and IEMOCAP datasets demonstrate the effectiveness of the ML-SAN model, achieving superior performance compared to the baseline MultiEMO. The rigorous evaluation, including ablation studies to analyze the contribution of each component, adds credibility to the findings. The reported metrics, such as the weighted F1-score, indicate that the model performs well, particularly in challenging scenarios involving diverse emotional expressions.
The paper provides sufficient details regarding the experimental setup, including the use of specific datasets and the implementation of baseline models under identical conditions. However, the absence of a publicly accessible code repository limits the reproducibility of the results. Future work should consider making the code available to facilitate further research and validation.
While the ML-SAN model shows promising results, the paper acknowledges potential challenges in real-world applications, such as background noise and missing modalities. Additionally, the model's reliance on specific datasets may limit its generalizability to other contexts or languages. The authors should address these limitations in future iterations of their work.
The ability to accurately recognize emotions in conversations has significant implications for the development of empathetic AI systems. This research could enhance human-computer interaction in various applications, including virtual assistants, mental health support, and customer service. By improving emotion recognition, ML-SAN can contribute to more nuanced and effective communication between humans and machines.
Commercial TTS systems produce near-native Indic audio, but the best open-source bases (Chatterbox, Indic Parler-TTS, IndicF5) trail them on measured phonological dimensions, and the most widely adopted multilingual base (Chatterbox, 23 languages) does not even tokenise Telugu or Tamil. We ask: what is the minimum intervention that brings such a non-Indic-native base to commercial-class output on Telugu, Tamil, and Hindi, without training a new acoustic decoder and without any commercial TTS training data? We combine three pieces: (1) BUPS, a Brahmic Unified Phoneme Space that deterministically romanises seven Indic scripts to ISO-15919 so Chatterbox's Latin tokeniser can process them; (2) a LoRA adapter on only the text-token predictor (Chatterbox's t3), trained on ~1,220h of licensed Indic audio with a Hindi-proxy language_id; (3) a voice-prompt recovery recipe -- an 8-11s same-language reference clip plus three sampling overrides (exaggeration 0.7, temperature 0.6, min_p 0.1; "Config B") -- that recovers commercial-class acoustic output with no acoustic-decoder training. On Hindi, the LoRA regresses accuracy and we instead use vanilla Chatterbox + Config B, giving a two-branch deployment. Evaluated on 10-utterance pilot sets with the companion PSP benchmark, Praxy Voice matches or slightly leads commercial baselines: 26.7% retroflex collapse on Telugu (vs Sarvam Bulbul 33.3%), 71% Tamil-zha collapse (vs commercial trio's 86%), 0.025 LLM-WER on Hindi (tied with Cartesia Sonic-3). For intra-sentential code-mix we add a third branch (IndicF5 + native-script transliteration) that drops code-mix LLM-WER from 0.80-0.85 to 0.14-0.27 across Hi/Te/Ta. We release R6 LoRA weights (Apache-2.0), inference code and router (MIT), and a Gradio demo.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents a novel approach to adapting a frozen multilingual TTS model for Indic languages, demonstrating competitive performance against commercial systems while requiring minimal training data. The combination of BUPS, LoRA adaptation, and voice-prompt recovery represents a significant advancement in TTS technology, particularly for low-resource languages.
The methodology presented in the paper combines three innovative components: the Brahmic Unified Phoneme Space (BUPS) for romanisation of Indic scripts, a low-rank adaptation (LoRA) approach for the text-token predictor, and a voice-prompt recovery recipe that enhances acoustic output without retraining the acoustic decoder. This combination allows for effective adaptation of a frozen multilingual TTS model to support Indic languages, which is a significant advancement in TTS technology for low-resource languages. The approach is well-structured, addressing specific challenges in TTS for Indic languages and demonstrating a clear understanding of the limitations of existing systems.
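The BUPS idea can be illustrated with a toy character table; the real mapping covers seven scripts with vowel signs and conjuncts, so the three entries below are only indicative:

```python
# Toy fragment of a Brahmic-to-ISO-15919 romanisation table; the real
# BUPS mapping covers seven scripts plus vowel signs and conjuncts.
ISO_15919 = {
    "ट": "ṭa",  # Devanagari retroflex ta
    "त": "ta",  # Devanagari dental ta
    "ழ": "ḻa",  # Tamil retroflex approximant (the "zha" letter)
}

def romanise(text: str) -> str:
    """Deterministically map covered characters to ISO-15919 Latin,
    passing everything else through unchanged so the output stays
    consumable by a Latin tokenizer."""
    return "".join(ISO_15919.get(ch, ch) for ch in text)

print(romanise("ழ"))  # -> ḻa
```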
The experimental evaluation is robust, utilizing a companion benchmark for assessing phonological accuracy and intelligibility across three Indic languages. The results indicate that the proposed system performs competitively against commercial baselines, particularly in terms of retroflex collapse and other phonological metrics. The use of a 10-utterance pilot set allows for initial validation, although the small sample size may limit statistical significance. The paper effectively communicates the results, providing detailed comparisons with existing systems.
The authors have made significant efforts to ensure reproducibility by releasing the LoRA weights, inference code, and a demo interface. However, the reliance on specific datasets and the complexity of the methods may pose challenges for complete replication without access to the same resources. The paper includes sufficient detail on the methodology and experimental setup to allow for independent verification of results.
The paper acknowledges several limitations, including the small sample size for pilot evaluations, the lack of formal Mean Opinion Score (MOS) testing, and the challenges faced in adapting the acoustic decoder. Additionally, the performance on Hindi with the LoRA adapter regressed accuracy, indicating that the method's effectiveness may vary across languages. The authors also note that the current implementation relies on reference audio clips, which may limit flexibility in practical applications.
This research has the potential to significantly impact the development of TTS systems for low-resource languages, particularly in India, where many languages are underrepresented in commercial TTS solutions. By providing a method that requires minimal training data and computational resources, the work could democratize access to high-quality TTS technology for Indic languages, fostering greater inclusivity in technology. The open-source release of the model and code further enhances its potential for widespread adoption and further research.
This performance presents a duet between two intelligent musical instruments, Sù (to trace back; to go upstream) and Agentier (playing on agentic clavier), and their human performers, connected through feedback loops. Rather than treating AI as a tool that responds predictably to input, both systems operate recursively, where past actions continuously influence future behaviour. The Sù operates in the audio space through latent representation. Its performer uses Make Noise 0-series synthesisers and MIDI controllers to work with a neural feedback synthesis system based on a RAVE model, with a latent feedback loop embedded within the model's internal structure. This allows the instrument to remember and reuse its own internal states, influencing ongoing sound generation through its recent sonic history. The Agentier functions in the control space. Its performer interacts with the system using a Roland S-1 synthesiser and Keith McMillen QuNeo touchpad, where control gestures are routed into a recurrent neural network that feeds back into the synthesis process. Through this feedback loop, the system actively shapes the evolution of control signals over time. Contrasting feedback in the audio and control domains, the performance explores shared agency, resistance, and negotiation between humans and intelligent musical systems. Musical phenomena are co-produced through the entangled states of interaction, rather than through pre-existing system configuration or fixed mappings.
Primary: The Australian National University
All Institutions: The Australian National University
This paper presents a significant contribution to the field of AI in music by exploring the co-constructive relationship between human performers and intelligent musical instruments through innovative feedback mechanisms. The methodology is well-defined, though the lack of rigorous experimental evaluation and reproducibility details limits its impact.
The paper presents a novel approach to musical performance through the integration of AI in two intelligent musical instruments, Sù and Agentier. The methodology is well-articulated, detailing the use of a RAVE model for audio synthesis and a recurrent neural network for control signal generation. The recursive feedback mechanisms employed in both instruments are innovative, allowing for a dynamic interaction between the performer and the instrument, which enhances the creative process. The use of latent representations and direct manipulation of latent dimensions is particularly noteworthy, as it provides performers with greater control over the sonic output.
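To make the latent feedback idea concrete, here is a minimal sketch of a feedback loop wrapped around a streaming RAVE export (which exposes encode/decode methods); the file path, block shape, and blend coefficient are assumptions rather than details from the paper.

```python
import torch

# Assumes a streaming RAVE torchscript export exposing encode()/decode();
# the path and blend coefficient below are illustrative, not from the paper.
model = torch.jit.load("rave_export.ts")

alpha = 0.6        # balance between fresh input and remembered latent history (assumed)
z_prev = None      # the instrument's "memory" of its own internal states

def process_block(audio_block: torch.Tensor) -> torch.Tensor:
    """audio_block: (1, 1, n_samples). Returns resynthesized audio of the same shape."""
    global z_prev
    z = model.encode(audio_block)              # audio -> latent frames
    if z_prev is not None:
        z = alpha * z + (1 - alpha) * z_prev   # latent feedback inside the model's own space
    z_prev = z.detach()
    return model.decode(z)                     # latent frames -> audio
```

Because the feedback happens in latent space rather than on the raw signal, the instrument's recent sonic history colours new input without simply re-amplifying it.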
While the paper describes the performance setup and the interaction between the instruments, it lacks a comprehensive experimental evaluation with quantitative metrics. The authors mention a video documentation of a performance, which serves as a qualitative demonstration of their approach. However, there is no detailed analysis of the performance outcomes, such as audience reception or systematic comparisons with traditional instruments or other AI-enabled systems. Including listener-based metrics such as Mean Opinion Score (MOS) or other structured evaluations would strengthen the claims made.
The paper provides a clear description of the instruments and the technology used, which aids in reproducibility. However, specific implementation details, such as the exact configurations of the neural networks and the training datasets, are not sufficiently detailed. Additionally, the lack of a publicly available code repository limits the ability of other researchers to replicate the work fully.
One of the main limitations is the absence of a rigorous experimental evaluation framework to assess the performance of the instruments quantitatively. The reliance on qualitative descriptions and a single performance video may not provide a comprehensive understanding of the instruments' capabilities. Furthermore, the paper does not address potential issues related to latency in real-time performance, which could affect the interaction quality between the performer and the AI systems.
The integration of AI in musical performance has significant implications for the future of music creation and performance. This work encourages a rethinking of the role of the performer and the instrument, promoting a collaborative relationship that could lead to new forms of musical expression. The exploration of feedback loops and shared agency could inspire further research in both music technology and human-computer interaction, potentially influencing the design of future intelligent musical instruments.
Large Audio-Language Models show consistent performance gains across speech and audio benchmarks, yet high scores may not reflect true auditory perception. If a model can answer questions without processing the acoustic signal, the benchmark fails as a measure of auditory understanding. We present a diagnostic framework using two axes: text prior, which measures answerability from text and general knowledge alone, and audio reliance, which assesses actual dependency on the acoustic signal. Evaluating eight LALMs across three benchmarks, we find that models retain 60-72% of their full audio scores even without any audio input. Moreover, among items that require audio, only 3.0-4.2% need the complete audio clip; the majority can be resolved using localized fragments. These findings challenge the assumption that benchmark performance equals robust audio understanding, and we conclude with practical guidelines for improving evaluation reliability and benchmark design.
Primary: National Taiwan University
All Institutions: National Taiwan University, NTU Artificial Intelligence Center of Research Excellence
The paper presents a critical analysis of the reliance on audio in audio-language models, challenging existing benchmarks and proposing a framework for better evaluation. The methodology and findings are significant, offering valuable insights for researchers and practitioners in the field of machine learning and audio understanding.
The paper introduces a novel diagnostic framework that assesses large audio-language models (LALMs) based on two axes: text prior and audio reliance. This dual-axis approach allows for a nuanced understanding of how much of a model's performance can be attributed to textual cues versus actual audio processing. The methodology is well-structured, employing controlled settings to quantify the text prior and audio reliance, which is a significant advancement in evaluating LALMs. The use of multiple benchmarks and a variety of models strengthens the robustness of the findings.
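To illustrate the two axes, a hedged sketch of how they could be operationalized is shown below; `model.answer` and the item fields are placeholders, and the paper's exact scoring functions may differ.

```python
def text_prior(model, items):
    """Share of questions answered correctly with the audio withheld entirely."""
    hits = sum(model.answer(it.question, audio=None) == it.gold for it in items)
    return hits / len(items)

def audio_reliance(model, items):
    """Normalized gain from restoring the audio over the text-only score:
    how much of the remaining headroom the acoustic signal actually closes."""
    full = sum(model.answer(it.question, audio=it.audio) == it.gold for it in items)
    text_only = sum(model.answer(it.question, audio=None) == it.gold for it in items)
    return (full - text_only) / max(len(items) - text_only, 1)
```

Under this framing, the reported 60-72% retention without audio corresponds to a high text prior, which is exactly what makes raw benchmark accuracy a weak proxy for auditory understanding.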
The experiments are thorough, evaluating eight LALMs across three distinct benchmarks. The results indicate a substantial grounding gap, revealing that models can achieve high scores without audio input, which challenges the assumption of robust auditory understanding. The analysis of performance retention with partial audio is particularly insightful, providing a clear picture of how audio information is utilized by the models. However, the paper could benefit from more detailed statistical analysis to support its claims.
The paper provides a clear description of the experimental setup, including the models used and the evaluation protocols. However, it lacks specific URLs or repositories for code and data, which could hinder reproducibility. Including such resources would enhance the paper's impact and facilitate further research in this area.
One limitation is the reliance on existing benchmarks, which may not fully capture the complexities of audio understanding. Additionally, while the study identifies issues with current benchmarks, it does not propose new benchmarks or datasets, which could be a missed opportunity for advancing the field. The findings may also be limited by the specific models and benchmarks chosen for evaluation.
The findings have significant implications for the design of future audio-language benchmarks and the evaluation of LALMs. By highlighting the potential for models to rely on textual priors rather than genuine auditory understanding, the paper calls for a reevaluation of how auditory capabilities are assessed in machine learning. This could lead to more accurate and reliable evaluations, ultimately improving the development of models that genuinely understand audio.
Automatic speech recognition systems often produce confident yet incorrect transcriptions under noisy or ambiguous conditions, which can be misleading for both users and downstream applications. Standard evaluation based on Word Error Rate focuses solely on accuracy and fails to capture transcription reliability. We introduce an abstention-aware transcription framework that enables ASR models to explicitly abstain from uncertain segments. To evaluate reliability under abstention, we propose RAS, a reliability-oriented metric that balances transcription informativeness and error aversion, with its trade-off parameter calibrated by human preference. We then train an abstention-aware ASR model through supervised bootstrapping followed by reinforcement learning. Our experiments demonstrate substantial improvements in transcription reliability while maintaining competitive accuracy.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing
The paper presents a significant advancement in automatic speech recognition by introducing an abstention-aware framework and a novel reliability metric, RAS, which enhances the reliability of ASR outputs in uncertain conditions. The methodology is well-founded and the experimental results robustly support the proposed contributions, marking a meaningful step forward in the field of speech processing.
The paper introduces a novel abstention-aware transcription framework for ASR systems, which allows models to abstain from uncertain segments rather than producing potentially misleading transcriptions. The proposed Reliability-Aware Score (RAS) metric is innovative in that it scores transcriptions containing explicit abstention placeholders for uncertain segments, moving beyond accuracy-only metrics like Word Error Rate (WER). The methodology is well-structured, employing a two-stage training pipeline that combines supervised bootstrapping and reinforcement learning, effectively enhancing the model's reliability in challenging acoustic conditions.
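The digest does not reproduce the RAS formula, so the following is only a plausible shape for an abstention-aware score: correct tokens earn credit, abstentions are neutral, and errors are penalized by a trade-off weight standing in for the human-preference-calibrated parameter.

```python
def ras_like_score(hyp_tokens, ref_tokens, lam: float = 2.0, abstain: str = "<abstain>"):
    """
    Toy reliability score, not the paper's RAS: +1 per correct token,
    0 per abstained token, -lam per erroneous token, normalized by length.
    Assumes the hypothesis and reference are already aligned token-by-token.
    """
    score = 0.0
    for hyp, ref in zip(hyp_tokens, ref_tokens):
        if hyp == abstain:
            continue                 # abstention: uninformative but not misleading
        score += 1.0 if hyp == ref else -lam
    return score / max(len(ref_tokens), 1)
```

The key property any such metric needs is that a well-placed abstention scores better than a confident error, which is what WER alone cannot express.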
The experiments are comprehensive, utilizing two datasets (LibriSpeech and TALCS) to evaluate the proposed method under both clean and noisy conditions. The results demonstrate significant improvements in transcription reliability, particularly in adverse environments, while maintaining competitive accuracy. The use of human preference alignment for calibrating the RAS metric adds robustness to the evaluation process, ensuring that the proposed framework is grounded in real-world applicability.
The paper provides detailed descriptions of the methodology, including the training pipeline and experimental setup. However, there is a lack of supplementary material or code repositories that would facilitate complete reproducibility. The absence of a project URL limits the ability for other researchers to replicate the findings directly.
While the proposed framework shows promise, the reliance on human preference data for calibrating the RAS metric may introduce biases based on the specific population sampled. Additionally, the performance in highly diverse acoustic environments beyond those tested (e.g., different languages or dialects) remains unaddressed, which could limit the generalizability of the findings.
The approach has significant implications for high-stakes applications of ASR, such as medical and legal transcription, where reliability is critical. By providing a mechanism for models to indicate uncertainty, the framework can enhance user trust and improve decision-making processes in various domains. The introduction of RAS as a new evaluation metric could also pave the way for further research into reliable ASR systems.
We propose Speech Enhancement based on Drifting Models (DriftSE), a novel generative framework that formulates denoising as an equilibrium problem. Rather than relying on iterative sampling, DriftSE natively achieves one-step inference by evolving the pushforward distribution of a mapping function to directly match the clean speech distribution. This evolution is driven by a Drifting Field, a learned correction vector that guides samples toward the high-density regions of the clean distribution, which naturally facilitates training on unpaired data by matching distributions rather than paired samples. We investigate the framework under two formulations: a direct mapping from the noisy observation, and a stochastic conditional generative model from a Gaussian prior. Experiments on the VoiceBank-DEMAND benchmark demonstrate that DriftSE achieves high-fidelity enhancement in a single step, outperforming multi-step diffusion baselines and establishing a new paradigm for speech enhancement.
Primary: Victoria University of Wellington
All Institutions: Victoria University of Wellington, GN Audio A/S
The main contribution of this paper is the introduction of DriftSE, a novel generative framework for speech enhancement that reformulates denoising as an equilibrium problem, achieving high-fidelity results in a single inference step. This work represents a significant advancement in the field of speech enhancement, combining innovative methodology with robust experimental validation to address critical challenges in real-time applications.
The proposed method, DriftSE, innovatively formulates speech enhancement as an equilibrium problem, leveraging a learned Drifting Field for one-step inference. This approach diverges from traditional iterative sampling techniques, providing a significant computational advantage. The use of a semantic latent space for drift computation enhances the model's ability to capture complex speech structures, which is a notable improvement over existing methods. The dual formulation of the model—direct mapping and conditional generation—adds flexibility and robustness to the framework, allowing it to adapt to various scenarios, including unpaired training.
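As a schematic of the one-step training signal (not the authors' code), the sketch below regresses the mapper's output onto a drift-corrected target; `mapper` and `drift_field` are placeholder networks, and fitting the drift field itself is omitted.

```python
import torch

def drift_training_step(mapper, drift_field, noisy, optimizer):
    """
    Schematic only: nudge the one-step mapper's output distribution toward
    the clean-speech distribution via a drift-corrected regression target.
    The drift field is assumed pre-fit to point toward high-density clean
    regions (its own training is omitted here).
    """
    x = mapper(noisy)                      # one-step estimate of clean speech
    with torch.no_grad():
        target = x + drift_field(x)        # correction vector toward the clean distribution
    loss = torch.mean((x - target) ** 2)   # distribution-level pull; no paired reference needed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the target is built from the drift field rather than a paired clean utterance, this shape of objective is compatible with the unpaired-training property the review highlights.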
The experiments conducted on the VoiceBank-DEMAND benchmark and the DNS Challenge 2020 blind test set showcase the effectiveness of DriftSE in achieving high-fidelity speech enhancement. The reported metrics (PESQ, SI-SDR, SCOREQ) indicate that DriftSE outperforms both multi-step diffusion models and other one-step approaches, establishing its competitive edge. The thorough evaluation across different datasets and conditions demonstrates the model's generalization capabilities, which is crucial for real-world applications.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and hyperparameter settings, which are essential for reproducibility. However, the absence of a public code repository or demo URL limits the accessibility of the method for further validation by the research community.
While the DriftSE framework shows promising results, its reliance on a pre-trained self-supervised learning encoder may introduce limitations related to the quality and representativeness of the latent features. Additionally, the performance drop in unpaired settings suggests that the model may struggle in scenarios where clean-reference data is not available, highlighting a potential area for improvement.
The DriftSE framework has significant implications for real-time speech enhancement applications, particularly in environments with varying noise conditions. Its ability to perform one-step inference could facilitate deployment in low-latency scenarios, such as telecommunication and assistive technologies. Furthermore, the methodology could inspire future research in generative modeling and distribution matching across other domains beyond audio.
Automated movie creation requires coordinating multiple characters, modalities, and narrative elements across extended sequences, a challenge that existing end-to-end approaches struggle to address effectively. We present CineAGI, a hierarchical movie generation framework that decomposes this complex task through specialized multi-agent orchestration. Our framework employs three key innovations: (1) a multi-agent narrative synthesis module where specialized LLM agents collaboratively generate comprehensive cinematic blueprints with character profiles, scene descriptions, and cross-modal specifications; (2) a decoupled character-centric pipeline that maintains identity consistency through instance-level tracking and integration while enabling flexible multi-character composition; and (3) a hierarchical audio-visual synchronization mechanism ensuring frame-level alignment of dialogue, expressions, and music. Extensive experiments demonstrate that CineAGI achieves a 40% improvement in overall consistency, a 4.4% gain in subject consistency, a 5.4% enhancement in aesthetic quality, and 28.7% higher character consistency compared to baselines. Our work establishes a principled foundation for automated multi-scene video generation that preserves narrative coherence and character authenticity.
Primary: Nanjing University
All Institutions: Nanjing University, Zhejiang Sci-Tech University, University of British Columbia, Beijing Shuzhimei Technology Co., Ltd, Jilin University, Tianjin University
CineAGI represents a significant advancement in automated movie creation through its innovative multi-agent orchestration framework. The comprehensive methodology and substantial experimental validation establish it as a leading approach in the field, with the potential to reshape how narratives are crafted in digital media.
The methodology presented in CineAGI is robust and innovative, leveraging a hierarchical multi-agent orchestration approach to tackle the complex task of automated movie creation. The use of specialized LLM agents for narrative synthesis, character generation, and cinematographic synthesis is a significant advancement over traditional end-to-end models. The framework's ability to maintain character consistency and narrative coherence across scenes through decoupled processing and explicit synchronization mechanisms is particularly noteworthy. The detailed breakdown of each module and the integration of various generative models demonstrate a comprehensive understanding of the challenges in automated filmmaking.
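A toy sketch of the narrative-synthesis stage is shown below; the agent roles, prompts, and blueprint schema are guesses at the described structure, with `llm` standing in for any chat-completion call.

```python
def make_blueprint(llm, story_prompt: str) -> dict:
    """Illustrative multi-agent narrative synthesis: each specialized agent
    contributes one slice of the cinematic blueprint (roles/schema assumed)."""
    scenes = llm("Screenwriter agent: break this story into numbered scenes "
                 f"with settings and narrative beats:\n{story_prompt}")
    characters = llm("Casting agent: list each character with a stable visual "
                     f"and voice profile for:\n{story_prompt}")
    av_spec = llm("Synchronization agent: for each scene, specify dialogue "
                  f"timing, expression cues, and music:\n{scenes}")
    return {
        "scenes": scenes,             # scene descriptions
        "characters": characters,     # identity profiles reused across scenes
        "audio_visual_spec": av_spec, # frame-level cross-modal alignment spec
    }
```

The point of the decomposition is that downstream generators consume a shared, explicit blueprint, which is what lets the pipeline enforce character identity and audio-visual alignment across scenes.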
The experimental evaluation is thorough, utilizing a diverse benchmark of 100 story prompts across multiple genres to assess the framework's performance. The use of both quantitative metrics and qualitative human evaluations provides a well-rounded perspective on the system's effectiveness. The reported improvements in consistency and aesthetic quality are substantial, indicating that the proposed methods yield significant enhancements over existing baselines. However, the paper could benefit from more detailed comparisons with a wider range of contemporary methods to further contextualize its contributions.
The paper provides a detailed description of the experimental setup, including generation settings, evaluation metrics, and baseline comparisons. However, the lack of publicly available code or demo URLs limits reproducibility. Future work should consider releasing the implementation to facilitate further research and validation by the community.
One limitation of the study is the reliance on specific generative models, which may not generalize across all contexts or genres of filmmaking. Additionally, while the framework shows improvements in character consistency and narrative coherence, the complexity of the system may introduce challenges in real-time applications or scalability. The computational cost of approximately 11.3 minutes per scene on a single GPU could also be a barrier for broader adoption.
The implications of CineAGI extend beyond academic research into practical applications in the film and entertainment industry. By automating aspects of movie creation, this framework could democratize content production, enabling creators with limited resources to produce high-quality narratives. Furthermore, the integration of AI in creative processes raises questions about authorship and the role of human creativity in storytelling.
Recent large audio language models (LALMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet incur high inference costs. Token compression is an effective method that directly reduces redundant tokens in the sequence. Existing compression methods usually assume that all attention heads in LALMs contribute equally to various audio tasks and calculate token importance by averaging scores across all heads. However, our analysis demonstrates that attention heads exhibit distinct behaviors across diverse audio domains. We further reveal that only a sparse subset of attention heads actively responds to audio, and these heads behave markedly differently on semantic versus acoustic tasks. In light of this observation, we propose HeadRouter, a head-importance-aware token pruning method that perceives the varying importance of attention heads in different audio tasks to maximize the retention of crucial tokens. HeadRouter is training-free and can be applied to various LALMs. Extensive experiments on the AudioMarathon and MMAU-Pro benchmarks demonstrate that HeadRouter achieves state-of-the-art compression performance, exceeding the baseline model even when retaining only 70% of the audio tokens and achieving 101.8% and 103.0% of the vanilla average on Qwen2.5-Omni-3B and Qwen2.5-Omni-7B, respectively.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, DAIL Tech, Northeastern University, Sichuan University, Huazhong University of Science and Technology
The main contribution of this paper is the introduction of HeadRouter, a dynamic head-weight routing mechanism for audio token pruning in large audio language models, which significantly enhances performance and efficiency in processing diverse audio tasks. This work represents a meaningful advancement in the field of audio language models, addressing critical challenges in token management and model efficiency while maintaining high performance across various audio tasks.
The proposed HeadRouter method introduces a novel dynamic head-weight routing mechanism that adapts to the varying importance of attention heads in large audio language models (LALMs). This approach is innovative in its use of entropy-based selectivity scores and Gaussian soft mixing to create task-specific head-weight profiles. The training-free nature of the method allows it to be easily integrated into existing models without additional training overhead, which is a significant advantage for practical applications.
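A simplified sketch of head-importance-aware pruning follows; it replaces the paper's Gaussian soft mixing of task-specific profiles with a plain softmax over negative attention entropy, so it should be read as an approximation of the idea rather than the method itself.

```python
import torch

def prune_audio_tokens(attn: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """
    attn: (heads, queries, audio_tokens) attention weights, assumed renormalized
    over the audio span. Selective (low-entropy) heads get larger weight,
    replacing the uniform head average used by prior pruning methods.
    """
    probs = attn.mean(dim=1)                             # (heads, audio_tokens), avg over queries
    ent = -(probs * (probs + 1e-9).log()).sum(dim=-1)    # per-head attention entropy
    head_w = torch.softmax(-ent, dim=0)                  # selective heads dominate the vote
    token_scores = (head_w[:, None] * probs).sum(dim=0)  # head-importance-weighted token scores
    k = max(1, int(keep_ratio * probs.shape[-1]))
    return token_scores.topk(k).indices.sort().values    # keep top-k tokens, original order
```

Weighting heads this way is what lets the method stay training-free: the importance profile is read off the model's own attention statistics rather than learned.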
The experiments conducted on the AudioMarathon and MMAU-Pro benchmarks demonstrate the effectiveness of HeadRouter in outperforming existing token pruning methods across various audio tasks. The results indicate that the method not only maintains performance while aggressively pruning tokens but also adapts well to different audio contexts, showcasing its robustness. The comparative analysis with state-of-the-art methods further validates the proposed approach's superiority in managing token importance dynamically.
The paper provides a clear description of the methodology, including the routing mechanism and evaluation setup, which supports reproducibility. However, the lack of publicly available code or detailed implementation guidelines may hinder full reproducibility for other researchers.
One limitation is the reliance on pre-calibrated head-weight profiles, which may not generalize across all audio tasks or models. Additionally, while the method shows promise in reducing computational costs, the paper does not explore the implications of using HeadRouter in real-time applications or its impact on latency in practical deployments.
The implications of this research extend to various applications in audio processing, including speech recognition, music analysis, and multimodal systems. By improving the efficiency of LALMs, this work could facilitate more widespread adoption of advanced audio understanding technologies in real-time applications, enhancing user experiences in voice-interactive systems.
Machine generation of symbolic music and digital audio are hot topics, but relatively few digital musical instruments integrate generative AI. Present musical AI tools are not artist-centred and do not support experimentation or integration into musical instruments and practices. This work introduces an inexpensive generative AI instrument platform based on a single-board computer that connects via MIDI to other musical devices. The platform uses artist-collected datasets with models trained on a regular computer. This paper asks what the design space of intelligent musical instruments might look like when accessible and portable AI systems are available for artistic exploration. I contribute five examples of instruments created and tested through a two-year first-person artistic research process. These show that (re)mapping can replace retraining for discovering AI interaction, that fast input interleaving is a new co-creative strategy, that small-data AI models can be a transportable design resource, and that cheap hardware can lower barriers to inclusion. This work could enable artists to explore new interaction and performance schemes with intelligent musical instruments.
Primary: The Australian National University
All Institutions: The Australian National University
This paper presents a novel generative AI platform for intelligent musical instruments, emphasizing artist-centered design and small-data approaches. The comprehensive exploration of performance experiences and instrument development contributes valuable insights to the intersection of AI and music, highlighting the potential for innovative co-creative practices.
The methodology is grounded in a first-person artistic research approach, which is innovative in the context of generative AI in music. The use of small-data AI models trained on artist-collected datasets is a significant contribution, allowing for a more personalized and artist-centered exploration of generative AI in musical contexts. The paper effectively outlines the design and implementation of a generative AI platform that integrates with existing musical instruments, showcasing a practical application of AI in music performance. The iterative development of five distinct instruments provides a rich qualitative dataset for analysis.
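To ground the "(re)mapping can replace retraining" claim, here is a minimal sketch of rerouting MIDI controls to different model inputs at performance time while the trained model stays fixed; the mapping tables and parameter names are invented for illustration.

```python
# Remapping as a design move: change which controller drives which model input,
# rather than retraining the model. The mapping table itself becomes the
# artistic material. Parameter names and CC assignments are hypothetical.
MAPPINGS = {
    "dense":  {1: "pitch_offset", 2: "temperature", 3: "density"},
    "sparse": {1: "temperature", 2: "density", 3: "pitch_offset"},
}

def handle_cc(model_params: dict, mapping_name: str, cc_number: int, cc_value: int):
    """Route an incoming MIDI CC (0-127) to a model parameter under the active mapping."""
    target = MAPPINGS[mapping_name].get(cc_number)
    if target is not None:
        model_params[target] = cc_value / 127.0  # normalise to the model's expected range
```

Swapping `mapping_name` mid-performance changes the instrument's feel instantly, with no new training run, which is exactly the kind of low-cost exploration the platform is built for.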
The experiments conducted over two years of performance practice are well-documented, providing insights into the evolution of the instruments and their interactions with musicians. The author details the performance experiences and the adaptability of the instruments in various contexts, which adds depth to the evaluation. However, the paper lacks quantitative metrics for assessing the performance of the AI models, which could strengthen the evaluation of their effectiveness.
The implementation details are provided, including the use of Raspberry Pi and the open-source nature of the software, which enhances reproducibility. The availability of the project on GitHub allows others to replicate the setup and experiment with the platform. However, more detailed instructions on the configuration and training processes would further aid reproducibility.
The study is limited by its first-person perspective, which may not capture the full range of experiences from diverse musicians. Additionally, the exploration of model updates over time is not systematically addressed, which could provide further insights into the adaptability and longevity of the AI models in performance settings.
This work has the potential to democratize access to intelligent musical instruments by lowering the cost barrier and encouraging experimentation among artists. The findings could influence future designs of musical AI systems, promoting a shift towards artist-centered approaches in generative AI applications. The implications for HCI and music technology communities are significant, as the research opens new avenues for interaction and collaboration between humans and AI in creative practices.
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large-scale speech deepfake dataset tailored for RTC scenarios, termed RTCFake, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme-guided consistency learning (PCL) strategy that compels models to learn platform-invariant semantic structural representations. In this paper, the RTCFake dataset is divided into training, development, and evaluation sets. The evaluation set further includes both unseen RTC platforms and unseen complex noise conditions, thereby providing a more realistic and challenging evaluation benchmark for speech deepfake detection. Furthermore, the proposed PCL strategy achieves significant improvements in both cross-platform generalization and noise robustness, offering an effective and generalizable modeling paradigm. The RTCFake dataset is available at https://huggingface.co/datasets/JunXueTech/RTCFake.
Primary: unknown
All Institutions: unknown
The paper presents RTCFake, a novel dataset and a phoneme-guided consistency learning strategy for detecting speech deepfakes in real-time communication, addressing a critical gap in existing research. The methodology is innovative, and the experimental results demonstrate substantial improvements, making it a valuable contribution to the field of audio and speech processing.
The paper introduces a phoneme-guided consistency learning (PCL) strategy, which is a novel approach aimed at enhancing the robustness of speech deepfake detection in real-time communication scenarios. The proposed methodology effectively addresses the challenges posed by various distortions and codec compressions encountered in RTC environments. The dataset, RTCFake, is a significant contribution, as it is specifically designed for the complexities of real-time communication, which is often overlooked in existing literature.
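The PCL objective is only named in this digest; a hedged sketch of its likely shape is a detection loss plus a phoneme-level consistency term over the offline/online pairs, where the encoder interface, alignment, and weighting below are assumptions.

```python
import torch
import torch.nn.functional as F

def pcl_loss(encoder, offline_wav, online_wav, label, segs, lam: float = 1.0):
    """
    Detection loss + phoneme-level consistency between a paired offline/online
    utterance. `segs` is a list of (start, end) frame spans per phoneme
    (alignment assumed given); `label` is a (1,) long tensor (real/fake).
    """
    h_off, logits = encoder(offline_wav)          # (frames, dim), (classes,)
    h_on, _ = encoder(online_wav)
    det = F.cross_entropy(logits.unsqueeze(0), label)
    cons = torch.stack([
        1.0 - F.cosine_similarity(h_off[s:e].mean(0), h_on[s:e].mean(0), dim=0)
        for s, e in segs                          # pull phoneme reps together across platforms
    ]).mean()
    return det + lam * cons
```

The precise offline/online pairing the dataset provides is what makes such a consistency term trainable at all, since the same utterance is observed before and after each platform's processing chain.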
The authors provide a comprehensive evaluation of their proposed method using a large-scale dataset of approximately 600 hours of speech. The evaluation set includes both unseen RTC platforms and complex noise conditions, which enhances the realism of the testing environment. The reported improvements in cross-platform generalization and noise robustness are significant, indicating that the proposed method is effective in practical applications.
While the paper mentions the availability of the RTCFake dataset on Hugging Face, it lacks detailed implementation specifics regarding the PCL strategy and the models used. This omission could hinder reproducibility, as other researchers may struggle to replicate the results without clear guidance on the experimental setup.
One limitation is that the dataset may not encompass all possible real-time communication scenarios, potentially limiting the generalizability of the findings. Additionally, the paper does not address the computational efficiency of the proposed method, which is crucial for real-time applications.
The implications of this research are significant, as it addresses a pressing issue in the age of deepfake technology. The ability to detect speech deepfakes in real-time communication can have far-reaching effects on security, privacy, and trust in digital communications. The proposed dataset and methodology could serve as a foundation for future research in this area.