To preserve or not to preserve prosody is a central question in voice anonymization. Prosody conveys meaning and affect, yet is tightly coupled with speaker identity. Existing methods either discard prosody for privacy or lack a principled mechanism to control the utility-privacy trade-off, operating at fixed design points. We propose DiffAnon, a diffusion-based anonymization method with classifier-free guidance (CFG) that provides explicit, continuous inference-time control over prosody preservation. DiffAnon refines acoustic detail over semantic embeddings of an RVQ codec, enabling smooth interpolation between anonymization strength and prosodic fidelity within a single model. To the best of our knowledge, it is the first voice anonymization framework to provide structured, interpolatable inference-time prosody control. Experiments demonstrate structured trade-off behavior, achieving strong utility while maintaining competitive privacy across controllable operating points.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Center for Language and Speech Processing, Human Language Technology Center of Excellence (COE)
The main contribution of this paper is the introduction of DiffAnon, a diffusion-based voice anonymization framework that enables explicit and continuous control over prosody preservation, significantly advancing the field of privacy-preserving speech technologies. This work represents a meaningful step forward in balancing the utility-privacy trade-off in voice applications, showcasing the potential for structured prosody control in enhancing both privacy and expressiveness in anonymized speech.
The proposed methodology, DiffAnon, leverages a novel diffusion-based framework with classifier-free guidance to provide continuous control over prosody preservation in voice anonymization. This approach is innovative as it allows for the modulation of the utility-privacy trade-off in a structured manner, which is a significant advancement over existing methods that operate at fixed points. The integration of semantic embeddings from an RVQ codec with a diffusion model is particularly noteworthy, as it combines strengths from both domains to enhance the quality of anonymized speech.
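As a concrete illustration of how a single prosody guidance weight can interpolate between anonymization strength and prosodic fidelity at inference time, here is a minimal classifier-free-guidance sketch; the model signature and argument names are hypothetical and not taken from the authors' implementation.

```python
import torch

def prosody_guided_eps(model, x_t, t, content_cond, prosody_cond, w_prosody):
    """One hypothetical denoising step with an explicit prosody guidance weight.

    model(x_t, t, content, prosody) -> predicted noise. Passing prosody=None
    stands in for the prosody-dropped (unconditional) branch used by CFG.
    """
    eps_uncond = model(x_t, t, content_cond, prosody=None)        # prosody dropped
    eps_cond = model(x_t, t, content_cond, prosody=prosody_cond)  # source prosody kept
    # w_prosody = 0 ignores source prosody (stronger anonymization);
    # larger w_prosody pulls the sample toward the source prosody (higher fidelity).
    return eps_uncond + w_prosody * (eps_cond - eps_uncond)
```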
The experiments are robust, utilizing the VoicePrivacy Challenge 2024 protocol, which provides a standardized framework for evaluating privacy and utility. The results demonstrate that DiffAnon achieves competitive performance across various metrics, including EER for privacy and WER for content preservation, while also showing a clear trade-off between privacy and prosodic fidelity. The systematic evaluation across different prosody guidance weights adds depth to the findings.
The authors have made their code and pretrained models publicly available, which is a strong point for reproducibility. The detailed training and inference setup, including hyperparameters and datasets used, further supports replicability of the results.
While the paper presents a significant advancement, it does not explore the potential impact of varying speaker characteristics on the performance of the model. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other languages or dialects. The paper also does not address the computational costs associated with training and deploying the model in real-world applications.
The ability to anonymize voice while preserving prosody has significant implications for privacy in various applications, including telecommunication, virtual assistants, and voice-based interactions. This work could enhance user trust in voice technologies by providing a means to protect identity while maintaining communicative effectiveness. The structured control over prosody could also lead to advancements in emotional speech synthesis and human-computer interaction.
We introduce a toolkit for uncovering spurious correlations between recording characteristics and target class in speech datasets. Spurious correlations may arise due to heterogeneous recording conditions, a common scenario for health-related datasets. When present in both the training and test data, these correlations result in an overestimation of system performance -- a dangerous situation, especially in high-stakes applications where systems are required to satisfy minimum performance requirements. Our toolkit implements a diagnostic method based on the detection of the target class using only the non-speech regions in the audio. Better-than-chance performance at this task indicates that information about the target class can be extracted from the non-speech regions, flagging the presence of spurious correlations. The toolkit is publicly available for research use.
Primary: Instituto de Investigación en Ciencias de la Computación
All Institutions: Instituto de Investigación en Ciencias de la Computación, Departamento de Computación, Facultad de Ciencias Exactas y Naturales, Facultad de Medicina, Centro de Neurociencias Cognitivas, Universidad de Chile, Universidad de San Andrés
The paper introduces a novel toolkit for detecting spurious correlations in speech datasets, addressing a critical issue in machine learning applications. The technical contributions and methodology are well-articulated, providing valuable insights into the reliability of speech-based models, particularly in high-stakes scenarios.
The methodology presented in the paper is robust and well-structured, focusing on the detection of spurious correlations in speech datasets. The authors introduce a systematic approach that leverages non-speech regions of audio to diagnose potential biases in datasets, which is a significant advancement in ensuring the reliability of machine learning models in high-stakes applications. The toolkit's design, which includes careful selection of voice-activity detection systems and feature extraction methods, demonstrates a thorough understanding of the challenges posed by spurious correlations.
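The diagnostic the toolkit implements can be pictured with a short sketch: predict the target class using only features from non-speech regions and check whether cross-validated performance beats chance. The function and feature choices below are illustrative, not the toolkit's actual code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def nonspeech_leakage_check(nonspeech_features, labels, n_folds=5):
    """Cross-validated AUC of a simple classifier that sees ONLY non-speech
    regions; values well above 0.5 flag spurious recording-condition cues."""
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, nonspeech_features, labels,
                             cv=n_folds, scoring="roc_auc")
    return scores.mean(), scores.std()

# Toy usage with random features: the AUC should hover near 0.5 (no leakage).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 40))    # e.g. pooled features over VAD-detected silences
y = rng.integers(0, 2, size=200)  # target class (e.g. patient vs. control)
print(nonspeech_leakage_check(X, y))
```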
The experiments conducted on two Alzheimer's disease speech datasets are comprehensive and well-executed. The authors provide a detailed analysis of the performance of their method against various configurations, including different feature extraction techniques and VAD systems. The use of statistical significance testing adds rigor to their findings, although the reliance on specific datasets may limit generalizability.
The paper offers a clear description of the experimental setup and the toolkit's implementation, which is publicly available on GitHub. This enhances reproducibility, as other researchers can apply the same methods to their datasets. However, the paper could benefit from more detailed instructions on how to utilize the toolkit effectively.
One limitation of the study is the potential overfitting to the specific datasets used for evaluation, which may not represent the broader spectrum of speech datasets. Additionally, while the toolkit addresses spurious correlations, it does not provide solutions for all possible biases that may arise in speech data collection.
The implications of this research are significant, particularly in the context of health-related machine learning applications where spurious correlations can lead to harmful consequences. The toolkit can serve as a critical resource for researchers and practitioners in the field, promoting more reliable and ethical use of speech datasets in machine learning.
Cross-lingual speaker verification suffers from severe language-speaker entanglement. This causes systematic degradation in the hardest scenario: correctly accepting utterances from the same speaker across different languages while rejecting those from different speakers sharing the same language. Standard adversarial disentanglement degrades speaker discriminability; blind discriminators inadvertently penalize speaker-discriminative traits that merely correlate with language. To address this, we propose Dual-LoRA, injecting trainable task-factorized LoRA adapters into a frozen pre-trained backbone. Our core innovation is a Language-Anchored Adversary: by grounding the discriminator with an explicit language branch, adversarial gradients target true linguistic cues rather than arbitrary correlations, preserving essential speaker characteristics. Evaluated on the TidyVoice benchmark, our system achieves a 0.91% validation EER and achieves 3rd place in the official challenge.
Primary: Nanjing University
All Institutions: Nanjing University, AISpeech Co, Jiangsu Key Lab of Language Computing, MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Soul AI Lab
The paper presents Dual-LoRA, an innovative framework for cross-lingual speaker verification that effectively disentangles language and speaker identity, achieving notable performance improvements on benchmark evaluations. The comprehensive methodology and rigorous experimental validation contribute significantly to the field, addressing a critical challenge in speaker verification systems.
The methodology presented in the paper is innovative, particularly in its use of Dual-LoRA, which introduces a parameter-efficient approach to disentangle language and speaker identity in cross-lingual speaker verification. The architecture's design, which incorporates two parallel LoRA streams and a Language-Anchored Adversary, is well-justified and addresses key challenges in the field. The decision to keep the backbone frozen while adapting only the LoRA modules is a strategic choice that enhances the model's efficiency and effectiveness.
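A minimal sketch of the gradient-reversal mechanism behind the Language-Anchored Adversary is shown below; the module layout and dimensions are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negates (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class LanguageAnchoredAdversary(nn.Module):
    """Hypothetical adversary: a language classifier fed the speaker embedding
    through gradient reversal, so its training signal pushes the embedding to
    become language-uninformative while the classifier itself stays informed."""
    def __init__(self, emb_dim=256, n_languages=4, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.head = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(),
                                  nn.Linear(128, n_languages))

    def forward(self, speaker_emb):
        return self.head(GradReverse.apply(speaker_emb, self.lamb))

# Training adds a language cross-entropy on these logits to the speaker loss;
# only the LoRA adapters and small heads receive gradients, the backbone stays frozen.
```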
The experiments conducted on the TidyVoice benchmark are robust, with a clear focus on evaluating the proposed framework against established baselines. The use of multiple backbones and the systematic analysis of different configurations provide strong evidence for the effectiveness of the Dual-LoRA approach. The reported results, including the significant reduction in EER, particularly in challenging scenarios, underscore the practical impact of the proposed method.
The paper provides sufficient implementation details, including the architecture, training procedures, and hyperparameters, which facilitate reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for others to replicate the results independently.
One notable limitation is the reliance on a single benchmark dataset (TidyVoice) for evaluation, which may not fully capture the generalizability of the proposed method across diverse real-world scenarios. Additionally, while the paper addresses the issue of language-speaker entanglement, it does not explore potential biases that may arise from the training data or the implications of using specific languages.
The proposed Dual-LoRA framework has the potential to significantly enhance cross-lingual speaker verification systems, making them more effective for applications in voice authentication and personalization across different languages. This advancement could lead to broader adoption of voice-based technologies in multilingual contexts, improving accessibility and user experience.
Autoregressive (AR) models with diffusion heads have recently achieved strong text-to-audio performance, yet their iterative decoding and multi-step sampling process introduce substantial inference latency. To address this bottleneck, we propose a one-step sampling framework that combines an energy-distance training objective with representation-level distillation. An energy-scoring head maps Gaussian noise directly to audio latents in one step, eliminating the need for a costly recursive diffusion sampling process, while distillation from a masked autoregressive (MAR) text-to-audio model preserves the strong conditioning learned during diffusion training. On the AudioCaps benchmark, our method consistently outperforms prior one-step baselines such as ConsistencyTTA, SoundCTM, AudioLCM, and AudioTurbo on both objective and subjective metrics, while substantially narrowing the quality gap to AR diffusion systems with multi-step sampling. Compared to the state-of-the-art AR diffusion system, IMPACT, our approach achieves up to $8.5$x faster batch inference with highly competitive audio quality. These results demonstrate that combining energy-distance training with representation-level distillation provides an effective recipe for fast, high-quality text-to-audio synthesis.
Primary: Amazon AGI
All Institutions: Amazon AGI, National Taiwan University
The paper presents a significant advancement in efficient generative media by introducing a one-step sampling framework that achieves substantially faster inference while maintaining high audio fidelity and semantic relevance. The innovative combination of energy-distance training and representation distillation represents a meaningful contribution to the field of machine learning, particularly in audio generation.
The proposed methodology introduces a novel one-step sampling framework for text-to-audio generation that integrates an energy-distance training objective with representation-level distillation. This approach effectively reduces inference latency while maintaining audio quality, addressing a significant limitation in existing autoregressive models that rely on multi-step sampling. The use of energy-scoring to map Gaussian noise directly to audio latents is innovative and demonstrates a clear departure from traditional diffusion-based methods. The incorporation of distillation from a masked autoregressive model further enhances the model's performance, showcasing a thoughtful combination of techniques to achieve rapid and high-quality audio synthesis.
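For intuition about the energy-distance objective used for one-step generation, here is a generic empirical energy-distance estimator and how it might score single-step latents against reference latents; this is a textbook estimator with hypothetical function names, not the paper's exact loss.

```python
import torch

def energy_distance(x, y):
    """Empirical energy distance between two batches of latents, shape (B, D)."""
    d_xy = torch.cdist(x, y).mean()
    d_xx = torch.cdist(x, x).mean()
    d_yy = torch.cdist(y, y).mean()
    return 2.0 * d_xy - d_xx - d_yy

def one_step_energy_loss(generator, text_cond, ref_latents):
    """Single forward pass from noise to latents, with no iterative sampling,
    scored against reference (e.g. teacher) latents via the energy distance."""
    noise = torch.randn_like(ref_latents)
    gen_latents = generator(noise, text_cond)
    return energy_distance(gen_latents, ref_latents)
```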
The experimental evaluation is comprehensive, utilizing the AudioCaps benchmark for both objective and subjective assessments. The paper reports consistent improvements over existing one-step baselines, with significant gains in fidelity and semantic relevance as measured by various metrics (FD, FAD, KL, IS, CLAP). The results demonstrate not only superior performance compared to prior models but also a substantial reduction in inference time, achieving up to 8.5 times faster batch inference than the state-of-the-art AR diffusion system, IMPACT. The thoroughness of the experiments, including ablation studies on representation distillation and classifier-free guidance, adds credibility to the findings.
The paper provides detailed descriptions of the experimental setup, including datasets, model configurations, and evaluation metrics, which contribute to reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability of other researchers to replicate the results directly. Clear documentation of hyperparameters and training procedures is essential for future work in this area.
While the proposed method shows promising results, it still falls short of the audio quality achieved by multi-step diffusion models, indicating that there may be inherent trade-offs between speed and fidelity. The reliance on a single sampling step may also limit the model's flexibility in generating more complex audio sequences. Additionally, the paper does not address potential biases in the training datasets, which could affect the generalizability of the model.
The advancements in low-latency text-to-audio generation have significant implications for real-time applications in multimedia content creation, interactive media, and personalized audio experiences. The ability to generate high-quality audio quickly opens up new avenues for user engagement and creative expression. Furthermore, the integration of energy-distance training and representation distillation could inspire future research in other generative tasks across different modalities.
In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both the temporal and non-temporal capabilities of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuChoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.
Primary: Fudan University
All Institutions: Fudan University, ByteDance
The main contribution of this paper is the introduction of GaMMA, a large multimodal model that effectively integrates temporal and non-temporal music understanding, alongside the establishment of MusicBench as a comprehensive evaluation benchmark. This work represents a significant advancement in the field of music AI, addressing critical gaps in existing models and providing a robust framework for future research.
The methodology presented in GaMMA is robust, utilizing a dual-encoder architecture that effectively captures both temporal and non-temporal aspects of music understanding. The mixture-of-experts approach, combined with a three-stage training strategy (pretraining, supervised fine-tuning, and reinforcement learning), is innovative and addresses existing gaps in music LMMs. The introduction of MusicBench as a comprehensive benchmark for evaluating music understanding adds significant value to the methodology, allowing for a nuanced assessment of model capabilities.
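The mixture-of-experts fusion over audio encoders can be pictured with a small sketch in which a gating network softly routes between a temporal expert and a global expert before projection into the language model's embedding space; all module names, dimensions, and the routing scheme here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioEncoderMoE(nn.Module):
    """Hypothetical soft mixture over two audio experts feeding an LLM decoder."""
    def __init__(self, temporal_enc, global_enc, feat_dim=1024, llm_dim=4096):
        super().__init__()
        self.temporal_enc = temporal_enc      # expert for time-series tasks
        self.global_enc = global_enc          # expert for non-time-series tasks
        self.gate = nn.Linear(feat_dim, 2)    # per-frame routing weights
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, audio):
        h_t = self.temporal_enc(audio)        # (B, T, feat_dim)
        h_g = self.global_enc(audio)          # (B, T, feat_dim)
        w = torch.softmax(self.gate(h_t + h_g), dim=-1)
        fused = w[..., :1] * h_t + w[..., 1:] * h_g
        return self.proj(fused)               # audio tokens for the language model
```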
The experiments conducted demonstrate the effectiveness of GaMMA, achieving state-of-the-art results on multiple benchmarks, including MusicBench and MuChoMusic. The extensive evaluation across various dimensions of music understanding, including temporal reasoning and global attributes, showcases the model's capabilities. The use of human-curated questions in MusicBench enhances the credibility of the results, though the paper could benefit from more extensive comparisons with a wider range of existing models.
The paper provides detailed implementation specifics, including training strategies, hyperparameters, and data curation processes, which are essential for reproducibility. However, the absence of publicly available code or datasets limits the ability for independent verification of results.
One limitation is the reliance on curated datasets, which may introduce biases or limit the generalizability of the model. Additionally, while the dual-encoder approach is innovative, it may require significant computational resources, which could hinder accessibility for broader research applications.
GaMMA has the potential to significantly impact the field of music understanding and multimodal AI by providing a framework that can be adapted for various applications, such as music recommendation systems, educational tools, and interactive music assistants. Its ability to understand and reason about music in a nuanced manner could lead to advancements in how machines interact with human creativity and cultural expressions.
A speaker encoder used in multilingual voice cloning should treat the same speaker identically regardless of which script the audio was uttered in. Off-the-shelf encoders do not, and the failure is accent-conditional. On a 1043-pair Western-accented voice corpus across English, Hindi, Telugu, and Tamil, WavLM-base-plus-sv loses 0.082 absolute cosine similarity when the same voice changes script and ECAPA-TDNN loses 0.105. On a 1369-pair Indian-accented voice corpus, the gap shrinks to 0.006 (WavLM-SV) and 0.044 (ECAPA-TDNN). The leak is largest where it matters most for cross-script TTS: when a system projects a non-Indic-trained voice into Indic scripts. We present LASE (Language-Adversarial Speaker Encoder), a small projection head over frozen WavLM-base-plus trained with two losses: a supervised contrastive loss over voice identity, and a gradient-reversal cross-entropy against a 4-language classifier that pushes the embedding to be language-uninformative while remaining speaker-informative. Trained on 1118 quality-gated cross-script pairs synthesised from 8 commercial multilingual voices, LASE's residual gap is consistent with zero on both corpora (Delta = 0.013 Western, Delta = 0.026 Indian; both bootstrap 95% CIs include zero) and amplifies the cross-script-vs-floor margin 2.4-2.7x over both baselines. An ECAPA+GRL ablation shows the GRL objective improves either backbone but the WavLM choice contributes too. In synthetic multi-speaker diarisation, LASE matches ECAPA-TDNN on cross-script speaker recall (0.788 vs 0.789) with ~100x less training data. We release the r1 checkpoint, both corpora, and the bootstrap recipe.
Primary: Praxel Ventures
All Institutions: Praxel Ventures
The paper presents LASE, a novel approach to cross-script identity preservation in multilingual voice cloning, demonstrating significant advancements in disentangling language from speaker identity and providing valuable resources for future research. The methodology and results contribute meaningfully to the field of audio processing and speaker recognition, particularly in the context of Indic languages.
The paper introduces a novel approach using a Language-Adversarial Speaker Encoder (LASE) that effectively disentangles language from speaker identity in multilingual voice cloning tasks. The methodology employs a gradient-reversal layer and a supervised contrastive loss to create a speaker embedding that is invariant to language, which is a significant advancement in the field. The architecture is well-defined, consisting of a frozen WavLM-base-plus backbone and a trainable projection head, which allows for efficient training and effective performance on cross-script tasks.
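The supervised contrastive part of LASE's objective, where positives are clips of the same voice (possibly uttered in different scripts), can be sketched as follows; this is a generic SupCon-style loss with illustrative names, not the released training code. In the paper this term is paired with a gradient-reversal cross-entropy against a 4-language classifier.

```python
import torch
import torch.nn.functional as F

def supcon_voice_loss(embeddings, voice_ids, temperature=0.07):
    """Supervised contrastive loss over voice identity. embeddings: (N, D),
    voice_ids: (N,) integer identities; same-voice pairs are positives."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (voice_ids.unsqueeze(0) == voice_ids.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))          # never contrast with self
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)          # keep self terms finite
    per_anchor = -(log_prob * pos_mask).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return per_anchor[pos_mask.any(dim=1)].mean()            # anchors with >= 1 positive
```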
The experiments are robust, utilizing two distinct corpora to evaluate the performance of LASE against established baselines (WavLM-base-plus-sv and ECAPA-TDNN). The results demonstrate a significant reduction in the identity gap across scripts, with LASE achieving a gap of 0.013 compared to 0.082 and 0.105 for the baselines. The paper also includes a thorough analysis of the training dynamics and presents a synthetic multi-speaker diarisation benchmark, showing that LASE can match ECAPA-TDNN's performance with significantly less training data.
The authors provide a comprehensive set of resources, including the model weights, training corpus, and evaluation scripts, which enhances reproducibility. The detailed description of the training process, loss functions, and hyperparameters further supports the ability of other researchers to replicate the results.
The study relies solely on synthetic data generated by ElevenLabs, which may not fully capture the complexities of natural human speech. Additionally, the held-out set shares voices with the training data, limiting the generalization assessment. The paper also acknowledges that the model's performance on real-world data and new voices remains to be evaluated.
The implications of this work are significant for applications in multilingual voice cloning, speaker verification, and diarisation systems, particularly in contexts involving Indian languages. The ability to maintain speaker identity across different scripts can enhance user experience in customer support, content creation, and accessibility technologies.
Recent advances in multimodal generation have enabled high-quality audio generation from silent videos. Practical applications, such as sound production, demand not only the generated audio but also explicit sound event labels detailing the type and timing of sounds. One straightforward approach is to apply a standard sound event detector to the generated audio. However, this post-hoc pipeline is inherently limited, as it is prone to error accumulation. To address this limitation, we propose MMAudio-LABEL (LAtent-Based Event Labeling), an event-aware audio generation framework that uses a foundational audio generation model as its backbone and jointly generates audio and frame-aligned sound event predictions from silent videos. We evaluate our method on the Greatest Hits dataset for onset detection and 17-class material classification. Our approach improves onset-detection accuracy from 46.7% to 75.0% and material-classification accuracy from 40.6% to 61.0% over baselines. These results suggest that jointly learning audio generation and event prediction enables more interpretable and practical video-to-audio synthesis.
Primary: Sony Group Corporation
All Institutions: Sony Group Corporation, Sony AI
The paper presents MMAudio-LABEL, a novel framework for joint audio generation and event labeling from silent videos, demonstrating significant improvements over existing methods. The technical contributions and methodology are well-articulated, showcasing the potential for broader applications in multimedia content creation and multimodal learning.
The proposed MMAudio-LABEL framework innovatively combines audio generation with event labeling in a unified architecture, addressing the limitations of traditional post-hoc sound event detection methods. By leveraging a multimodal transformer and exploring two distinct architectures (Parallel Heads and Joint Heads), the authors demonstrate a thoughtful approach to integrating visual and auditory information. The methodology is well-structured, with clear explanations of the model architecture and training objectives, although further details on the training data preprocessing and augmentation strategies could enhance clarity.
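A minimal sketch of the "parallel heads" idea, in which a shared backbone representation feeds both an audio head and a frame-aligned event head trained with a joint loss, is given below; the dimensions, event-label encoding, and loss weighting are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelHeads(nn.Module):
    """Hypothetical joint audio-generation / event-labeling heads on one backbone."""
    def __init__(self, d_model=512, audio_latent_dim=64, n_classes=18):
        super().__init__()
        self.audio_head = nn.Linear(d_model, audio_latent_dim)  # audio latents per frame
        self.event_head = nn.Linear(d_model, n_classes)         # e.g. 17 materials + "none"

    def forward(self, hidden):                   # hidden: (B, T, d_model)
        return self.audio_head(hidden), self.event_head(hidden)

def joint_loss(audio_pred, audio_target, event_logits, event_labels, lam=0.5):
    gen = F.mse_loss(audio_pred, audio_target)   # stand-in for the generation objective
    evt = F.cross_entropy(event_logits.flatten(0, 1), event_labels.flatten())
    return gen + lam * evt
```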
The experiments are robust, utilizing the Greatest Hits dataset to evaluate both onset detection and material classification. The reported improvements in accuracy metrics (from 46.7% to 75.0% for onset detection and from 40.6% to 61.0% for material classification) provide compelling evidence of the framework's effectiveness. However, the paper could benefit from additional comparative analyses against a wider range of baseline models to contextualize the performance gains further.
The implementation details are adequately described, including model architecture, training parameters, and evaluation metrics. However, the absence of a publicly available code repository or demo limits reproducibility. Providing access to the trained models or code would significantly enhance the paper's impact and usability for the research community.
One notable limitation is the reliance on a specific dataset (Greatest Hits), which may not fully represent the diversity of audio events in real-world scenarios. Additionally, the model's performance on less distinctive materials indicates potential challenges in generalization. The paper could also discuss the computational complexity and resource requirements of the proposed framework.
The MMAudio-LABEL framework has significant implications for content creation, immersive media, and human-computer interaction, as it enables more intuitive sound event labeling from silent videos. This could streamline workflows in various industries, including film production and gaming, where accurate audio representation is crucial. The integration of audio generation and event labeling also opens avenues for future research in multimodal learning and generative models.
Dance serves as both a cultural cornerstone and a medium for personal expression, yet the rapid growth of online dance content has made personalized discovery increasingly difficult. Text-based dance retrieval offers a natural interface for users to search with choreographic intent, but it remains underexplored because dance requires simultaneous reasoning over linguistic semantics, musical rhythm, and full-body motion dynamics. We introduce TD-Data, a large-scale open dataset for text-dance retrieval, containing about 4,000 12-second dance clips, 14.6 hours of motion, 22 genres, and annotations from professional dance experts. On top of this dataset, we propose CustomDancer, a multimodal retrieval framework that aligns text with dance through a CLIP-based text encoder, music and motion encoders, and a music-motion blending module. CustomDancer achieves state-of-the-art performance on TD-Data, reaching 10.23% Recall@1 and improving retrieval quality in both quantitative benchmarks and user preference studies.
Primary: South-Central Minzu University
All Institutions: South-Central Minzu University
The main contribution of this paper is the introduction of CustomDancer, a multimodal framework for text-dance retrieval, and the TD-Data dataset, which together advance the state-of-the-art in dance content discovery. The comprehensive methodology, rigorous experimental evaluation, and acknowledgment of limitations underscore the significance of this work in the intersection of machine learning and the performing arts.
The methodology is robust, introducing a novel multimodal retrieval framework (CustomDancer) that effectively combines text, music, and motion through a well-structured architecture. The use of a CLIP-based text encoder alongside dedicated music and motion encoders is innovative, allowing for a more nuanced understanding of dance retrieval. The music-motion blending module is particularly noteworthy as it captures the interaction between music and motion, which is crucial for dance. The construction of the TD-Data dataset with expert annotations adds significant value, providing a solid foundation for training and evaluation.
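A compact sketch of how the music-motion blending and text alignment could be trained with a symmetric contrastive objective is shown below; the blending weight and loss form are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def text_dance_contrastive_loss(text_emb, music_emb, motion_emb, alpha=0.5, temp=0.07):
    """Blend music and motion embeddings into a dance embedding, then align it
    with the text embedding via a symmetric InfoNCE (CLIP-style) loss."""
    dance = F.normalize(alpha * music_emb + (1.0 - alpha) * motion_emb, dim=1)
    text = F.normalize(text_emb, dim=1)
    logits = text @ dance.t() / temp
    targets = torch.arange(text.size(0), device=text.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# At retrieval time, clips are ranked by cosine similarity between the query
# text embedding and each clip's blended dance embedding; Recall@K is computed
# over this ranking.
```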
The experiments are comprehensive, utilizing multiple evaluation metrics (Recall@K, Median Rank, Mean Rank) that are appropriate for the task. The comparison with strong baselines demonstrates the effectiveness of CustomDancer, and the user study adds a qualitative dimension to the evaluation, confirming that the model aligns well with human judgments. The ablation studies provide insights into the contributions of different components of the model, reinforcing the importance of temporal modeling and feature fusion.
The paper provides detailed implementation details, including the architecture of the encoders and the training objectives. However, the lack of a publicly available code repository or dataset could hinder reproducibility. Future work should consider releasing the code and dataset to facilitate further research in this area.
The paper acknowledges several limitations, including challenges with specialized terminology, conflicts between visual motion and musical affect, and potential performer bias. These factors can impact retrieval accuracy and user satisfaction. Additionally, the dataset's focus on 3D motion and music may overlook important visual elements like costumes and facial expressions.
The work has the potential to significantly impact the fields of dance education, choreography, and creative recommendation systems. By making dance retrieval more accessible, it can facilitate learning and exploration of diverse dance styles. However, the authors emphasize the need for cultural sensitivity in dataset construction and application, highlighting the importance of preserving the context and community significance of dance styles.
To address the limitations of existing Generative Fixed-Filter Active Noise Control (GFANC) methods, which rely on filter decomposition and recombination and require supervised learning with labeled data, this paper proposes a Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework. Unlike previous approaches that predict combination weights of sub control filters, the proposed method directly generates control filters in an unsupervised manner by integrating the co-processor and real-time controller into a fully differentiable ANC system, where the accumulated error signal is used as the training objective. By abandoning the decomposition--reconstruction process, the proposed design simplifies the control pipeline and avoids error accumulation, while the Transformer architecture effectively captures global and dynamic noise characteristics through its attention mechanism. Numerical simulations on real-recorded noises demonstrate that the proposed method achieves improved noise reduction performance and adaptability to different types of noises compared with the original GFANC framework.
Primary: unknown
All Institutions: unknown
The paper presents a novel Transformer-based framework for active noise control that simplifies the filter generation process and improves adaptability to real-world noise conditions. This work is significant as it combines advanced neural architectures with practical applications in noise cancellation, potentially leading to enhanced performance in diverse acoustic environments.
The proposed Transformer-based End-to-End Control-Filter Generation (E2E-CFG) framework represents a significant methodological advancement in active noise control (ANC) by integrating a Transformer architecture for direct control-filter generation. This approach eliminates the need for sub-filter decomposition and recombination, which simplifies the control pipeline and enhances adaptability to varying noise conditions. The unsupervised training paradigm, which relies on minimizing the accumulated residual error, is innovative as it reduces the dependency on labeled data, a common limitation in many machine learning applications. The use of a differentiable ANC system allows for end-to-end training, which is a notable strength of the methodology.
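The unsupervised objective can be sketched as a differentiable simulation: convolve the generated control filter with the reference signal, pass the result through the fixed secondary path, and minimize the residual error energy. Signs, delays, and causality handling are simplified here, and the tensor layout is an assumption.

```python
import torch
import torch.nn.functional as F

def anc_residual_loss(control_filter, reference, disturbance, secondary_path):
    """control_filter: (1, 1, Lc); reference, disturbance: (1, 1, T);
    secondary_path: (1, 1, Ls). Returns mean squared residual at the error mic."""
    anti = F.conv1d(reference, control_filter, padding=control_filter.size(-1) - 1)
    anti = F.conv1d(anti, secondary_path, padding=secondary_path.size(-1) - 1)
    anti = anti[..., :disturbance.size(-1)]           # crude length alignment
    residual = disturbance - anti
    return (residual ** 2).mean()

# Because every operation is differentiable, the Transformer that emits
# control_filter can be trained end-to-end by backpropagating this loss,
# with no labeled target filters required.
```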
The experimental setup is robust, utilizing a large synthetic dataset of 83,977 noise samples and evaluating the model's performance on both unseen real-world and synthetic noises. The results indicate that the proposed method outperforms the existing GFANC framework in most real-noise scenarios, demonstrating its practical applicability. However, the performance on synthetic noises is mixed, suggesting that while the model excels in real-world conditions, it may not universally outperform all existing methods across all noise types. The evaluation metrics used, particularly the noise reduction (NR) levels, are appropriate for assessing ANC performance.
The paper provides sufficient detail regarding the model architecture, training parameters, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results. Future work could benefit from sharing the implementation details and datasets used for training and testing.
One significant limitation is the reliance on a fixed acoustic path during training and evaluation, which may not generalize well to different acoustic environments without retraining the model. Additionally, the increased complexity of the Transformer-based model, while beneficial for performance, raises concerns about computational efficiency and resource requirements, which could limit its deployment in real-time applications.
The proposed framework has the potential to significantly improve active noise control systems in various applications, including consumer electronics, automotive, and industrial environments. By enhancing adaptability to dynamic noise conditions, this research could lead to more effective noise cancellation solutions, improving user experience and comfort in noisy environments. The implications for real-time processing and deployment in practical scenarios are promising, although further work is needed to address the identified limitations.
Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on $5$ different tasks with $56$ benchmark datasets, and find that Alethia significantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitations of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.
Primary: Reality Defender Inc.
All Institutions: Reality Defender Inc., INRS
The main contribution of this paper is the introduction of Alethia, a foundational encoder for voice deepfakes that significantly enhances detection and localization capabilities through an innovative pretraining methodology. This work addresses critical gaps in existing models and sets a new standard for future research in the domain of audio deepfake detection.
The paper introduces a novel pretraining framework for voice deepfake detection, Alethia, which innovatively combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. This dual-branch approach allows the model to learn robust representations that capture generative artifacts in voice deepfakes, addressing limitations in existing speech foundation models (SFMs) that primarily focus on downstream finetuning. The methodology is well-structured, with a clear explanation of the model architecture, pretraining objectives, and the rationale behind the design choices, such as the use of continuous embeddings instead of discrete tokens.
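For intuition about the generative branch, here is a minimal rectified-flow-style flow-matching sketch for conditional spectrogram reconstruction; the network signature and variable names are illustrative, not the paper's implementation.

```python
import torch

def flow_matching_loss(velocity_net, spec_target, cond_embedding):
    """Interpolate between Gaussian noise and the target spectrogram along a
    straight path and regress the constant velocity field, conditioned on
    encoder embeddings."""
    x1 = spec_target                                  # (B, F, T) clean spectrogram
    x0 = torch.randn_like(x1)                         # sample from the Gaussian prior
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                       # linear interpolation path
    target_velocity = x1 - x0
    pred_velocity = velocity_net(x_t, t.flatten(), cond_embedding)
    return ((pred_velocity - target_velocity) ** 2).mean()
```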
The experimental evaluation is comprehensive, covering five different tasks across 56 benchmark datasets, which is a significant contribution to the field. The results demonstrate that Alethia outperforms existing SFMs in various metrics, including equal error rate (EER) and accuracy, particularly in challenging scenarios. The zero-shot generalization capability to unseen domains, such as singing deepfakes, is a notable strength of the model. However, the paper could benefit from more detailed ablation studies to further validate the contributions of each component in the proposed framework.
The paper provides a thorough description of the experimental setup, including data preprocessing, model architecture, and training procedures. However, the lack of publicly available code or datasets limits reproducibility. Providing a GitHub repository or links to the datasets used would enhance the ability of other researchers to replicate the findings.
One limitation of the study is the reliance on self-curated datasets for pretraining, which may introduce biases or artifacts not present in real-world data. Additionally, while the model shows promising results, its performance on edge cases or highly diverse datasets remains to be fully explored. The paper also does not address potential ethical implications of deepfake technology, which is crucial given the sensitive nature of the application.
The research has significant implications for the field of audio processing and deepfake detection, contributing to the development of more robust systems that can help mitigate the risks associated with the misuse of deepfake technology. As deepfakes become more prevalent, the ability to detect and localize them effectively is crucial for maintaining trust in digital communications.
Accented automatic speech recognition (ASR) often degrades due to the limited availability of accented training data. Prior work has explored accent modeling in low-resource settings, but existing approaches typically require minutes to hours of labeled speech, which may still be impractical for truly scarce accent scenarios. We propose a pipeline that adapts a text-to-speech (TTS) decoder to a target-accent speaker using fewer than ten reference utterances and employs large language model (LLM)-based phoneme editing to generate accent-conditioned pronunciations. The resulting synthetic speech is used to fine-tune a self-supervised ASR model. Experiments demonstrate consistent word error rate (WER) reductions on real accented speech, including cross-speaker evaluation and ultra-low data regimes. A matched-rate random phoneme baseline shows that phoneme-space perturbation itself is a strong form of augmentation, while LLM-guided edits provide additional gains through accent-conditioned structure.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, National Center for Supercomputing Applications
The main contribution of this paper is the development of a few-shot accent synthesis pipeline that leverages LLM-guided phoneme editing to improve ASR performance in low-resource settings. This innovative approach not only addresses the challenge of accent adaptation but also demonstrates the effectiveness of combining TTS and ASR technologies to enhance speech recognition across diverse accents.
The proposed methodology effectively combines few-shot learning with LLM-guided phoneme editing to address the challenge of accent adaptation in ASR systems. The approach is innovative in its use of a phoneme-conditioned TTS model and the integration of LLMs for phoneme editing, which allows for accent-specific pronunciation adjustments while maintaining prosodic alignment. The system's architecture is well-defined, and the use of a matched-rate random phoneme baseline provides a strong comparative framework to evaluate the effectiveness of the LLM-guided edits.
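Since the matched-rate random baseline is central to the analysis, here is a toy sketch of such a perturbation; the phoneme inventory, edit-rate handling, and function name are illustrative only, and the LLM-guided edits would replace the random choices with accent-conditioned substitutions.

```python
import random

def random_phoneme_edit(phonemes, edit_rate, inventory, seed=0):
    """Substitute a fraction of phonemes uniformly at random, matching the
    edit rate of the LLM-guided accent edits (toy baseline sketch)."""
    rng = random.Random(seed)
    edited = list(phonemes)
    n_edits = min(round(edit_rate * len(edited)), len(edited))
    for idx in rng.sample(range(len(edited)), k=n_edits):
        edited[idx] = rng.choice(inventory)
    return edited

# Example: roughly 20% of phonemes perturbed before the sequence is passed
# to the phoneme-conditioned TTS decoder.
print(random_phoneme_edit(["DH", "IH", "S", "IH", "Z", "AH", "T", "EH", "S", "T"],
                          edit_rate=0.2, inventory=["T", "D", "V", "W", "R"]))
```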
The experiments are comprehensive, evaluating the proposed method across multiple accents (Indian and Korean English) and demonstrating significant improvements in WER through synthetic data generation. The paper provides a clear experimental setup, including detailed descriptions of the datasets, evaluation metrics, and results. The findings indicate that the proposed method not only enhances ASR performance in low-resource scenarios but also shows potential for cross-speaker generalization, which is a critical aspect of practical ASR applications.
The paper includes sufficient implementation details, including training configurations, feature extraction methods, and evaluation protocols, which support reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors should consider releasing their code and models to enhance reproducibility.
One notable limitation is that the system inherits prosody from the source speech rather than modeling accent-specific prosodic variations, which may restrict the fidelity of the synthesized speech. Additionally, the adaptation is limited to a single reference speaker, which could affect the generalizability of the results across different speakers and accents. Future work should address these limitations by exploring multi-speaker accent generation and explicit prosody modeling.
The research has significant implications for improving ASR systems in diverse linguistic contexts, particularly for underrepresented accents. By enabling effective accent adaptation with minimal data, this work can contribute to more inclusive speech technologies that better serve global populations. The potential applications extend to various domains, including voice assistants, transcription services, and accessibility tools, enhancing communication for speakers of different accents.
Broad exploration of robocall surveillance is hindered by limited access to public datasets, largely owing to privacy concerns. In this work, we first curate Robo-SAr, a synthetic robocall dataset designed for robocall surveillance research. Robo-SAr comprises ~200 unwanted and ~1200 legitimate synthetic robocall samples across three realistic adversarial axes: psycholinguistics-manipulated transcripts, emotion-eliciting speech, and cloned voices. We further propose RoboKA, a Kolmogorov-Arnold Network (KAN)-based multimodal fusion framework designed to model structured nonlinear interactions between acoustic and linguistic cues that characterize diverse adversarial robocall strategies. RoboKA first leverages cross-modal contrastive learning to align latent modality representations and feeds the resulting embeddings to a KAN-projection head for final classification. We benchmark RoboKA against strong unimodal and multimodal baselines in both in-domain and out-of-domain setups, finding RoboKA to surpass all baselines in terms of recall and F1-score.
Primary: Indraprastha Institute of Information Technology Delhi
All Institutions: Indraprastha Institute of Information Technology Delhi, George Mason University
The main contribution of this paper is the introduction of Robo-SAr, a novel adversarial dataset for robocall surveillance, and the development of RoboKA, a KAN-informed multimodal framework that significantly improves the detection of unwanted calls. This work addresses critical gaps in the field by providing a comprehensive approach to modeling the complex interactions between audio and linguistic cues, thereby advancing the state of the art in robocall detection.
The methodology presented in this paper is robust and innovative, leveraging a novel dataset (Robo-SAr) that addresses the limitations of existing datasets in robocall research. The use of Kolmogorov-Arnold Networks (KAN) for multimodal fusion is a significant advancement, as it allows for the modeling of complex nonlinear interactions between audio and text modalities. The cross-modal contrastive learning approach enhances the alignment of representations, which is crucial for effective robocall detection. The authors also provide a clear explanation of their methods and the rationale behind their choices, making the methodology both sound and well-justified.
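To give a feel for what a KAN-style projection head looks like compared with an ordinary linear layer, here is a very rough sketch using a sine-basis expansion per input dimension; this is a generic illustration of the KAN idea, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TinyKANLayer(nn.Module):
    """Rough KAN-style layer: each scalar input passes through a learnable
    univariate function (a small sine-basis expansion with learned coefficients),
    and each output sums these per-edge functions."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.register_buffer("freqs", torch.arange(1, n_basis + 1).float())
        self.coeff = nn.Parameter(0.1 * torch.randn(in_dim, out_dim, n_basis))

    def forward(self, x):                                 # x: (B, in_dim)
        basis = torch.sin(x.unsqueeze(-1) * self.freqs)   # (B, in_dim, n_basis)
        return torch.einsum("bik,iok->bo", basis, self.coeff)

# In a RoboKA-like setup, the contrastively aligned audio and text embeddings
# would be concatenated and fed to such a head for the final
# unwanted-vs-legitimate decision.
head = TinyKANLayer(in_dim=512, out_dim=2)
logits = head(torch.randn(4, 512))
```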
The experimental evaluation is comprehensive, benchmarking RoboKA against various unimodal and multimodal baselines under different conditions, including in-domain and out-of-domain setups. The results demonstrate a clear performance advantage for RoboKA, particularly in challenging scenarios, which underscores the effectiveness of the proposed approach. The use of human validation for the dataset adds credibility to the findings, although the paper could benefit from more detailed statistical analysis of the results.
The paper commits to releasing the dataset and code upon review, which is a positive step towards ensuring reproducibility. However, the lack of explicit URLs for accessing the dataset and code is a drawback. The methodology is described in sufficient detail to allow for replication, but the absence of a demo or project URL limits immediate accessibility for other researchers.
The paper acknowledges several limitations, including the focus on English language robocalls, which restricts the applicability of the findings to multilingual contexts. Additionally, the reliance on synthetic data raises questions about the generalizability of the results to real-world scenarios. The authors also note that the dataset may not fully capture the complexities of real-world robocalls, which could impact the robustness of the model in practical applications.
The implications of this research are significant, particularly in the context of increasing robocall threats. By providing a robust framework for detecting deceptive robocalls, this work has the potential to enhance consumer protection and inform regulatory efforts. The methodology could also be adapted for other domains where multimodal deception detection is relevant, such as phishing or online scams.
We show that pretrained acoustic embeddings classify elephant vocalisations at a level approaching that of end-to-end supervised neural networks, without any fine-tuning of the embedding model. This result is of practical importance because annotated bioacoustic data are scarce and costly to obtain, leaving conventional supervised approaches prone to overfitting and to poor generalisation under domain shift. A broad range of embedding models drawn from general audio, speech, and bioacoustic domains is evaluated, all of which are either out-of-domain (containing no bioacoustic data) or out-of-species (containing no elephant call data). The embedding networks themselves remain fixed; only the lightweight downstream classifiers, which include a linear model and several small neural networks, are trained. Among the models considered, Perch 2.0 achieves the best cross-validated classification performance, attaining AUCs of 0.849 on African bush elephant (Loxodonta africana) calls and 0.936 on Asian elephant (Elephas maximus) calls, with Perch 1.0 close behind. The best-performing system is within 2.2 % of an end-to-end supervised elephant call classification system. A layerwise analysis of pretrained transformer encoders, considered as embedding models, shows that intermediate representations outperform final-layer outputs. The second layer of both wav2vec2.0 and HuBERT encodes sufficient information for effective elephant call classification; truncation at this layer therefore preserves classification performance whilst retaining only approximately 10 % of the parameters of the full network. Such compact embedding networks are well suited to on-device processing where computational resources are limited.
Primary: University of Stellenbosch
All Institutions: University of Stellenbosch
The paper presents a pioneering evaluation of elephant call classification using pretrained acoustic embeddings, achieving significant performance without fine-tuning. This work not only advances the field of bioacoustics but also sets a precedent for leveraging existing models in low-data scenarios, thereby enhancing conservation efforts through automated analysis of wildlife vocalizations.
The paper introduces a novel approach to elephant call classification using pretrained acoustic embeddings without fine-tuning, which is significant given the scarcity of annotated bioacoustic data. The methodology is well-structured, employing a variety of embedding models from different domains and evaluating their performance with lightweight classifiers. The choice to analyze intermediate layers of transformer models for their efficacy in classification is particularly innovative, providing insights into the model's internal representations. The segmentation and classification processes are clearly defined, ensuring a robust experimental design.
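The layer-truncation finding can be illustrated with a short sketch that keeps only the first two transformer layers of a wav2vec 2.0 encoder and uses their pooled hidden states as the frozen embedding; the checkpoint name, pooling, and input handling are illustrative choices, not the paper's exact configuration.

```python
import torch
from transformers import Wav2Vec2Model

# Keep only the first two transformer layers and use layer-2 hidden states as
# the frozen embedding; only a lightweight downstream classifier is trained.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.encoder.layers = model.encoder.layers[:2]   # truncate the encoder stack
model.eval()

with torch.no_grad():
    waveform = torch.randn(1, 16000)              # 1 s of 16 kHz audio (toy input)
    hidden = model(waveform).last_hidden_state    # (1, frames, 768)
    clip_embedding = hidden.mean(dim=1)           # pooled vector for a linear probe
```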
The experiments are comprehensive, utilizing two distinct datasets for evaluation, which enhances the validity of the results. The performance metrics, including AUC and MAP, are appropriate for the classification task and allow for a nuanced understanding of model effectiveness. The results demonstrate that the best-performing embedding model, Perch 2.0, achieves competitive performance compared to end-to-end supervised models, highlighting the potential of using out-of-domain embeddings in low-resource settings.
The paper provides sufficient detail regarding the experimental setup, including data segmentation, model configurations, and hyperparameter tuning, which supports reproducibility. However, the lack of publicly available code or datasets limits the ease with which other researchers can replicate the study.
One notable limitation is the reliance on pretrained models that may not be strictly out-of-species, particularly with Perch 2.0, which raises questions about the generalizability of the findings. Additionally, the paper does not address potential biases in the datasets or the implications of using embeddings from models trained on other species.
The implications of this research extend beyond elephant call classification, as it demonstrates the utility of pretrained embeddings in bioacoustics, potentially influencing conservation strategies and wildlife management. The approach could be adapted for other endangered species, promoting the use of machine learning in ecological research and conservation efforts.
Audio-based stuttering systems to date have been trained for detection -- what disfluency is present now -- leaving prediction, the capability needed for closed-loop intervention, unstudied at deployable scale. We train a 616K-parameter CNN on SEP-28k (Apple, 20,131 three-second clips) to predict whether the next contiguous clip contains any disfluency. (1) Severity-selective precursor signal: on the episode-grouped test set, aggregate preblock AUC is modest (0.581 [0.542, 0.619]), but stratifying by upcoming event type reveals concentration on clinically severe events -- blocks 0.601 [0.554, 0.651] and sound repetitions 0.617 [0.567, 0.667] both exclude chance, while fillers (0.45) and word repetitions (0.49) are at chance. The aggregate objective converges to a severity-selective predictor because severe events carry prosodic precursors; fillers do not. (2) Cross-population transfer: without fine-tuning, the same checkpoint applied to 1,024 pediatric Children-Who-Stutter utterances (FluencyBank Teaching) attains AUC 0.674 for detection and 0.655 for prediction; DisfluencySpeech and LibriStutter reach 0.58-0.60 AUC. (3) Deployable on-device: lossless export to CoreML (1.19 MB), ONNX (40 KB), TFLite. Neural-Engine latency per 3 s window: 0.25 ms (iPhone 17 Pro Max, A19 Pro) to 0.55 ms (iPhone SE 3rd-gen and M1 Max). A 4 Hz streaming simulation uses 0.54% of the real-time budget. Outputs are Platt-calibrated (test ECE 0.010, down from 0.177 raw). Five negative ablations -- output-level Future-Guided Learning, a multi-clip GRU, time-axis concatenation, asymmetric focal loss, and direct block-targeted training -- all failed to improve over the vanilla baseline.
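As a concrete illustration of the calibration step reported above, the sketch below applies Platt scaling to raw scores and measures the expected calibration error (ECE); the bin count, the held-out split, and the synthetic scores are assumptions, not the paper's setup.

```python
# Sketch: Platt scaling of raw classifier scores plus expected calibration error.
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(val_scores, val_labels):
    """Fit a 1-D logistic regressor mapping raw scores to calibrated probabilities."""
    lr = LogisticRegression().fit(val_scores.reshape(-1, 1), val_labels)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per probability bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        mask = (probs >= lo) & ((probs < hi) if i < n_bins - 1 else (probs <= hi))
        if mask.any():
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece

# Toy scores standing in for the CNN's raw next-clip predictions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 600)
raw = labels + rng.normal(0.0, 1.5, 600)               # uncalibrated, poorly scaled scores
calibrate = platt_calibrate(raw[:300], labels[:300])   # fit on a held-out split
probs = calibrate(raw[300:])
print("ECE:", round(expected_calibration_error(probs, labels[300:]), 3))
```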
Primary: Kozak Technologies Inc
All Institutions: Kozak Technologies Inc
The main contribution of this paper is the development of a predictive model for stuttering events using audio data, demonstrating that a relatively simple CNN can effectively identify clinically severe disfluencies based on prosodic precursors. This work not only advances the understanding of stuttering prediction but also paves the way for practical applications in speech therapy and real-time intervention systems.
The paper employs a convolutional neural network (CNN) architecture specifically designed for predicting stuttering events based on audio input. The methodology is robust, utilizing a well-defined dataset (SEP-28k) and employing a clear training objective that focuses on predicting upcoming disfluencies. The stratification of results by severity of disfluency types is a significant methodological strength, allowing for a nuanced understanding of the model's predictive capabilities. The inclusion of negative ablation studies further strengthens the methodology by demonstrating a thorough exploration of potential improvements that did not yield better results.
The experiments are well-structured, with a clear focus on both detection and prediction tasks. The use of multiple datasets, including cross-population transfer evaluations, enhances the credibility of the findings. The reported AUC scores provide a quantitative measure of performance, and the stratified analysis reveals important insights into the model's strengths and weaknesses. The deployment metrics, including on-device latency and model size, are particularly relevant for practical applications, showcasing the model's readiness for real-world use.
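The interval-style AUC reporting discussed here can be reproduced with a simple percentile bootstrap; the resample count and the toy scores below are assumptions, not the paper's evaluation code.

```python
# Sketch: AUC with a percentile bootstrap confidence interval.
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc(scores, labels, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    aucs = []
    n = len(labels)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)
        if labels[idx].min() != labels[idx].max():       # resample must contain both classes
            aucs.append(roc_auc_score(labels[idx], scores[idx]))
    point = roc_auc_score(labels, scores)
    lo, hi = np.percentile(aucs, [2.5, 97.5])
    return point, lo, hi

# Toy data: predicted probabilities and whether the *next* clip is disfluent.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, 400)
scores = 0.3 * labels + rng.random(400)
print(bootstrap_auc(scores, labels))
```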
The paper emphasizes reproducibility by providing access to the training code, label-generation scripts, and the trained model weights. The detailed description of the training process, including hyperparameters and data preprocessing steps, further supports reproducibility. The inclusion of a catalog of negative results is a commendable practice that aids future research by preventing redundant efforts.
The paper acknowledges several limitations, including the single-clip context that may restrict the model's performance and the potential for variability across different speakers and datasets. The lack of fine-tuning on external datasets raises questions about the generalizability of the model's predictions. Additionally, the reliance on a coarse label for upcoming events could be improved with more precise annotations.
The research has significant implications for the field of speech therapy and assistive technologies for individuals who stutter. By enabling predictive capabilities in real-time, the model could facilitate closed-loop interventions that provide timely feedback to users. The deployment of such technology on consumer devices could enhance accessibility and usability for a broader audience, potentially improving the quality of life for many individuals.
Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large portion of overlapping speech, where identifying and transcribing a target speaker is difficult from audio alone. Visual cues can help resolve speaker ambiguity, yet their integration into long-context audio-visual (AV) ASR systems has been limited. The CHiME-9 MCoRec task addresses this challenge by requiring transcription of audio-visual recordings of heavily-overlapped parallel conversations, followed by clustering the participants into conversational groups. In this work, we present the BUT system, based on a long-context target-speaker AV-ASR model capable of processing long-form recordings in a single decoding pass. Our architecture conditions a pre-trained NVIDIA Parakeet-v2 ASR model on visual representations from a pre-trained AV-HuBERT model. To cluster participants into conversation groups, we employ the Qwen3.5-122B LLM to estimate transcript topic similarity, followed by hierarchical agglomerative clustering. On the MCoRec development set, the proposed system achieves 33.7% WER and a clustering F1 score of 0.97, improving over the official baseline by 16.2% WER and 0.15 F1 absolute. On the eval set, our team ranked second, 0.16% WER and 0.5% F1 behind the best system.
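The clustering stage described above -- pairwise topic-similarity scores turned into conversation groups -- can be sketched with standard agglomerative clustering; the similarity matrix, linkage, and threshold below are illustrative assumptions rather than the BUT system's actual values.

```python
# Sketch: conversation grouping from a pairwise topic-similarity matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical LLM-rated topic similarity (0..1) between 5 speakers' transcripts.
sim = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.2],
    [0.1, 0.2, 0.1, 1.0, 0.8],
    [0.2, 0.1, 0.2, 0.8, 1.0],
])
dist = 1.0 - sim
np.fill_diagonal(dist, 0.0)

# Average-linkage agglomerative clustering, cut at a similarity threshold of 0.5.
Z = linkage(squareform(dist, checks=False), method="average")
groups = fcluster(Z, t=0.5, criterion="distance")
print(groups)  # e.g. [1 1 1 2 2]: speakers 0-2 form one conversation, 3-4 another
```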
Primary: Brno University of Technology
All Institutions: Brno University of Technology
This paper presents a novel approach to multi-talker ASR by integrating audio-visual cues and leveraging LLMs for clustering, achieving significant improvements over existing methods. The methodology is well-structured, and the results indicate a meaningful contribution to the field, although attention to limitations and reproducibility could enhance its impact further.
The proposed methodology integrates audio-visual cues into a long-context ASR system, leveraging pre-trained models (NVIDIA Parakeet-v2 and AV-HuBERT) effectively. The use of a gated mechanism for fusing audio and visual features is a notable innovation, allowing the model to dynamically adjust its reliance on each modality. The clustering approach, which employs a large language model (LLM) for semantic topic similarity, represents a significant departure from traditional heuristic methods. This combination of techniques is well-justified and demonstrates a thoughtful approach to addressing the challenges of multi-talker ASR.
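A minimal sketch of the kind of gated audio-visual fusion described here is given below; the feature dimensions and the exact gating form are assumptions, not the system's published architecture.

```python
# Sketch: a learned gate decides, per frame and dimension, how much visual
# information to add to the acoustic stream.
import torch
import torch.nn as nn

class GatedAVFusion(nn.Module):
    def __init__(self, audio_dim=1024, visual_dim=768):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        self.gate = nn.Sequential(nn.Linear(2 * audio_dim, audio_dim), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim); visual_feats: (batch, time, visual_dim)
        v = self.visual_proj(visual_feats)
        g = self.gate(torch.cat([audio_feats, v], dim=-1))  # gate values in [0, 1]
        return audio_feats + g * v

fusion = GatedAVFusion()
a = torch.randn(2, 50, 1024)   # e.g. ASR encoder frames
v = torch.randn(2, 50, 768)    # e.g. time-aligned lip-reading features
print(fusion(a, v).shape)      # torch.Size([2, 50, 1024])
```

A sigmoid gate of this kind lets the model fall back to the audio stream alone when the visual input is uninformative (e.g. the speaker's face is occluded).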
The experimental setup is robust, with clear metrics for both transcription (WER) and clustering (F1 score). The authors provide a thorough analysis of their results, showing substantial improvements over the baseline. However, the reliance on synthetic data for training raises questions about the generalizability of the results to real-world scenarios. The evaluation on both development and eval sets, along with comparisons to baseline systems, adds credibility to their findings.
The paper includes sufficient implementation details, including the training regimen, data preprocessing, and the use of specific frameworks (NeMo and DSPy). The availability of the code on GitHub enhances reproducibility, although the authors could provide more detailed instructions for replicating the experiments.
One limitation is the potential domain mismatch between the synthetic training data and the real-world MCoRec dataset, which could affect the model's performance in practical applications. Additionally, while the clustering approach shows promise, its reliance on LLMs may introduce variability based on the model's performance and the quality of the transcripts.
The advancements in multi-talker ASR have significant implications for applications in various fields, including telecommunications, accessibility for the hearing impaired, and human-computer interaction. The integration of visual cues into ASR systems could lead to more robust and accurate transcription services, enhancing communication in noisy environments.
Conventional neural speech codecs suffer from severe intelligibility degradation at ultra-low bitrates, where the bottleneck transitions from acoustic distortion to semantic loss. To address this issue, this paper conducts a systematic investigation into the role and fundamental limits of integrating frozen semantic priors -- specifically HuBERT and Whisper -- into neural speech coding. We introduce and quantitatively validate a novel Semantic Retirement phenomenon: while semantic constraints reduce the Word Error Rate (WER) by up to ~10% relative at 1.5 kbps, their benefits rapidly diminish beyond 6 kbps, indicating a practical capacity boundary. We further uncover a clear trade-off between different prior types: acoustic-rich priors (HuBERT) better preserve prosodic and timbral details, whereas high-level linguistic priors (Whisper) effectively suppress phonetic hallucinations in noisy environments (reducing hallucination rates by 26%) and substantially narrow the generalization gap for unseen speakers. Building on these findings, we propose a bitrate-aware regulation strategy that dynamically adjusts prior strength to optimize the trade-off between semantic consistency and perceptual naturalness. Extensive experimental evaluations confirm that our approach achieves competitive intelligibility and noise robustness compared to existing baselines, offering a principled pathway toward ultra-low-bitrate generative speech coding.
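One simple way to realise a bitrate-aware regulation of prior strength is a schedule that decays the semantic-distillation weight as bitrate grows; the sketch below is a generic illustration under assumed anchor points (1.5 and 6 kbps) and loss terms, not the paper's exact formulation.

```python
# Sketch: bitrate-aware weighting of a semantic-prior term in a codec loss.
import torch
import torch.nn.functional as F

def prior_weight(bitrate_kbps, low=1.5, high=6.0, w_max=1.0):
    """Full strength at/below `low` kbps, linearly retired to zero by `high` kbps."""
    t = (bitrate_kbps - low) / (high - low)
    return w_max * float(max(0.0, min(1.0, 1.0 - t)))

def codec_loss(decoded, target, codec_latent, semantic_target, bitrate_kbps):
    recon = F.l1_loss(decoded, target)                   # acoustic reconstruction term
    semantic = 1 - F.cosine_similarity(codec_latent, semantic_target, dim=-1).mean()
    return recon + prior_weight(bitrate_kbps) * semantic

# Toy tensors standing in for decoder output, target waveform, codec latents,
# and frozen HuBERT/Whisper features projected to the same dimension.
decoded, target = torch.randn(2, 16000), torch.randn(2, 16000)
latent, sem = torch.randn(2, 50, 256), torch.randn(2, 50, 256)
print(codec_loss(decoded, target, latent, sem, bitrate_kbps=1.5))
print(codec_loss(decoded, target, latent, sem, bitrate_kbps=9.0))  # prior fully retired
```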
Primary: Tsinghua Shenzhen International Graduate School, Tsinghua University
All Institutions: Tsinghua Shenzhen International Graduate School, Tsinghua University, Tencent
This paper presents a comprehensive analysis of the role of semantic priors in neural speech coding, introducing a novel framework that enhances intelligibility and robustness at ultra-low bitrates. The innovative methodology and thorough experimental evaluation contribute significantly to the field of audio processing, addressing a critical challenge in speech codec design.
The methodology presented in this paper is robust and well-structured. The authors propose a novel framework that integrates frozen semantic priors (HuBERT and Whisper) into a neural speech codec, addressing the challenges of intelligibility degradation at ultra-low bitrates. The introduction of the "Semantic Retirement" phenomenon is a significant contribution, as it quantitatively defines the limits of semantic guidance in speech coding. The bitrate-aware regulation strategy is particularly innovative, allowing the model to dynamically adjust the strength of semantic constraints based on the bitrate, which is a practical approach to optimize performance across varying conditions.
The experimental evaluation is extensive and well-executed, utilizing the LibriSpeech dataset to validate the proposed framework. The authors provide a thorough analysis of the performance metrics, including Word Error Rate (WER), Perceptual Evaluation of Speech Quality (PESQ), and robustness against noise. The results convincingly demonstrate the effectiveness of the proposed method in improving intelligibility and reducing hallucination rates, particularly in low-bitrate scenarios. The ablation studies further strengthen the findings by isolating the effects of different semantic priors and the regulation strategy.
The paper includes sufficient implementation details, such as the architecture of the neural codec, the configuration of the Residual Vector Quantization, and the training setup. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Providing access to the models and datasets used would enhance the ability of other researchers to replicate and build upon this work.
One limitation is the reliance on frozen semantic priors, which may not capture the full range of acoustic nuances needed for optimal performance in all scenarios. Additionally, the paper primarily focuses on two specific priors (HuBERT and Whisper), which may limit the generalizability of the findings to other types of semantic guidance. The authors also acknowledge the potential for over-smoothing at higher bitrates, which could affect the naturalness of the output.
The findings of this research have significant implications for the development of efficient speech coding systems, particularly in applications where bandwidth is severely limited, such as mobile communications and low-bitrate streaming services. The insights gained from the "Semantic Retirement" phenomenon could inform future research on codec design and the integration of semantic information into other audio processing tasks. The approach could also pave the way for advancements in speech synthesis and recognition systems that require high intelligibility in challenging acoustic environments.
Achieving robust generalization against unseen attacks remains a challenge in Audio Deepfake Detection (ADD), driven by the rapid evolution of generative models. To address this, we propose a framework centered on hard sample classification. The core idea is that a model capable of distinguishing hard samples is inherently equipped to handle simpler cases effectively. We investigate multiple reconstruction paradigms, identifying the diffusion-based method as optimal for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to enhance generalizability. Experiments demonstrate the superior generalization of our approach, with our best model achieving a significant reduction in the average Equal Error Rate (EER) compared to the baseline.
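The general shape of a contrastive objective with an auxiliary regularizer, in the spirit of RACL, is sketched below; the temperature, the norm penalty, and the batch construction are assumptions, and the exact RACL formulation is not reproduced here.

```python
# Illustrative supervised contrastive loss with a simple regularization term.
import torch
import torch.nn.functional as F

def contrastive_with_regularizer(embeddings, labels, temperature=0.1, reg_weight=0.01):
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, -1e9)                        # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    same = ((labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye).float()
    pos_count = same.sum(1).clamp(min=1)
    contrastive = -((same * log_prob).sum(1) / pos_count).mean()
    reg = embeddings.pow(2).sum(-1).mean()                  # norm regularizer (assumption)
    return contrastive + reg_weight * reg

# Toy batch: pooled detector features with bona fide (0) vs. spoof/hard-sample (1) labels.
feats = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 2, (16,))
loss = contrastive_with_regularizer(feats, labels)
loss.backward()
print(float(loss))
```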
Primary: Southern University of Science and Technology
All Institutions: Southern University of Science and Technology, Tencent Youtu Lab
The main contribution of this paper is the development of a robust framework for Audio Deepfake Detection that leverages hard sample classification and diffusion-based reconstruction to enhance generalization against unseen attacks. This work represents a meaningful advancement in the field of audio deepfake detection, addressing critical challenges posed by evolving generative models.
The paper proposes a novel framework for Audio Deepfake Detection (ADD) that emphasizes hard sample classification and utilizes diffusion-based reconstruction methods. The integration of multi-layer feature aggregation and the introduction of Regularization-Assisted Contrastive Learning (RACL) are significant contributions that enhance the model's generalization capabilities. The methodology is well-structured, with clear explanations of the reconstruction paradigms and loss functions employed. However, while the approach is innovative, it builds on existing concepts in contrastive learning and reconstruction, which slightly limits its novelty.
The experiments are comprehensive, evaluating the proposed methods across multiple datasets, including ASVspoof and CodecFake. The results demonstrate a significant reduction in the average Equal Error Rate (EER) compared to baseline models, showcasing the effectiveness of the proposed framework. The ablation studies provide insights into the contributions of different components of the methodology, reinforcing the validity of the findings. However, the paper could benefit from a more detailed analysis of potential edge cases or scenarios where the model may underperform.
The implementation details are sufficiently detailed, including data preprocessing, model architecture, and training parameters, which enhances reproducibility. However, the absence of a publicly available code repository or demo limits the ability for other researchers to replicate the results directly.
One limitation is the reliance on specific reconstruction methods, which may not generalize well across all types of audio deepfakes. Additionally, the performance on certain datasets showed minor degradation, suggesting that the model may prioritize generalization over specific artifacts. The paper could also discuss potential biases in the datasets used for training and evaluation.
The implications of this research are significant, particularly in the context of security and misinformation, as robust audio deepfake detection systems are crucial for maintaining trust in audio communications. The proposed framework could be applied in various domains, including cybersecurity, media verification, and social media platforms, where audio authenticity is paramount.
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals that the metric rewards acoustic mimicry over genuine emotional synthesis.
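For reference, the generic form of the similarity recipe being challenged -- cosine similarity between utterance-level emotion embeddings of reference and generated audio -- looks like the sketch below; the embedding extractor is left abstract and no specific emotion2vec API is assumed.

```python
# Sketch: embedding-based emotion similarity between reference and generated speech.
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def emo_sim(embed_fn, reference_wav, generated_wav):
    """Score in [-1, 1]; higher is read as 'more similar emotional expression'."""
    return cosine_similarity(embed_fn(reference_wav), embed_fn(generated_wav))

# Toy stand-in extractor: any fixed-dimensional utterance embedding would slot in here.
rng = np.random.default_rng(0)
fake_embed = lambda wav: rng.standard_normal(768)
print(emo_sim(fake_embed, np.zeros(16000), np.zeros(16000)))
```

The paper's argument is that whatever is slotted into `embed_fn`, the resulting score inherits linguistic and speaker interference from the embedding space, which is what the adversarial and human-alignment tests expose.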
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Southern California
The paper critically examines the limitations of the emotion similarity metric EMO-SIM in evaluating emotional expressiveness in speech generation, revealing its misalignment with human perception and robustness issues. This comprehensive analysis challenges existing methodologies and underscores the need for improved evaluation frameworks in the field.
The paper employs a systematic approach to evaluate the limitations of the widely adopted EMO-SIM metric for emotional expressiveness in speech generation. It rigorously tests the metric against three criteria: categorical emotion robustness, dimensional emotion sensitivity, and human perception alignment. The methodology includes adversarial sampling, calibration of latent spaces, and a comprehensive evaluation against human judgments, which is a significant strength. However, the lack of a clear new metric or framework to replace EMO-SIM is a notable gap.
The experiments are well-designed, utilizing diverse datasets and multiple evaluation scenarios to assess the performance of EMO-SIM. The results consistently demonstrate the metric's inadequacy in capturing genuine emotional expressiveness, particularly under various acoustic and linguistic distractors. The statistical analyses, including Spearman's correlation and triplet accuracy, provide robust evidence of the findings. However, the paper could benefit from additional comparisons with existing metrics to contextualize its claims further.
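The two alignment statistics mentioned here can be made concrete with a short sketch: Spearman rank correlation between metric scores and human ratings, and triplet accuracy over human-labelled (more-similar, less-similar) pairs; the toy numbers below are illustrative.

```python
# Sketch: human-alignment statistics for a similarity metric.
import numpy as np
from scipy.stats import spearmanr

def triplet_accuracy(metric_scores_pos, metric_scores_neg):
    """Fraction of (anchor, positive, negative) triplets where the metric
    scores the human-preferred candidate above the other."""
    pos = np.asarray(metric_scores_pos)
    neg = np.asarray(metric_scores_neg)
    return float((pos > neg).mean())

rng = np.random.default_rng(0)
human = rng.random(50)                        # human similarity ratings
metric = 0.4 * human + 0.6 * rng.random(50)   # partially correlated metric scores
rho, p = spearmanr(metric, human)
print(f"Spearman rho={rho:.2f} (p={p:.3f})")
print("Triplet accuracy:", triplet_accuracy(rng.random(30) + 0.2, rng.random(30)))
```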
The paper provides sufficient detail on the experimental setup, including dataset preparation and evaluation criteria, which aids reproducibility. However, the absence of publicly available code or datasets limits the ability for other researchers to replicate the findings fully.
The primary limitation is the lack of a proposed alternative metric to EMO-SIM, which leaves a gap in practical applicability. Additionally, the focus on a single metric may overlook other potential evaluation frameworks that could be more effective. The experiments also rely heavily on subjective human evaluations, which may introduce variability.
This work has significant implications for the development of more reliable metrics in speech synthesis and emotional voice conversion, which are critical for applications in human-computer interaction, entertainment, and accessibility technologies. By highlighting the deficiencies of current evaluation methods, it encourages the community to pursue more accurate and meaningful metrics for emotional expressiveness in generated speech.