Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.
Primary: Tsinghua University
All Institutions: Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua University, Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University
The paper introduces HQ-MPSD, a high-quality multilingual dataset for partial deepfake speech detection, addressing critical gaps in existing datasets and providing a rigorous benchmark for evaluating detection models. The comprehensive methodology and experimental evaluations demonstrate significant contributions to the field, paving the way for advancements in robust detection systems.
The methodology for constructing the HQ-MPSD dataset is robust and innovative. It employs a three-stage process for generating partial deepfake speech that emphasizes linguistic coherence and acoustic fidelity. The use of fine-grained forced alignment for splice points and the normalization of loudness and spectral characteristics are noteworthy techniques that enhance the quality of the dataset. Additionally, the incorporation of background effects to simulate real-world conditions is a significant improvement over existing datasets. The careful design choices made to minimize artifacts and ensure natural transitions contribute to the dataset's overall quality and applicability for training detection models.
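As a concrete illustration of the splice construction described above, the sketch below inserts a synthetic continuation into a bona fide utterance at a forced-alignment word boundary, with RMS matching and a short equal-power crossfade. The 10 ms fade, the RMS normalization, and the toy sine-wave signals are illustrative assumptions rather than the authors' exact pipeline.

```python
# Minimal sketch of a boundary-aware splice: insert a synthetic segment into a
# bona fide utterance at a word boundary, with RMS matching and a short
# crossfade so the transition carries no abrupt level discontinuity.
# The 10 ms crossfade and simple RMS normalization are illustrative choices.
import numpy as np

def rms(x: np.ndarray) -> float:
    return float(np.sqrt(np.mean(x ** 2) + 1e-12))

def splice_at_boundary(real: np.ndarray, fake: np.ndarray,
                       boundary_sample: int, sr: int = 16000,
                       fade_ms: float = 10.0) -> np.ndarray:
    """Replace everything after `boundary_sample` in `real` with `fake`."""
    fade = int(sr * fade_ms / 1000)
    # Match the loudness of the spliced-in segment to the host utterance.
    fake = fake * (rms(real) / rms(fake))
    head, tail = real[:boundary_sample], fake
    # Equal-power crossfade over the overlap region.
    ramp = np.linspace(0.0, 1.0, fade)
    overlap = head[-fade:] * np.sqrt(1 - ramp) + tail[:fade] * np.sqrt(ramp)
    return np.concatenate([head[:-fade], overlap, tail[fade:]])

# Toy usage with sine tones standing in for real/fake speech.
sr = 16000
t = np.arange(sr) / sr
real = 0.1 * np.sin(2 * np.pi * 220 * t)
fake = 0.3 * np.sin(2 * np.pi * 220 * t)
mixed = splice_at_boundary(real, fake, boundary_sample=8000, sr=sr)
print(mixed.shape)
```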
The experiments conducted using HQ-MPSD are comprehensive and well-structured. The cross-language and cross-dataset evaluations provide valuable insights into the generalization capabilities of state-of-the-art detection models. The performance drop observed in existing models when tested on HQ-MPSD highlights the dataset's effectiveness in revealing the limitations of current methodologies. The use of metrics such as Equal Error Rate (EER) and Area Under the Curve (AUC) for evaluation is appropriate and provides a clear understanding of model performance.
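For reference, the snippet below shows how the two headline metrics cited here, EER and AUC, are typically computed from per-utterance detection scores; the random scores and labels are placeholders.

```python
# AUC from the ROC curve, and EER as the operating point where the
# false-positive and false-negative rates cross.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)                    # 1 = fake, 0 = bona fide
scores = rng.normal(loc=labels.astype(float), scale=1.0)  # higher = more fake

auc = roc_auc_score(labels, scores)
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1.0 - tpr
eer_index = int(np.nanargmin(np.abs(fnr - fpr)))
eer = (fpr[eer_index] + fnr[eer_index]) / 2.0
print(f"AUC={auc:.3f}  EER={eer * 100:.2f}%")
```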
The paper provides sufficient detail regarding the dataset generation process and experimental setup, which aids in reproducibility. However, the lack of a publicly available code repository limits the ability for others to fully replicate the experiments. The dataset itself is accessible, which is a positive aspect for researchers looking to build upon this work.
While the dataset is a significant advancement, it may still have limitations regarding the diversity of accents and dialects within the eight languages represented. Additionally, the reliance on forced alignment may introduce its own biases, particularly if the alignment tools are not perfectly accurate. The paper does not address potential ethical concerns related to the misuse of deepfake technology, which is an important consideration in this field.
The development of HQ-MPSD has the potential to significantly advance the field of deepfake detection by providing a high-quality, multilingual benchmark that can improve the robustness of detection models. The dataset's design encourages the exploration of genuine manipulation cues rather than superficial artifacts, which can lead to more effective solutions in real-world applications. This work is particularly relevant in the context of misinformation and security, where the ability to detect partial deepfake speech can have substantial societal implications.
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai), the Zhipu Qingyan app/web (chatglm.cn).
Primary: Zhipu AI
All Institutions: Zhipu AI, Tsinghua University
GLM-TTS presents a robust framework for efficient and high-quality text-to-speech synthesis, effectively addressing critical challenges in the field. The innovative use of reinforcement learning and hybrid input mechanisms positions it as a significant contribution to advancing TTS technology, particularly for languages with complex phonetic structures.
The methodology of GLM-TTS is well-structured, utilizing a two-stage architecture that effectively combines autoregressive and diffusion models for TTS. The introduction of a multi-reward reinforcement learning framework is particularly innovative, addressing common challenges in TTS systems such as pronunciation accuracy and emotional expressiveness. The use of a hybrid phoneme-text input scheme and optimized speech tokenizer enhances the system's controllability and adaptability, especially for languages with complex phonetic structures like Chinese. The detailed data processing pipeline and the enhancements made to the speech tokenizer demonstrate a thorough understanding of the underlying challenges in TTS.
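As a rough sketch of how a GRPO-style multi-reward objective can be assembled, the code below combines hypothetical pronunciation, speaker-similarity, and prosody rewards with assumed weights and converts them into group-relative advantages; the reward values, weights, and function names are illustrative, not GLM-TTS's actual reward models.

```python
# Rough sketch of forming GRPO-style group-relative advantages from several
# reward signals (pronunciation, speaker similarity, prosody) for one prompt
# group of sampled outputs. All values and weights are placeholders.
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: (group_size,) scalar rewards for samples of one prompt."""
    mean, std = rewards.mean(), rewards.std() + 1e-8
    return (rewards - mean) / std

# Per-sample scores from three hypothetical reward models.
pron = np.array([0.92, 0.85, 0.97, 0.60])   # e.g. 1 - CER
spk  = np.array([0.80, 0.83, 0.78, 0.75])   # speaker-embedding cosine
pros = np.array([0.70, 0.65, 0.90, 0.50])   # prosody/expressiveness score
weights = {"pron": 0.5, "spk": 0.3, "pros": 0.2}   # assumed mixing weights

total = weights["pron"] * pron + weights["spk"] * spk + weights["pros"] * pros
adv = group_relative_advantages(total)
print(adv)  # positive advantages upweight those samples in the policy update
```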
The experiments conducted are comprehensive, comparing GLM-TTS against state-of-the-art models across various benchmarks. The results indicate that GLM-TTS achieves competitive performance with significantly less training data, which is a notable achievement. The evaluation metrics used, including CER, WER, and SIM, provide a clear picture of the system's capabilities. However, the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper provides a link to the code repository and demo, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific datasets used are somewhat limited. More explicit information on the experimental setup would enhance reproducibility.
One limitation is the reliance on proprietary datasets, which may hinder the generalizability of the results. Additionally, while the system shows promise in emotional expressiveness, the paper acknowledges that the performance may vary across different emotional contexts, indicating potential areas for improvement. The complexity of the model may also pose challenges for deployment in resource-constrained environments.
The GLM-TTS system has significant implications for various applications, including virtual assistants, educational tools, and content creation. Its ability to generate high-fidelity, expressive speech with reduced training data makes it accessible for low-resource scenarios, potentially democratizing TTS technology. The focus on controllability and customization also opens avenues for personalized applications in diverse linguistic contexts.
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations, and (ii) audio-audio contrastive learning, via the Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
Primary: Indian Institute of Technology Hyderabad
All Institutions: Indian Institute of Technology Hyderabad
The paper introduces a novel joint multimodal contrastive learning framework for robust spoken term detection and keyword spotting, demonstrating significant improvements over existing methods. The comprehensive methodology and rigorous experimental evaluation highlight its potential impact on the field of audio processing and machine learning.
The proposed joint multimodal contrastive learning framework effectively integrates audio and text modalities into a unified embedding space, addressing significant limitations of existing Acoustic Word Embedding (AWE) methods. The dual optimization of audio-text and audio-audio contrastive learning is innovative, leveraging the strengths of both modalities while enhancing intra-class compactness and inter-class separation. The methodology is well-structured, with clear explanations of the loss functions and training regime, although further details on hyperparameter tuning could enhance clarity.
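To make the two loss terms concrete, the sketch below pairs a CLAP-style symmetric audio-text InfoNCE loss with a supervised audio-audio contrastive term over word labels. The audio-audio form is a generic stand-in for the DWD loss, and all tensors and dimensions are placeholders.

```python
# Sketch of the two loss terms: a CLAP-style symmetric audio-text InfoNCE loss
# plus an audio-audio contrastive term that pulls together embeddings of the
# same word (a generic supervised-contrastive stand-in for DWD).
import torch
import torch.nn.functional as F

def audio_text_infonce(a: torch.Tensor, t: torch.Tensor, tau: float = 0.07):
    a, t = F.normalize(a, dim=-1), F.normalize(t, dim=-1)
    logits = a @ t.T / tau                      # (B, B) similarity matrix
    target = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, target) +
                  F.cross_entropy(logits.T, target))

def audio_audio_contrastive(a: torch.Tensor, word_ids: torch.Tensor,
                            tau: float = 0.1):
    a = F.normalize(a, dim=-1)
    sim = a @ a.T / tau
    mask = word_ids.unsqueeze(0) == word_ids.unsqueeze(1)
    mask.fill_diagonal_(False)                  # positives exclude self-pairs
    logits = sim - torch.eye(a.size(0)) * 1e9   # mask the diagonal in the denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = mask.sum(1).clamp(min=1)
    return -(log_prob * mask).sum(1).div(pos_counts).mean()

B, D = 8, 256
audio_emb = torch.randn(B, D)
text_emb = torch.randn(B, D)
word_ids = torch.randint(0, 4, (B,))
loss = audio_text_infonce(audio_emb, text_emb) + \
       audio_audio_contrastive(audio_emb, word_ids)
print(loss.item())
```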
The experiments are robust, utilizing the LibriSpeech corpus to evaluate the proposed model against multiple baselines. The performance metrics, including Average Precision (AP) and Equal Error Rate (EER), provide a comprehensive view of the model's capabilities in both Spoken Term Detection (STD) and Keyword Spotting (KWS). The results demonstrate consistent improvements over existing methods, particularly in challenging conditions, which underscores the effectiveness of the proposed approach.
The authors emphasize reproducibility by releasing a standardized evaluation framework and the trial generation recipe alongside their codebase. This commitment to transparency is commendable and facilitates further research in the field. However, more detailed documentation on the training process and hyperparameter settings would be beneficial for full reproducibility.
While the paper presents a significant advancement, it does not extensively discuss the potential computational costs associated with the proposed model, particularly in real-time applications. Additionally, the reliance on the LibriSpeech dataset may limit the generalizability of the findings to other languages or dialects.
The proposed framework has the potential to significantly improve spoken content retrieval systems, making them more robust to variations in speaker and background noise. This advancement could enhance accessibility in various applications, such as voice-activated systems and automated transcription services, thereby contributing to the broader adoption of speech technologies.
Speech Emotion Recognition (SER) systems often degrade in performance when exposed to the unpredictable acoustic interference found in real-world environments. Additionally, the opacity of deep learning models hinders their adoption in trust-sensitive applications. To bridge this gap, we propose a Hybrid Transformer-CNN framework that unifies the contextual modeling of Wav2Vec 2.0 with the spectral stability of 1D-Convolutional Neural Networks. Our dual-stream architecture processes raw waveforms to capture long-range temporal dependencies while simultaneously extracting noise-resistant spectral features (MFCC, ZCR, RMSE) via a custom Attentive Temporal Pooling mechanism. We conducted extensive validation across four diverse benchmark datasets: RAVDESS, TESS, SAVEE, and CREMA-D. To rigorously test robustness, we subjected the model to non-stationary acoustic interference using real-world noise profiles from the SAS-KIIT dataset. The proposed framework demonstrates superior generalization and state-of-the-art accuracy across all datasets, significantly outperforming single-branch baselines under realistic environmental interference. Furthermore, we address the "black-box" problem by integrating SHAP and Score-CAM into the evaluation pipeline. These tools provide granular visual explanations, revealing how the model strategically shifts attention between temporal and spectral cues to maintain reliability in the presence of complex environmental noise.
Primary: KIIT Deemed to be University
All Institutions: AmygdalaAI-India Lab, KIIT Deemed to be University
The main contribution of this paper is the development of a Hybrid Transformer-CNN framework for noise-robust speech emotion recognition that effectively combines temporal and spectral features while providing explainability through SHAP and Score-CAM. This research significantly advances the state-of-the-art in SER by addressing both performance and interpretability, making it highly relevant for practical applications in diverse acoustic environments.
The proposed methodology combines a Transformer model (Wav2Vec 2.0) with a CNN to create a dual-stream architecture that captures both long-range temporal dependencies and noise-resistant spectral features. The integration of an Attentive Temporal Pooling mechanism is a significant innovation, allowing the model to focus on relevant features dynamically. The methodology is well-structured, with a clear explanation of the noise injection protocols and feature extraction processes, which are crucial for achieving robustness in real-world scenarios.
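A minimal sketch of an attentive temporal pooling layer of the kind described here is shown below: a learned scorer weights each frame of the spectral feature stream and the utterance embedding is the weighted sum. Layer sizes and the 42-dimensional feature stack are assumptions, not the paper's configuration.

```python
# Minimal attentive temporal pooling: a learned scorer assigns a weight to each
# frame and the utterance embedding is the weighted sum over time.
import torch
import torch.nn as nn

class AttentiveTemporalPooling(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim), e.g. stacked MFCC/ZCR/RMSE features
        weights = torch.softmax(self.scorer(frames), dim=1)   # (B, T, 1)
        return (weights * frames).sum(dim=1)                  # (B, feat_dim)

pool = AttentiveTemporalPooling(feat_dim=42)
utt_emb = pool(torch.randn(4, 300, 42))
print(utt_emb.shape)   # torch.Size([4, 42])
```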
The experiments are comprehensive, utilizing multiple benchmark datasets (RAVDESS, TESS, SAVEE, CREMA-D) and a custom noise dataset (SAS-KIIT) to rigorously test the model's performance under various conditions. The results demonstrate superior accuracy and robustness compared to single-branch baselines, particularly in non-stationary noise environments. The ablation study further validates the contributions of each component of the architecture, providing strong evidence for the effectiveness of the proposed approach.
While the paper provides a detailed description of the methodology and experiments, it lacks specific implementation details such as code availability or links to datasets, which are critical for reproducibility. The absence of a demo URL or project repository limits the ability for other researchers to replicate the findings or build upon this work.
One limitation is the reliance on specific datasets, which may not fully represent all real-world acoustic environments. Additionally, while the model shows robustness against various noise types, further testing against a broader range of environmental conditions could provide a more comprehensive evaluation of its generalizability. The paper also does not discuss the computational efficiency of the proposed model, which could be a concern for deployment in real-time applications.
The research has significant implications for applications in human-computer interaction, mental health monitoring, and customer service, where understanding emotional intent from speech is critical. By addressing the robustness and interpretability of SER systems, this work paves the way for more reliable and trustworthy AI applications in sensitive domains.
Achieving robust generalization in speech deepfake detection (SDD) remains a primary challenge, as models often fail to detect unseen forgery methods. While research has focused on model-centric and algorithm-centric solutions, the impact of data composition is often underexplored. This paper proposes a data-centric approach, analyzing the SDD data landscape from two practical perspectives: constructing a single dataset and aggregating multiple datasets. To address the first perspective, we conduct a large-scale empirical study to characterize the data scaling laws for SDD, quantifying the impact of source and generator diversity. To address the second, we propose the Diversity-Optimized Sampling Strategy (DOSS), a principled framework for mixing heterogeneous data with two implementations: DOSS-Select (pruning) and DOSS-Weight (re-weighting). Our experiments show that DOSS-Select outperforms the naive aggregation baseline while using only 3% of the total available data. Furthermore, our final model, trained on a 12k-hour curated data pool using the optimal DOSS-Weight strategy, achieves state-of-the-art performance, outperforming large-scale baselines with greater data and model efficiency on both public benchmarks and a new challenge set of various commercial APIs.
Primary: unknown
All Institutions: unknown
The paper presents a novel data-centric approach to generalizable speech deepfake detection, significantly advancing the field by demonstrating the importance of data composition in model performance. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on future research and practical applications in combating deepfake technologies.
The paper introduces a data-centric approach to speech deepfake detection (SDD) by focusing on data composition rather than solely on model architecture. It proposes the Diversity-Optimized Sampling Strategy (DOSS), which includes two implementations: DOSS-Select for pruning and DOSS-Weight for re-weighting. The empirical study on data scaling laws is well-structured, providing a solid foundation for the proposed methods. The methodology is innovative in its systematic analysis of how source and generator diversity impact model generalization, which is a significant shift from traditional model-centric approaches.
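One plausible, simplified instantiation of diversity-driven pruning in the spirit of DOSS-Select is sketched below: a farthest-point heuristic greedily selects the candidate corpora whose summary embeddings best cover the source/generator space. The embedding construction and selection rule are assumptions, not the paper's algorithm.

```python
# Greedy diversity-driven subset selection over candidate corpora, using a
# farthest-point heuristic on per-dataset summary embeddings. This is only a
# simplified stand-in for DOSS-Select.
import numpy as np

def greedy_diverse_subset(embeddings: np.ndarray, k: int) -> list[int]:
    """embeddings: (n_datasets, d) summary vectors (e.g. mean generator stats)."""
    chosen = [0]                                   # seed with the first dataset
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(dists))                # farthest from everything chosen
        chosen.append(nxt)
        dists = np.minimum(dists,
                           np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return chosen

rng = np.random.default_rng(0)
dataset_embs = rng.normal(size=(20, 16))           # 20 candidate corpora
print(greedy_diverse_subset(dataset_embs, k=5))
```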
The experiments are comprehensive, utilizing a large-scale data pool and multiple datasets to validate the proposed methods. The results demonstrate a clear improvement in generalization performance over naive aggregation methods, with DOSS-Weight achieving state-of-the-art results. The evaluation metrics used (EER and ACC) are appropriate for the task, and the experiments are well-documented, allowing for a clear understanding of the impact of the proposed strategies.
The paper provides sufficient details on the experimental setup, including data generation, model architecture, and training configuration. However, the lack of URLs for code or data repositories limits reproducibility. While the methodology is clear, the absence of a public implementation may hinder other researchers from validating the results independently.
The paper acknowledges its focus on English and Chinese languages, which may limit the generalizability of the findings to other languages. Additionally, the study does not explore the interaction between data composition and model scaling, which could be a valuable area for future research. The reliance on a fixed model architecture may also restrict the applicability of the findings to other architectures.
The findings have significant implications for the development of robust speech deepfake detection systems, especially as synthetic speech technologies continue to evolve. By emphasizing data composition, this work encourages future research to explore data-centric approaches in other domains of machine learning, potentially leading to more generalizable models across various applications.
This paper proposes a data-centric approach to generalization in speech deepfake detection. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the innovative nature of the DOSS framework and its potential to advance the state of the art in SDD.
The paper introduces a data-centric approach to speech deepfake detection (SDD) that emphasizes the importance of data composition over traditional model-centric strategies. The proposed Diversity-Optimized Sampling Strategy (DOSS) is innovative, offering two distinct implementations (DOSS-Select and DOSS-Weight) that effectively manage heterogeneous data mixtures. The methodology is well-structured, with clear definitions of the scaling laws and empirical studies that validate the approach. However, the paper could benefit from a more detailed discussion on the theoretical underpinnings of the DOSS framework and its adaptability to different contexts beyond SDD.
The experiments are comprehensive, involving a large-scale empirical study that quantifies the impact of source and generator diversity on model performance. The results demonstrate that DOSS-Select and DOSS-Weight significantly outperform naive aggregation methods, achieving state-of-the-art performance with reduced data usage. The evaluation metrics are robust, including Equal Error Rate (EER) and Accuracy (ACC), which provide a balanced view of model performance. However, the paper lacks a thorough comparison with other contemporary methods that also utilize data-centric approaches.
The paper provides sufficient implementation details, including data generation methods, model architecture, and training configurations, which enhance reproducibility. However, the absence of a publicly available code repository or dataset limits the ability for others to replicate the results fully. Including links to datasets or code would significantly improve this aspect.
The primary limitation noted is the focus on English and Chinese datasets, which may restrict the model's generalization to other languages. Additionally, the study does not explore the interplay between data scaling laws and model architecture variations, which could provide insights into optimizing performance across different settings. The authors also acknowledge that their findings are based on a fixed model architecture, which may not capture the full potential of the proposed data-centric strategies.
The findings of this research have significant implications for the field of speech synthesis and detection, particularly in enhancing the robustness of detection systems against evolving deepfake technologies. The data-centric approach could be applied to other domains where data composition plays a critical role in model performance, potentially leading to more efficient and effective machine learning systems.
Hierarchical representations provide powerful and principled approaches for analyzing many musical genres. Such representations have been broadly studied in music theory, for instance via Schenkerian analysis (SchA). Hierarchical music analyses, however, are highly cost-intensive; the analysis of a single piece of music requires a great deal of time and effort from trained experts. The representation of hierarchical analyses in a computer-readable format is a further challenge. Given recent developments in hierarchical deep learning and increasing quantities of computer-readable data, there is great promise in extending such work for an automatic hierarchical representation framework. This paper thus introduces a novel approach, AutoSchA, which extends recent developments in graph neural networks (GNNs) for hierarchical music analysis. AutoSchA features three key contributions: 1) a new graph learning framework for hierarchical music representation, 2) a new graph pooling mechanism based on node isolation that directly optimizes learned pooling assignments, and 3) a state-of-the-art architecture that integrates such developments for automatic hierarchical music analysis. We show, in a suite of experiments, that AutoSchA performs comparably to human experts when analyzing Baroque fugue subjects.
Primary: unknown
All Institutions: unknown
This paper introduces AutoSchA, a novel framework for automatic hierarchical music representation using graph neural networks, demonstrating significant potential for advancing music analysis and theory through machine learning techniques. The innovative approach and promising experimental results position this work as a meaningful contribution to the intersection of music and machine learning.
The methodology presented in this paper is innovative, particularly in its application of graph neural networks (GNNs) to hierarchical music representation. The introduction of a novel graph pooling mechanism based on node isolation is a significant advancement, allowing for adaptive node removal that reflects the structural importance of musical notes. The authors effectively frame Schenkerian analysis as a graph pooling problem, which is a unique approach in the context of music analysis. However, the paper could benefit from a clearer exposition of the theoretical underpinnings of the proposed methods, particularly in relation to existing GNN techniques.
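To illustrate the general idea of pooling by node isolation, the sketch below scores note-nodes, removes the lowest-scoring ones, and bridges their neighbors so the reduced graph keeps a structural skeleton; this is a simplified stand-in for the learned pooling assignments in AutoSchA, with random scores in place of a GNN.

```python
# Rough sketch of pooling by node isolation on an adjacency matrix: score every
# node, drop the lowest-scoring ones, and reconnect their neighbors so the
# reduced graph keeps the surviving structural skeleton.
import numpy as np

def isolate_pool(adj: np.ndarray, scores: np.ndarray, keep_ratio: float = 0.5):
    n = adj.shape[0]
    keep = np.argsort(scores)[::-1][: max(1, int(n * keep_ratio))]
    drop = np.setdiff1d(np.arange(n), keep)
    adj = adj.copy()
    for v in drop:                       # bridge the neighbors of each dropped node
        nbrs = np.flatnonzero(adj[v])
        for i in nbrs:
            for j in nbrs:
                if i != j:
                    adj[i, j] = 1
        adj[v, :] = 0
        adj[:, v] = 0
    reduced = adj[np.ix_(sorted(keep), sorted(keep))]
    return reduced, sorted(keep)

rng = np.random.default_rng(1)
A = (rng.random((6, 6)) > 0.6).astype(int)
A = np.triu(A, 1)
A = A + A.T                                       # undirected, no self-loops
node_scores = rng.random(6)                       # e.g. learned importance
pooled, kept = isolate_pool(A, node_scores)
print(kept, pooled.shape)
```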
The experimental design is robust, utilizing a well-defined dataset of Schenkerian analyses and comparing the proposed model against multiple baseline methods, including both non-graph and graph-based approaches. The results indicate that AutoSchA performs comparably to human experts, which is a strong validation of the model's effectiveness. However, the paper lacks detailed statistical analysis of the results, such as confidence intervals or significance testing, which would strengthen the claims made regarding performance.
The paper provides a GitHub repository for supplemental material, which is a positive step towards reproducibility. However, the main text lacks detailed implementation instructions, hyperparameter settings, and specific data preprocessing steps that would be necessary for full reproducibility. Including these details would greatly enhance the paper's utility for other researchers.
One limitation of the study is the reliance on a relatively small dataset of Schenkerian analyses, which may affect the generalizability of the findings. Additionally, while the model shows promise, it still exhibits a higher number of perceived awkward components compared to human analyses, indicating that there is room for improvement in the model's understanding of musical structure.
The potential applications of this research are significant, particularly in the fields of music theory, education, and AI-assisted music composition. By providing a framework for automatic hierarchical music analysis, the work could facilitate new tools for music educators and composers, enhancing the understanding and creation of complex musical structures. The integration of AI in music analysis also raises interesting questions about the role of human expertise in creative domains.
Robust Voice Activity Detection (VAD) remains a challenging task, especially under noisy, diverse, and unseen acoustic conditions. Beyond algorithmic development, a key limitation in advancing VAD research is the lack of large-scale, systematically controlled, and publicly available datasets. To address this, we introduce LibriVAD - a scalable open-source dataset derived from LibriSpeech and augmented with diverse real-world and synthetic noise sources. LibriVAD enables systematic control over speech-to-noise ratio, silence-to-speech ratio (SSR), and noise diversity, and is released in three sizes (15 GB, 150 GB, and 1.5 TB) with two variants (LibriVAD-NonConcat and LibriVAD-Concat) to support different experimental setups. We benchmark multiple feature-model combinations, including waveform, Mel-Frequency Cepstral Coefficients (MFCC), and Gammatone filter bank cepstral coefficients, and introduce the Vision Transformer (ViT) architecture for VAD. Our experiments show that ViT with MFCC features consistently outperforms established VAD models such as boosted deep neural network and convolutional long short-term memory deep neural network across seen, unseen, and out-of-distribution (OOD) conditions, including evaluation on the real-world VOiCES dataset. We further analyze the impact of dataset size and SSR on model generalization, experimentally showing that scaling up dataset size and balancing SSR noticeably and consistently enhance VAD performance under OOD conditions. All datasets, trained models, and code are publicly released to foster reproducibility and accelerate progress in VAD research.
Primary: Aalborg University
All Institutions: Aalborg University, Pioneer Centre for AI, IIIT SriCity, Zoom Communications Inc., Massachusetts Institute of Technology
The paper introduces LibriVAD, a scalable open-source dataset for voice activity detection, and benchmarks a novel application of the Vision Transformer architecture, demonstrating significant improvements over existing models. The comprehensive methodology and rigorous experimental evaluation position this work as a meaningful contribution to the field of machine learning and audio processing.
The paper presents a comprehensive methodology for creating the LibriVAD dataset, which includes systematic control over various acoustic conditions, such as SNR and SSR. The introduction of the Vision Transformer (ViT) for VAD tasks is a significant methodological innovation, leveraging its ability to model long-range dependencies. The feature extraction methods are well-defined, and the benchmarking against established models provides a solid foundation for evaluating the proposed approach.
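A toy version of the MFCC-plus-Transformer pairing highlighted here is sketched below: MFCC frames are projected to tokens and a small Transformer encoder emits per-frame speech probabilities. The layer sizes and the tiny encoder are placeholders for the paper's ViT configuration.

```python
# MFCC frames -> token projection -> small Transformer encoder -> per-frame
# speech/non-speech probability. Random audio stands in for real recordings.
import numpy as np
import torch
import torch.nn as nn
import librosa

sr = 16000
wave = np.random.randn(sr * 2).astype(np.float32)            # 2 s of fake audio
mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=40)         # (40, frames)
tokens = torch.from_numpy(mfcc.T).float().unsqueeze(0)        # (1, frames, 40)

class TinyTransformerVAD(nn.Module):
    def __init__(self, n_mfcc=40, d_model=64):
        super().__init__()
        self.proj = nn.Linear(n_mfcc, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        h = self.encoder(self.proj(x))
        return torch.sigmoid(self.head(h)).squeeze(-1)        # (B, frames)

model = TinyTransformerVAD()
speech_prob = model(tokens)
print(speech_prob.shape)
```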
The experiments are rigorously designed, utilizing multiple dataset sizes and configurations to assess the performance of the ViT architecture against conventional VAD models. The results demonstrate the effectiveness of the proposed dataset and model, with clear metrics provided (AUC, EER, MinDCF) that validate the findings. The evaluation on the VOiCES dataset further enhances the robustness of the results.
The authors have made a commendable effort to ensure reproducibility by publicly releasing the dataset, trained models, and code. This transparency is crucial for fostering further research in VAD and related fields. The detailed experimental setup and evaluation metrics also contribute to the reproducibility of the results.
While the dataset is extensive, it may still be limited in terms of diversity across different languages, accents, and speech styles, which could affect the generalizability of the VAD systems. Additionally, the performance on out-of-distribution data, while promising, may require further validation across a wider range of real-world conditions.
The introduction of LibriVAD has the potential to significantly advance research in voice activity detection and related areas of speech processing. By providing a large-scale, publicly available dataset, it opens avenues for developing more robust VAD systems that can operate effectively in diverse acoustic environments. This could lead to improvements in applications such as automatic speech recognition, speaker identification, and other speech-related technologies.
Speech enhancement methods are commonly believed to improve the performance of automatic speech recognition (ASR) in noisy environments. However, the effectiveness of these techniques cannot be taken for granted in the case of modern large-scale ASR models trained on diverse, noisy data. We present a systematic evaluation of MetricGAN-plus-voicebank denoising on four state-of-the-art ASR systems: OpenAI Whisper, NVIDIA Parakeet, Google Gemini Flash 2.0, and Parrotlet-a, using 500 medical speech recordings under nine noise conditions. ASR performance is measured using semantic WER (semWER), a normalized word error rate (WER) metric accounting for domain-specific normalizations. Our results reveal a counterintuitive finding: speech enhancement preprocessing degrades ASR performance across all noise conditions and models. Original noisy audio achieves lower semWER than enhanced audio in all 40 tested configurations (4 models x 10 conditions), with degradations ranging from 1.1% to 46.6% absolute semWER increase. These findings suggest that modern ASR models possess sufficient internal noise robustness and that traditional speech enhancement may remove acoustic features critical for ASR. For practitioners deploying medical scribe systems in noisy clinical environments, our results indicate that preprocessing audio with noise reduction techniques may be not only computationally wasteful but also potentially harmful to transcription accuracy.
Primary: EkaCare
All Institutions: EkaCare
This paper presents a critical examination of the effects of speech enhancement on modern ASR systems in medical contexts, revealing that such preprocessing can degrade performance rather than improve it. The comprehensive methodology and significant findings challenge existing paradigms in the field, highlighting the need for further investigation into the relationship between enhancement techniques and ASR performance.
The methodology is robust, employing a systematic evaluation of the MetricGAN-plus-voicebank denoising technique across four state-of-the-art ASR systems. The authors clearly define their research question, null hypothesis, and the noise conditions under which the experiments are conducted. The use of a well-defined dataset of medical recordings and the application of semantic WER (semWER) as a performance metric tailored for the medical domain enhances the study's relevance and rigor. However, the study is limited to a single enhancement method, which may not capture the full spectrum of speech enhancement techniques.
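A minimal sketch of a semWER-style computation follows: both reference and hypothesis pass through domain normalizations before an ordinary edit-distance WER. The normalization rules (lowercasing, punctuation stripping, a small abbreviation map) are hypothetical examples, not the paper's full rule set.

```python
# semWER-style metric: normalize reference and hypothesis, then compute WER by
# edit distance. The abbreviation map below is a hypothetical example.
import re

NORMALIZE = {"b.p.": "blood pressure", "mg": "milligrams"}   # hypothetical rules

def normalize(text: str) -> list[str]:
    text = text.lower()
    for k, v in NORMALIZE.items():
        text = text.replace(k, v)
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

def wer(ref: list[str], hyp: list[str]) -> float:
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(1, len(ref))

ref = normalize("The patient's B.P. is 120 over 80.")
hyp = normalize("the patients blood pressure is 120 over 80")
print(f"semWER = {wer(ref, hyp):.2f}")
```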
The experiments are comprehensive, involving 500 medical recordings under nine noise conditions, leading to a total of 40 configurations tested. The results consistently show that speech enhancement degrades ASR performance, which is a significant finding. The detailed analysis of results across different noise types and ASR models provides valuable insights into the interaction between noise and enhancement techniques. However, the reliance on synthetic noise may limit the generalizability of the findings to real-world scenarios.
The authors have made their evaluation code, dataset, and detailed results publicly available, which greatly enhances reproducibility. The clear description of the methodology and the resources provided allow other researchers to replicate the study effectively.
The study's limitations include the focus on a single enhancement method, the use of synthetic noise rather than real-world recordings, and a relatively small dataset size of 500 recordings. Additionally, the findings may not generalize to other ASR systems or domains outside of medical speech recognition.
The implications of this work are significant for practitioners in the field of medical ASR, suggesting that traditional speech enhancement techniques may be counterproductive in certain contexts. This challenges long-held assumptions about preprocessing in ASR and encourages a reevaluation of enhancement methods in deployment scenarios. The findings could lead to more effective ASR systems in noisy environments, ultimately improving clinical documentation processes.
This paper presents a lightweight text-to-speech (TTS) system developed for the WildSpoof Challenge TTS Track. Our approach fine-tunes the recently released open-weight TTS model, Supertonic (https://github.com/supertone-inc/supertonic), with Self-Purifying Flow Matching (SPFM) to enable robust adaptation to in-the-wild speech. SPFM mitigates label noise by comparing conditional and unconditional flow matching losses on each sample, routing suspicious text-speech pairs to unconditional training while still leveraging their acoustic information. The resulting model achieves the lowest Word Error Rate (WER) among all participating teams, while ranking second in perceptual metrics such as UTMOS and DNSMOS. These findings demonstrate that efficient, open-weight architectures like Supertonic can be effectively adapted to diverse real-world speech conditions when combined with explicit noise-handling mechanisms such as SPFM.
Primary: Supertonic Inc.
All Institutions: Supertonic Inc., XYZ agency
This paper presents a novel approach to robust TTS training through Self-Purifying Flow Matching, demonstrating significant improvements in performance metrics within a challenging dataset. The combination of innovative methodology and solid experimental results positions this work as a meaningful contribution to the field of machine learning and TTS systems.
The proposed methodology, Self-Purifying Flow Matching (SPFM), is a notable advancement in handling label noise in TTS systems. By dynamically routing suspicious text-speech pairs to unconditional training, the authors effectively mitigate the impact of mislabeled data, which is a common issue in real-world datasets. The integration of SPFM with the Supertonic architecture demonstrates a thoughtful approach to improving robustness without significantly increasing computational complexity. The paper provides a clear explanation of the flow matching losses and the rationale behind the SPFM mechanism, which enhances the understanding of its operational principles.
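The routing rule at the heart of SPFM can be sketched as below: per-sample conditional and unconditional flow-matching losses are compared, and pairs whose transcript does not help prediction are trained unconditionally. The margin and the dummy loss values are illustrative; the actual criterion sits inside Supertonic's training loop.

```python
# SPFM-style routing: keep the text condition only when conditioning actually
# lowers the per-sample flow-matching loss; otherwise fall back to the
# unconditional loss for that sample.
import torch

def spfm_route(cond_loss: torch.Tensor, uncond_loss: torch.Tensor,
               margin: float = 0.0) -> torch.Tensor:
    """Returns a boolean mask: True = keep the text condition for this sample."""
    # If the transcript does not reduce the loss by at least `margin`, the
    # text-speech pair is treated as suspicious and trained unconditionally.
    return cond_loss + margin < uncond_loss

cond = torch.tensor([0.21, 0.35, 0.90, 0.18])     # per-sample conditional FM loss
uncond = torch.tensor([0.40, 0.38, 0.41, 0.33])   # per-sample unconditional FM loss
keep_text = spfm_route(cond, uncond)
print(keep_text)          # tensor([ True,  True, False,  True])
# Final loss: conditional term where keep_text is True, unconditional otherwise.
loss = torch.where(keep_text, cond, uncond).mean()
print(loss.item())
```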
The experimental setup is well-structured, utilizing both the TITW-easy and TITW-hard datasets to ensure a balanced training regime. The reported results, including the lowest WER and competitive perceptual metrics, validate the effectiveness of the proposed method. However, the paper could benefit from more detailed statistical analysis of the results and comparisons with baseline models to further substantiate the claims of improvement.
The paper mentions the use of a publicly available checkpoint for Supertonic and provides a clear description of the training parameters and setup. However, it lacks specific details on the implementation of SPFM, such as hyperparameter choices and the exact computational resources used, which could hinder reproducibility. Including a link to the code repository is a positive aspect, but more comprehensive documentation would be beneficial.
One limitation is the reliance on the quality of the initial Supertonic model, which may affect the overall performance if the base model has inherent flaws. Additionally, while SPFM shows promise in mitigating label noise, the effectiveness of this approach in more diverse or extreme noisy conditions remains to be tested. The paper does not discuss potential scalability issues or the impact of varying dataset sizes on the performance of the model.
The findings of this research have significant implications for the development of TTS systems in real-world applications, particularly in environments where data quality is variable. The approach could be applied to other domains where label noise is prevalent, potentially improving the robustness of machine learning models across various applications. This work contributes to the ongoing efforts to make TTS systems more accessible and effective in diverse settings.
General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audio, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of SAM Audio, a foundation model for general audio separation that integrates multiple prompting modalities within a diffusion transformer architecture. This work represents a significant advancement in audio source separation, showcasing innovative methodologies and robust experimental results that could influence future research in multimodal AI systems.
The methodology presented in SAM Audio is noteworthy for its integration of multiple prompting modalities (text, visual, and temporal) within a single framework. The use of a diffusion transformer architecture is innovative, particularly in the context of audio source separation, which has traditionally relied on more conventional neural network architectures. The flow matching training approach is also a significant contribution, as it allows for the model to be trained effectively on large-scale audio datasets. However, the paper could benefit from a more detailed explanation of the training process and the specific design choices made in the architecture.
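A toy flow-matching training step of the kind described above is sketched below: noise and the clean target are linearly interpolated and a network regresses the straight-line velocity, conditioned on a prompt embedding. The tiny MLP and conditioning vector are placeholders for the diffusion transformer and multimodal prompts.

```python
# One flow-matching training step on toy tensors: interpolate between noise and
# the target, and regress the predicted velocity onto the straight-line velocity.
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    def __init__(self, dim=32, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 128),
                                 nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

B, D, C = 8, 32, 16
model = ToyVelocityNet(D, C)
x1 = torch.randn(B, D)                 # target (separated-source latent)
x0 = torch.randn(B, D)                 # noise
cond = torch.randn(B, C)               # prompt embedding (text/visual/span)
t = torch.rand(B, 1)

x_t = (1 - t) * x0 + t * x1            # linear interpolation path
target_v = x1 - x0                     # constant velocity along the path
loss = nn.functional.mse_loss(model(x_t, t, cond), target_v)
loss.backward()
print(loss.item())
```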
The experimental evaluation is robust, with the authors demonstrating state-of-the-art performance across a diverse set of benchmarks. The introduction of a new real-world separation benchmark with human-labeled multimodal prompts is a strong point, as it enhances the relevance of the evaluation metrics. The results indicate substantial improvements over both general-purpose and specialized systems, which speaks to the effectiveness of the proposed model. However, additional details on the datasets used and the specific metrics for evaluation would enhance the credibility of the results.
The paper lacks sufficient details regarding the implementation of the model, which raises concerns about reproducibility. While the results are promising, the absence of a publicly available code repository or detailed algorithmic descriptions makes it difficult for other researchers to replicate the findings or build upon this work.
One limitation of the study is the potential overfitting to the specific datasets used for training and evaluation, which could affect the generalizability of the model. Additionally, while the model supports multiple prompting modalities, the paper does not thoroughly explore the limitations of each modality or the conditions under which the model may fail to perform optimally.
The implications of SAM Audio are significant, particularly in the context of multimodal AI systems. The ability to separate audio sources based on various prompts can enhance applications in fields such as music production, film editing, and assistive technologies for the hearing impaired. The model's versatility could lead to advancements in how machines understand and interact with audio in a more human-like manner.
Movie dubbing seeks to synthesize speech from a given script using a specific voice, while ensuring accurate lip synchronization and emotion-prosody alignment with the character's visual performance. However, existing alignment approaches based on visual features face two key limitations: (1) they rely on complex, handcrafted visual preprocessing pipelines, including facial landmark detection and feature extraction; and (2) they generalize poorly to unseen visual domains, often resulting in degraded alignment and dubbing quality. To address these issues, we propose InstructDubber, a novel instruction-based alignment dubbing method for both robust in-domain and zero-shot movie dubbing. Specifically, we first feed the video, script, and corresponding prompts into a multimodal large language model to generate natural language dubbing instructions regarding the speaking rate and emotion state depicted in the video, which is robust to visual domain variations. Second, we design an instructed duration distilling module to mine discriminative duration cues from speaking rate instructions to predict lip-aligned phoneme-level pronunciation duration. Third, for emotion-prosody alignment, we devise an instructed emotion calibrating module, which finetunes an LLM-based instruction analyzer using ground truth dubbing emotion as supervision and predicts prosody based on the calibrated emotion analysis. Finally, the predicted duration and prosody, together with the script, are fed into the audio decoder to generate video-aligned dubbing. Extensive experiments on three major benchmarks demonstrate that InstructDubber outperforms state-of-the-art approaches across both in-domain and zero-shot scenarios.
Primary: VIPL group
All Institutions: VIPL group
InstructDubber presents a novel approach to zero-shot movie dubbing by integrating instruction-based alignment with multimodal processing, addressing significant challenges in the field. The proposed methodology and experimental validation demonstrate its potential to advance the state-of-the-art in audio-visual synchronization and dubbing quality.
The methodology presented in InstructDubber is innovative, leveraging a multimodal large language model (LLM) to generate dubbing instructions that address the limitations of traditional visual feature-based approaches. The use of an instructed duration distilling module and an instructed emotion calibrating module reflects a thoughtful integration of language processing with audio-visual synchronization, which is a significant advancement in the field of movie dubbing. The approach is well-structured, moving from instruction generation to phoneme-level duration prediction and emotion-prosody alignment, showcasing a comprehensive pipeline that is both novel and effective.
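A sketch of the duration side of such a pipeline follows: a predictor maps phoneme embeddings plus a speaking-rate instruction embedding to per-phoneme durations, and a length regulator expands phoneme features to frame rate. All dimensions and the instruction embedding are illustrative assumptions, not InstructDubber's modules.

```python
# Duration predictor conditioned on a speaking-rate instruction embedding,
# followed by a length regulator that repeats each phoneme for its predicted
# number of frames.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, phon_dim=64, instr_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(phon_dim + instr_dim, 128),
                                 nn.ReLU(), nn.Linear(128, 1))

    def forward(self, phonemes, instr):
        # phonemes: (T_phon, phon_dim); instr: (instr_dim,) speaking-rate embedding
        instr = instr.unsqueeze(0).expand(phonemes.size(0), -1)
        log_dur = self.net(torch.cat([phonemes, instr], dim=-1)).squeeze(-1)
        return torch.clamp(torch.exp(log_dur).round().long(), min=1)

def length_regulate(phonemes, durations):
    # Repeat each phoneme vector by its predicted number of frames.
    return torch.repeat_interleave(phonemes, durations, dim=0)

phon = torch.randn(12, 64)
instr = torch.randn(32)
pred = DurationPredictor()
dur = pred(phon, instr)
frames = length_regulate(phon, dur)
print(dur.sum().item(), frames.shape)
```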
The experiments conducted on three major benchmarks are extensive and demonstrate a clear performance improvement over state-of-the-art methods in both in-domain and zero-shot scenarios. The paper provides sufficient details on the datasets used, evaluation metrics, and comparative results, which support the claims of superior performance. However, the paper could benefit from more detailed analysis on the specific aspects of the benchmarks that highlight the advantages of the proposed method.
While the paper outlines the methodology and experiments, it lacks detailed implementation specifics that would aid in reproducibility. There is no mention of code availability or a repository for others to access the implementation, which is a critical aspect for validating the results. The demo URL provides some insight into the application, but further details on the experimental setup would enhance reproducibility.
One limitation noted is the reliance on the quality of the input video and script, which could affect the overall dubbing quality. Additionally, the model's performance in highly diverse or unconventional visual domains remains untested, which could limit its applicability. The paper does not address potential biases in the LLM's instruction generation, which could impact the emotional and prosodic accuracy of the dubbing.
The potential applications of InstructDubber extend beyond movie dubbing to include video games, virtual reality, and educational content, where accurate dubbing and emotional alignment are crucial. The approach could significantly enhance user experience in multimedia applications, making it a valuable contribution to the field. Furthermore, the integration of LLMs in multimedia processing opens avenues for future research in multimodal AI, potentially influencing various domains within machine learning.
Applying speech super-resolution (SR) to recordings with severely low sampling rates is a critical challenge in digital archiving and investigative audio recovery. In these scenarios, the input lacks essential acoustic cues. Consequently, existing generative models often fail; without sufficient context, they hallucinate phonetic content, guessing words based on probability rather than meaning. To address this, we propose CogSR, a framework designed specifically for high-precision, offline restoration. Our approach shifts the focus from simple signal mapping to cognitive reconstruction. By integrating a Large Audio-Language Model, we employ Chain-of-Thought reasoning to act as a semantic anchor, while explicit acoustic priors ensure the speaker's identity remains consistent. This guides a Rectified Flow backbone to synthesize high-frequency details that are not only realistic but linguistically accurate. Evaluations show that CogSR effectively eliminates ambiguity in severe degradation regimes, making it a robust solution for restoring high-value legacy and surveillance audio.
Primary: National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University
All Institutions: National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, School of Artificial Intelligence, Jianghan University, Xiaomi Corporation
CogSR presents a novel framework for speech super-resolution that effectively combines cognitive reasoning with generative modeling techniques. This innovative approach addresses the critical challenges of restoring high-fidelity audio from low-quality recordings, marking a significant advancement in the field of audio processing and restoration.
The methodology presented in CogSR is innovative, integrating Chain-of-Thought (CoT) reasoning with a Large Audio-Language Model to guide the speech super-resolution process. This cognitive approach is a significant departure from traditional signal mapping techniques, focusing on semantic understanding rather than mere signal reconstruction. The use of explicit acoustic priors further enhances the robustness of the model, ensuring that the generated output maintains speaker identity and natural prosody. The combination of these elements within a Rectified Flow framework demonstrates a thoughtful and comprehensive approach to the challenges of speech restoration in low-quality scenarios.
The experimental setup is rigorous, employing a diverse set of datasets and a well-defined benchmark for evaluation. The authors provide quantitative metrics such as Word Error Rate (WER), Log-Spectral Distance (LSD), and Speaker Similarity (SIM), which effectively showcase the performance improvements of CogSR over existing methods. The inclusion of both objective and subjective evaluations, including a Mean Opinion Score (MOS) study, adds depth to the assessment of the model's capabilities. The results indicate a clear advancement in speech intelligibility and perceptual quality, validating the proposed approach.
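For readers unfamiliar with the metrics, the snippet below computes one common formulation of Log-Spectral Distance between a reference and an estimated waveform; the exact STFT settings and normalization used in the paper may differ.

```python
import numpy as np
import librosa

def log_spectral_distance(ref, est, n_fft=2048, hop_length=512, eps=1e-10):
    """Log-Spectral Distance (LSD) between reference and estimated waveforms.

    A common convention in speech super-resolution work; treat the exact
    normalization as illustrative rather than the paper's definition.
    """
    S_ref = np.abs(librosa.stft(ref, n_fft=n_fft, hop_length=hop_length)) ** 2
    S_est = np.abs(librosa.stft(est, n_fft=n_fft, hop_length=hop_length)) ** 2
    log_diff = np.log10(S_ref + eps) - np.log10(S_est + eps)
    # RMS over frequency, then averaged over frames
    return float(np.mean(np.sqrt(np.mean(log_diff ** 2, axis=0))))
```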
The paper provides sufficient details regarding the implementation, including the architecture, training procedures, and evaluation metrics. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. Clear documentation of the datasets and training configurations would enhance the ability of other researchers to replicate the findings.
While the proposed method shows promise, it may still face challenges in extreme degradation scenarios where even semantic cues are insufficient for accurate reconstruction. Additionally, the reliance on a pre-trained Large Audio-Language Model could introduce limitations in terms of generalizability to other domains or languages. The computational requirements for training and inference may also restrict accessibility for broader applications.
The implications of CogSR are significant, particularly in fields such as digital archiving, forensic audio analysis, and media restoration. By improving the quality of restored audio from severely degraded recordings, this research has the potential to enhance the accessibility of historical content and improve the fidelity of critical audio evidence in investigative contexts. The integration of cognitive reasoning into audio processing could inspire further research into semantic-aware models across various domains.
We present DPDFNet, a causal single-channel speech enhancement model that extends the DeepFilterNet2 architecture with dual-path blocks in the encoder, strengthening long-range temporal and cross-band modeling while preserving the original enhancement framework. In addition, we demonstrate that adding a loss component to mitigate over-attenuation in the enhanced speech, combined with a fine-tuning phase tailored for "always-on" applications, leads to substantial improvements in overall model performance. To compare our proposed architecture with a variety of causal open-source models, we created a new evaluation set comprising long, low-SNR recordings in 12 languages across everyday noise scenarios, better reflecting real-world conditions than commonly used benchmarks. On this evaluation set, DPDFNet delivers superior performance to other causal open-source models, including some that are substantially larger and more computationally demanding. We also propose a holistic metric named PRISM, a composite, scale-normalized aggregate of intrusive and non-intrusive metrics, which demonstrates clear scalability with the number of dual-path blocks. We further demonstrate on-device feasibility by deploying DPDFNet on Ceva-NeuPro-Nano edge NPUs. Results indicate that DPDFNet-4, our second-largest model, achieves real-time performance on NPN32 and runs even faster on NPN64, confirming that state-of-the-art quality can be sustained within strict embedded power and latency constraints.
Primary: Ceva Inc.
All Institutions: Ceva Inc.
The main contribution of this paper is the development of DPDFNet, a novel speech enhancement model that effectively integrates dual-path recurrent neural networks into the existing DeepFilterNet2 architecture, achieving superior performance in real-world noisy environments. This work significantly advances the state-of-the-art in single-channel speech enhancement by addressing critical challenges in real-time processing and model robustness.
The methodology presented in DPDFNet is robust, extending the DeepFilterNet2 architecture by integrating dual-path recurrent neural networks (DPRNN) to enhance long-range temporal and cross-band modeling. The addition of an over-attenuation loss function and a fine-tuning phase specifically designed for "always-on" applications demonstrates a thoughtful approach to addressing common challenges in speech enhancement. The paper also introduces a new evaluation set that better reflects real-world conditions, which is a significant improvement over existing benchmarks.
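The paper's exact loss formulation is not reproduced here, but an over-attenuation penalty is often written as an asymmetric spectral term that only activates when the enhanced magnitude falls below the clean target, as in the following sketch.

```python
import torch

def over_attenuation_loss(enhanced_mag, clean_mag):
    """Asymmetric spectral penalty that fires only when the enhanced
    magnitude drops below the clean target, i.e. when speech is being
    suppressed rather than noise. One plausible formulation, not
    necessarily the exact loss used in DPDFNet.
    """
    deficit = torch.relu(clean_mag - enhanced_mag)  # positive only where speech energy was lost
    return deficit.pow(2).mean()

# Typical usage (weight is an assumption):
# total_loss = base_enhancement_loss + lambda_oa * over_attenuation_loss(enh_mag, clean_mag)
```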
The experimental evaluation is comprehensive, utilizing a new multilingual low-SNR dataset that includes recordings in 12 languages and various everyday noise scenarios. The results indicate that DPDFNet consistently outperforms existing models, including larger and more computationally demanding alternatives. The introduction of the PRISM metric provides a holistic view of model performance, integrating both intrusive and non-intrusive measures effectively.
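Since PRISM is described as a composite, scale-normalized aggregate, a minimal illustration is given below; the metric set, value ranges, and equal weighting are assumptions for exposition rather than the paper's exact recipe.

```python
import numpy as np

def prism_like_score(metrics):
    """Scale-normalized aggregate of heterogeneous quality metrics.

    `metrics` maps a metric name to (value, lo, hi, higher_is_better).
    The ranges, equal weighting, and metric set are illustrative
    assumptions; the paper defines the actual PRISM recipe.
    """
    normalized = []
    for name, (value, lo, hi, higher_is_better) in metrics.items():
        z = np.clip((value - lo) / (hi - lo), 0.0, 1.0)
        normalized.append(z if higher_is_better else 1.0 - z)
    return float(np.mean(normalized))

score = prism_like_score({
    "pesq":   (2.8, 1.0, 4.5, True),     # intrusive
    "dnsmos": (3.1, 1.0, 5.0, True),     # non-intrusive
    "si_sdr": (12.0, -10.0, 30.0, True)  # intrusive, dB scale
})
```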
The paper provides sufficient implementation details, including training protocols, dataset descriptions, and evaluation metrics, which facilitate reproducibility. The availability of code and pretrained models on GitHub further supports this aspect, allowing other researchers to replicate the findings.
One limitation is that while the paper addresses over-attenuation, it does not explore the potential trade-offs between computational efficiency and enhancement quality in depth. Additionally, the largest model (DPDFNet-8) does not meet real-time constraints, which may limit its applicability in certain scenarios.
The advancements made in DPDFNet have significant implications for real-time speech enhancement applications, particularly in mobile and embedded systems where computational resources are limited. The ability to deploy high-quality models on edge devices could enhance user experiences in various applications, such as telecommunication, virtual assistants, and hearing aids.
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai), the Zhipu Qingyan app/web (chatglm.cn).
Primary: Zhipu AI
All Institutions: Zhipu AI, Tsinghua University
GLM-TTS presents a robust framework for efficient and high-quality text-to-speech synthesis, effectively addressing critical challenges in the field. The innovative use of reinforcement learning and hybrid input mechanisms positions it as a significant contribution to advancing TTS technology, particularly for languages with complex phonetic structures.
The methodology of GLM-TTS is well-structured, utilizing a two-stage architecture that effectively combines autoregressive and diffusion models for TTS. The introduction of a multi-reward reinforcement learning framework is particularly innovative, addressing common challenges in TTS systems such as pronunciation accuracy and emotional expressiveness. The use of a hybrid phoneme-text input scheme and optimized speech tokenizer enhances the system's controllability and adaptability, especially for languages with complex phonetic structures like Chinese. The detailed data processing pipeline and the enhancements made to the speech tokenizer demonstrate a thorough understanding of the underlying challenges in TTS.
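As a rough illustration of the multi-reward GRPO setup, the sketch below combines pronunciation, speaker-similarity, and prosody rewards for a group of candidate utterances sampled from the same prompt and converts them into group-relative advantages; the weights and reward values are placeholders, not GLM-TTS's actual configuration.

```python
import numpy as np

def grpo_advantages(pron_scores, sim_scores, prosody_scores,
                    w=(1.0, 1.0, 1.0), eps=1e-6):
    """Group-relative advantages for K candidate utterances from one prompt.

    The three reward streams and the equal weights are placeholders; the
    paper combines pronunciation, speaker similarity, and prosody rewards,
    but the exact weighting is not reproduced here.
    """
    rewards = (w[0] * np.asarray(pron_scores)
               + w[1] * np.asarray(sim_scores)
               + w[2] * np.asarray(prosody_scores))
    # GRPO-style normalization: advantage relative to the group's own statistics
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 sampled readings of one prompt, scored by external reward models.
adv = grpo_advantages([0.9, 0.7, 0.95, 0.6], [0.8, 0.85, 0.7, 0.9], [0.6, 0.7, 0.8, 0.5])
```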
The experiments conducted are comprehensive, comparing GLM-TTS against state-of-the-art models across various benchmarks. The results indicate that GLM-TTS achieves competitive performance with significantly less training data, which is a notable achievement. The evaluation metrics used, including CER, WER, and SIM, provide a clear picture of the system's capabilities. However, the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper provides a link to the code repository and demo, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific datasets used are somewhat limited. More explicit information on the experimental setup would enhance reproducibility.
One limitation is the reliance on proprietary datasets, which may hinder the generalizability of the results. Additionally, while the system shows promise in emotional expressiveness, the paper acknowledges that the performance may vary across different emotional contexts, indicating potential areas for improvement. The complexity of the model may also pose challenges for deployment in resource-constrained environments.
The GLM-TTS system has significant implications for various applications, including virtual assistants, educational tools, and content creation. Its ability to generate high-fidelity, expressive speech with reduced training data makes it accessible for low-resource scenarios, potentially democratizing TTS technology. The focus on controllability and customization also opens avenues for personalized applications in diverse linguistic contexts.
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations, and (ii) audio-audio contrastive learning, via a Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
Primary: Indian Institute of Technology Hyderabad
All Institutions: Indian Institute of Technology Hyderabad
The paper introduces a novel joint multimodal contrastive learning framework for robust spoken term detection and keyword spotting, demonstrating significant improvements over existing methods. The comprehensive methodology and rigorous experimental evaluation highlight its potential impact on the field of audio processing and machine learning.
The proposed joint multimodal contrastive learning framework effectively integrates audio and text modalities into a unified embedding space, addressing significant limitations of existing Acoustic Word Embedding (AWE) methods. The dual optimization of audio-text and audio-audio contrastive learning is innovative, leveraging the strengths of both modalities while enhancing intra-class compactness and inter-class separation. The methodology is well-structured, with clear explanations of the loss functions and training regime, although further details on hyperparameter tuning could enhance clarity.
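A minimal sketch of such a joint objective is shown below: a symmetric CLAP-style audio-text InfoNCE term combined with a generic supervised audio-audio contrastive term standing in for the DWD loss (whose exact form is not reproduced here). The temperature and loss weighting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clap_style_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric audio-text InfoNCE over a batch of matched pairs."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

def audio_audio_contrastive(audio_emb, word_ids, temperature=0.07):
    """Generic supervised contrastive term pulling together embeddings of the
    same spoken word; used here as a stand-in for the paper's DWD loss."""
    a = F.normalize(audio_emb, dim=-1)
    sim = a @ a.T / temperature
    same = (word_ids.unsqueeze(0) == word_ids.unsqueeze(1)).float()
    not_self = ~torch.eye(len(word_ids), dtype=torch.bool, device=a.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(~not_self, -1e9), dim=1, keepdim=True)
    pos = same * not_self
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1)

# Joint objective (weight is an assumption):
# loss = clap_style_loss(a_emb, t_emb) + 0.5 * audio_audio_contrastive(a_emb, word_ids)
```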
The experiments are robust, utilizing the LibriSpeech corpus to evaluate the proposed model against multiple baselines. The performance metrics, including Average Precision (AP) and Equal Error Rates (EER), provide a comprehensive view of the model's capabilities in both Spoken Term Detection (STD) and Keyword Spotting (KWS). The results demonstrate consistent improvements over existing methods, particularly in challenging conditions, which underscores the effectiveness of the proposed approach.
The authors emphasize reproducibility by releasing a standardized evaluation framework and the trial generation recipe alongside their codebase. This commitment to transparency is commendable and facilitates further research in the field. However, more detailed documentation on the training process and hyperparameter settings would be beneficial for full reproducibility.
While the paper presents a significant advancement, it does not extensively discuss the potential computational costs associated with the proposed model, particularly in real-time applications. Additionally, the reliance on the LibriSpeech dataset may limit the generalizability of the findings to other languages or dialects.
The proposed framework has the potential to significantly improve spoken content retrieval systems, making them more robust to variations in speaker and background noise. This advancement could enhance accessibility in various applications, such as voice-activated systems and automated transcription services, thereby contributing to the broader adoption of speech technologies.
Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
Primary: South China University of Technology
All Institutions: South China University of Technology, Guangdong Provincial Key Laboratory of AI Large Model and Intelligent Cognition, Engineering Research Centre of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human
The paper presents a novel framework for music emotion recognition that combines a large-scale expert-annotated dataset with an innovative dual-view adaptive learning method. This work significantly contributes to addressing the challenges of data scarcity and feature drift in the field, showcasing the potential for improved emotion recognition in music.
The proposed methodology introduces a comprehensive framework for music emotion recognition, addressing critical challenges such as data scarcity and feature drift. The use of a large-scale, expert-annotated dataset (Memo2496) is a significant advancement, ensuring high-quality annotations through rigorous protocols. The Dual-View Adaptive Music Emotion Recogniser (DAMER) employs innovative modules like Dual Stream Attention Fusion (DSAF) for effective feature interaction, Progressive Confidence Labelling (PCL) for reliable pseudo-label generation, and Style Anchored Memory Learning (SAML) to mitigate cross-track feature drift. This multi-faceted approach demonstrates a thoughtful integration of various techniques, enhancing the robustness of the model.
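To illustrate the Progressive Confidence Labelling idea, the sketch below accepts a pseudo-label only when the Jensen-Shannon divergence between the two views' predictions falls under a curriculum-scheduled threshold; the schedule and threshold values are assumptions, not the paper's settings.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def accept_pseudo_label(mel_view_probs, cochlea_view_probs, epoch, max_epoch,
                        start_tau=0.5, end_tau=0.1):
    """Accept an unlabeled clip only if the two views agree; the threshold
    tightens over training as a simple curriculum. The schedule and the
    threshold values are illustrative assumptions."""
    tau = start_tau + (end_tau - start_tau) * (epoch / max_epoch)
    return js_divergence(mel_view_probs, cochlea_view_probs) < tau
```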
The experiments conducted on multiple datasets (Memo2496, 1000songs, and PMEmo) validate the effectiveness of the DAMER framework. The reported improvements in arousal dimension accuracy across different datasets underscore the model's generalizability and robustness. The ablation studies provide valuable insights into the contributions of each module, reinforcing the significance of the proposed methods. However, the paper could benefit from more extensive comparisons with a wider range of contemporary methods to fully contextualize its contributions.
The paper provides a clear description of the dataset, methodology, and experimental setup, which facilitates reproducibility. The availability of the dataset and source code on Figshare is a positive aspect, promoting transparency and enabling other researchers to replicate the findings. However, the paper lacks detailed hyperparameter settings and training configurations, which could further enhance reproducibility.
While the paper addresses several key challenges in music emotion recognition, it does not thoroughly discuss potential limitations of the proposed methods. For instance, the reliance on expert annotators, while beneficial for quality, may introduce biases that could affect the generalizability of the dataset. Additionally, the performance improvements, although significant, may not be sufficient for real-world applications where diverse and complex emotional expressions in music are encountered.
The advancements in music emotion recognition have potential applications in various fields, including personalized music recommendation systems, mental health interventions, and immersive entertainment experiences. The introduction of a high-quality dataset and a robust recognition framework can significantly enhance the accuracy and reliability of emotion-based applications in music, contributing to the broader field of affective computing.
Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
Primary: South China University of Technology
All Institutions: South China University of Technology, Guangdong Provincial Key Laboratory of AI Large Model and Intelligent Cognition, Engineering Research Centre of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human
This paper presents a significant contribution to the field of music emotion recognition through the introduction of the Memo2496 dataset and the DAMER framework. The innovative methodologies employed address critical challenges in the domain, paving the way for future advancements in affective computing and machine learning applications in music.
The paper introduces a comprehensive framework for music emotion recognition that includes the Memo2496 dataset and the DAMER architecture. The methodology is robust, employing a dual-stream attention mechanism that integrates Mel spectrograms and cochleagrams, enhancing feature fusion through cross-attention. The Progressive Confidence Labelling module effectively addresses the challenges of pseudo-labeling in semi-supervised learning, while the Style-Anchored Memory Learning module mitigates feature drift across different musical styles. The combination of these methods represents a significant advancement in the field, particularly in addressing issues of annotation quality and feature representation.
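A minimal sketch of the dual-stream cross-attention idea is given below, with each stream querying the other's tokens before a simple fused projection; the dimensions, head count, and pooling/fusion step are assumptions rather than DAMER's exact design.

```python
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    """Bidirectional cross-attention between Mel-spectrogram tokens and
    cochleagram tokens, in the spirit of DSAF. Hyperparameters and the
    final fusion (mean pooling + concatenation + projection) are assumptions."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.mel_to_coch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.coch_to_mel = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, mel_tokens, coch_tokens):
        # Each stream queries the other stream's tokens.
        mel_enh, _ = self.mel_to_coch(mel_tokens, coch_tokens, coch_tokens)
        coch_enh, _ = self.coch_to_mel(coch_tokens, mel_tokens, mel_tokens)
        fused = torch.cat([mel_enh.mean(dim=1), coch_enh.mean(dim=1)], dim=-1)
        return self.proj(fused)  # one fused embedding per clip

fusion = DualStreamCrossAttention()
out = fusion(torch.randn(8, 120, 256), torch.randn(8, 120, 256))  # shape (8, 256)
```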
The experiments conducted on multiple datasets, including Memo2496, 1000songs, and PMEmo, demonstrate the efficacy of the proposed methods. The results indicate that DAMER achieves state-of-the-art performance, with significant improvements in accuracy across various metrics. The ablation studies provide clear evidence of the contributions of each module, reinforcing the validity of the proposed framework. However, the reliance on specific datasets may limit the generalizability of the results.
The authors have made the dataset and source code publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation instructions or a comprehensive guide for reproducing the experiments, which could hinder researchers attempting to replicate the study.
One limitation of the study is the potential bias introduced by the expert annotators, as their interpretations of emotional content may not fully represent the broader population's responses to music. Additionally, the focus on instrumental tracks may overlook the complexities introduced by vocal elements in music emotion recognition. The dataset's reliance on specific genres may also limit its applicability to other musical styles.
The findings of this research have significant implications for various applications, including personalized music recommendation systems, mental health interventions, and the development of more nuanced affective computing technologies. The introduction of a high-quality dataset and advanced recognition framework could foster further research in the field and enhance the emotional intelligence of AI systems.
End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis, Voice Editing, which tests robustness to mid-utterance speech repairs and backtracking. We further adapt each axis to the audio modality, for example introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving the natural disfluencies found in unscripted human speech. Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with our highest-performing model, Gemini 3 Pro Preview (Thinking), achieving only a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect the difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify these failures and drive improvements in audio-native multi-turn interaction capability.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the Audio MultiChallenge benchmark, which provides a novel framework for evaluating end-to-end spoken dialogue systems in realistic multi-turn interactions. This work significantly advances the field by addressing gaps in existing evaluation methodologies and highlighting critical areas for improvement in dialogue system performance.
The methodology presented in the paper is innovative, particularly in its approach to evaluating end-to-end spoken dialogue systems through the Audio MultiChallenge framework. The introduction of the Voice Editing axis and Audio-Cue challenges is a significant advancement, as it addresses the limitations of existing benchmarks that primarily focus on synthetic speech and single-turn tasks. The hybrid audio-native agentic and human-in-the-loop pipeline for curating conversations is a robust method for exposing model failures while maintaining the natural disfluencies found in unscripted speech. However, the paper could benefit from more detailed descriptions of the implementation of these methodologies and their integration into the evaluation framework.
The experimental evaluation is comprehensive, involving a substantial dataset of 452 conversations from 47 speakers, which enhances the realism of the evaluation. The results indicate that even state-of-the-art models struggle with the new axes introduced, particularly in long-range context tracking and audio cue recognition. The reported pass rate of 54.65% for the highest-performing model highlights the challenges faced by current systems. However, the paper could improve by providing more quantitative metrics and comparisons with existing benchmarks to contextualize the results further.
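For clarity on how such rubric-based results are typically aggregated, the snippet below computes per-axis pass rates from individual rubric judgments; the field names are hypothetical and not taken from the benchmark's released format.

```python
from collections import defaultdict

def pass_rates_by_axis(results):
    """Aggregate rubric-level pass/fail judgments into per-axis pass rates.

    `results` is assumed to be a list of dicts like
    {"axis": "Voice Editing", "passed": True}; these field names are
    hypothetical placeholders.
    """
    totals, passes = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["axis"]] += 1
        passes[r["axis"]] += int(r["passed"])
    return {axis: passes[axis] / totals[axis] for axis in totals}

rates = pass_rates_by_axis([
    {"axis": "Voice Editing", "passed": False},
    {"axis": "Voice Editing", "passed": True},
    {"axis": "Inference Memory", "passed": True},
])
```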
The paper does not provide specific implementation details or access to the datasets used, which raises concerns about reproducibility. While the authors mention that Audio MultiChallenge is open-source, the lack of direct links to the code or datasets limits the ability of other researchers to replicate the study. Clearer documentation and access to resources would significantly enhance reproducibility.
One limitation is the focus on a relatively small dataset, which may not fully capture the diversity of natural human interactions. Additionally, the evaluation primarily targets specific axes of dialogue performance, potentially overlooking other important aspects of conversational AI. The paper also does not address how the models can be improved based on the identified failures, which could provide a pathway for future research.
The introduction of Audio MultiChallenge has the potential to significantly impact the field of spoken dialogue systems by providing a more realistic and comprehensive evaluation framework. This could drive advancements in model development, leading to more robust and effective dialogue systems capable of handling complex, multi-turn interactions. The focus on natural speech patterns and disfluencies is particularly relevant for real-world applications, enhancing the usability of AI in everyday communication.
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.
Primary: Tsinghua University
All Institutions: Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua University, Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University
The paper introduces HQ-MPSD, a high-quality multilingual dataset for partial deepfake speech detection, addressing critical gaps in existing datasets and providing a rigorous benchmark for evaluating detection models. The comprehensive methodology and experimental evaluations demonstrate significant contributions to the field, paving the way for advancements in robust detection systems.
The methodology for constructing the HQ-MPSD dataset is robust and innovative. It employs a three-stage process for generating partial deepfake speech that emphasizes linguistic coherence and acoustic fidelity. The use of fine-grained forced alignment for splice points and the normalization of loudness and spectral characteristics are noteworthy techniques that enhance the quality of the dataset. Additionally, the incorporation of background effects to simulate real-world conditions is a significant improvement over existing datasets. The careful design choices made to minimize artifacts and ensure natural transitions contribute to the dataset's overall quality and applicability for training detection models.
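As a simplified illustration of boundary handling in such pipelines, the sketch below splices a synthesized segment into genuine audio with RMS matching and short crossfades at the seams; the dataset's actual pipeline additionally relies on forced alignment to select linguistically coherent splice points and on loudness and spectral normalization.

```python
import numpy as np

def splice_segment(host, insert, start, end, sr, fade_ms=10):
    """Replace host[start:end] with a synthesized segment, matching RMS and
    applying short linear crossfades at both boundaries. A simplified
    illustration only, not the dataset's exact procedure."""
    n = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, n)

    # energy-match the insert to the region it replaces
    host_rms = np.sqrt(np.mean(host[start:end] ** 2) + 1e-12)
    insert = insert * (host_rms / (np.sqrt(np.mean(insert ** 2)) + 1e-12))

    # crossfade the insert's edges toward the original audio at the seams
    insert[:n] = host[start:start + n] * (1 - ramp) + insert[:n] * ramp
    insert[-n:] = insert[-n:] * (1 - ramp) + host[end - n:end] * ramp

    return np.concatenate([host[:start], insert, host[end:]])
```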
The experiments conducted using HQ-MPSD are comprehensive and well-structured. The cross-language and cross-dataset evaluations provide valuable insights into the generalization capabilities of state-of-the-art detection models. The performance drop observed in existing models when tested on HQ-MPSD highlights the dataset's effectiveness in revealing the limitations of current methodologies. The use of metrics such as Equal Error Rate (EER) and Area Under the Curve (AUC) for evaluation is appropriate and provides a clear understanding of model performance.
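For reference, EER and AUC can be computed from detection scores as in the snippet below; the label and score conventions are stated in the comments and may need flipping depending on the toolkit.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def compute_eer_auc(labels, scores):
    """Equal Error Rate and AUC from detection scores.

    Here `labels` are 1 for spoofed/partially fake and 0 for bona fide,
    and `scores` are higher-means-more-fake; flip either if a given
    toolkit uses the opposite convention.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    eer = fpr[np.nanargmin(np.abs(fpr - fnr))]  # operating point where FPR ≈ FNR
    return float(eer), float(roc_auc_score(labels, scores))

eer, auc = compute_eer_auc([0, 0, 1, 1, 1], [0.1, 0.4, 0.35, 0.8, 0.7])
```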
The paper provides sufficient detail regarding the dataset generation process and experimental setup, which aids in reproducibility. However, the lack of a publicly available code repository limits the ability for others to fully replicate the experiments. The dataset itself is accessible, which is a positive aspect for researchers looking to build upon this work.
While the dataset is a significant advancement, it may still have limitations regarding the diversity of accents and dialects within the eight languages represented. Additionally, the reliance on forced alignment may introduce its own biases, particularly if the alignment tools are not perfectly accurate. The paper does not address potential ethical concerns related to the misuse of deepfake technology, which is an important consideration in this field.
The development of HQ-MPSD has the potential to significantly advance the field of deepfake detection by providing a high-quality, multilingual benchmark that can improve the robustness of detection models. The dataset's design encourages the exploration of genuine manipulation cues rather than superficial artifacts, which can lead to more effective solutions in real-world applications. This work is particularly relevant in the context of misinformation and security, where the ability to detect partial deepfake speech can have substantial societal implications.
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DisCo-Speech, a novel framework for zero-shot controllable speech generation that achieves independent control over speaker timbre and speaking prosody through a disentangled speech codec. This work represents a significant step forward in the field of text-to-speech synthesis, addressing critical challenges in disentanglement and control, and providing a robust foundation for future research and applications.
The proposed methodology of DisCo-Speech is innovative, focusing on disentangling speech attributes into content, prosody, and timbre through a two-stage training paradigm. The tri-factor disentanglement approach is a significant advancement over existing methods, allowing for independent control over speech generation. The use of hybrid losses and parallel encoders is well-justified, addressing the disentanglement-reconstruction trade-off effectively. The integration of a standard LM for prosodic continuation and a specialized decoder for waveform synthesis is a thoughtful design choice that enhances the flexibility of the system.
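As one illustrative way to think about the disentanglement-reconstruction trade-off, the sketch below combines a reconstruction term with a penalty on overlap between the prosody and timbre embeddings; DisCodec's actual hybrid losses (including any adversarial or classification terms) are not reproduced here.

```python
import torch
import torch.nn.functional as F

def hybrid_disentanglement_loss(recon, target, prosody_emb, timbre_emb,
                                lambda_ortho=0.1):
    """Trade reconstruction fidelity against subspace separation:
    reconstruct the signal while penalizing overlap between the prosody
    and timbre embeddings. An illustrative stand-in, not the paper's losses.
    """
    recon_loss = F.l1_loss(recon, target)
    cos = F.cosine_similarity(prosody_emb, timbre_emb, dim=-1)
    ortho_penalty = (cos ** 2).mean()  # push the two subspaces apart
    return recon_loss + lambda_ortho * ortho_penalty
```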
The experimental evaluation is thorough, utilizing a diverse dataset and comparing DisCo-Speech against state-of-the-art models. The results demonstrate competitive performance in voice cloning and prosody control, with clear metrics provided for reconstruction quality and controllability. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings. However, more extensive comparisons with a broader range of existing methods could provide deeper insights into its relative performance.
The paper provides sufficient detail regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. The authors also mention plans to release code and weights, which is essential for enabling other researchers to validate the findings and build upon the work. However, the absence of specific details about the training data and preprocessing steps could hinder full reproducibility.
The paper acknowledges limitations, including lower speaker similarity compared to multi-stage systems and potential instability in generating exaggerated prosody. The delicate balance between disentanglement and reconstruction fidelity is also highlighted as an ongoing challenge. These limitations suggest areas for future improvement, particularly in enhancing the expressive range and fidelity of the generated speech.
The advancements presented in DisCo-Speech have significant implications for applications in human-computer interaction, entertainment, and accessibility technologies. The ability to generate speech with controlled prosody and timbre could enhance user experience in virtual assistants, audiobooks, and language learning tools. Furthermore, the framework's potential for zero-shot learning could democratize access to high-quality speech synthesis across diverse languages and dialects.
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DisCo-Speech, a novel framework for zero-shot controllable speech generation that effectively disentangles speech attributes, allowing for independent control over speaker timbre and prosody. This work represents a meaningful advancement in the field of text-to-speech synthesis, providing a robust foundation for future research and applications in controllable speech generation.
The methodology presented in DisCo-Speech is innovative, particularly in its approach to disentangling speech attributes through a two-stage training paradigm. The tri-factor disentanglement effectively separates content, prosody, and timbre, which is a significant advancement over existing methods that often struggle with this entanglement. The use of hybrid constraints to ensure robust disentanglement while maintaining reconstruction quality is a notable strength. However, the complexity of the model and the reliance on extensive training data may limit its accessibility and applicability in resource-constrained environments.
The experiments conducted are thorough, comparing DisCo-Speech against a range of state-of-the-art models across multiple dimensions, including voice cloning and prosody control. The results demonstrate competitive performance, particularly in maintaining speaker timbre while allowing for independent prosody control. The use of both objective and subjective metrics enhances the credibility of the findings. However, the paper could benefit from a more detailed discussion on the statistical significance of the results and the robustness of the evaluations.
The paper includes a commitment to release code and weights, which is essential for reproducibility. However, the detailed implementation specifics, such as hyperparameter settings and training configurations, are somewhat scattered throughout the text. A consolidated section summarizing these details would improve clarity and facilitate replication of the results.
The authors acknowledge limitations related to speaker similarity, which is lower than some multi-stage systems. Additionally, the quality of the training corpus may restrict the model's performance in generating highly exaggerated prosody. The delicate balance between disentanglement and reconstruction fidelity is also a noted challenge, suggesting areas for future improvement.
The potential applications of DisCo-Speech are significant, particularly in areas requiring high-quality, controllable speech synthesis, such as virtual assistants, audiobooks, and entertainment. By enabling zero-shot controllable speech generation, this work could enhance user interaction and personalization in various applications. However, ethical considerations regarding voice cloning and the potential for misuse in generating deceptive audio content should be addressed.