This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and streaming modes. While each ASR architecture offers distinct advantages and trade-offs depending on the application, maintaining separate models for each scenario incurs substantial development and deployment costs. To address this issue, we introduce a multi-mode joiner that enables seamless integration of various ASR modes within a single unified model. Experiments show that All-in-One ASR significantly reduces the total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Furthermore, joint decoding leverages the complementary strengths of different ASR modes, yielding additional improvements in recognition accuracy.
Primary: NTT, Inc.
All Institutions: NTT, Inc.
The paper presents a novel framework that unifies multiple ASR paradigms into a single model, significantly reducing complexity and enhancing performance. The comprehensive methodology and rigorous experimental validation highlight its potential to advance the state of the art in automatic speech recognition.
The proposed All-in-One ASR framework introduces a multi-mode joiner that effectively integrates CTC, AED, and Transducer models into a single architecture. This unification is significant as it reduces the model footprint and computational overhead while maintaining or improving recognition performance. The methodology is well-structured, leveraging joint training and decoding strategies to exploit the strengths of different ASR paradigms without the need for separate decoder branches. The use of a shared encoder and the innovative joiner mechanism are noteworthy contributions that address the challenges of model complexity and resource efficiency in ASR systems.
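The joiner itself is described only at a high level; below is a minimal PyTorch sketch of one way such a multi-mode joiner could be realized, with a learned mode embedding switching between CTC-style (encoder-only) and Transducer-style (encoder plus prediction network) scoring. The module and argument names are hypothetical, not the paper's implementation.

```python
# Hypothetical sketch of a multi-mode joiner over a shared encoder (not the paper's code).
import torch
import torch.nn as nn

class MultiModeJoiner(nn.Module):
    """One joiner producing token logits for CTC-, AED-, or Transducer-style decoding."""
    def __init__(self, enc_dim, pred_dim, vocab_size, num_modes=3):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, vocab_size)
        self.pred_proj = nn.Linear(pred_dim, vocab_size)
        self.mode_emb = nn.Embedding(num_modes, vocab_size)   # 0: CTC, 1: AED, 2: Transducer
        self.out = nn.Linear(vocab_size, vocab_size)

    def forward(self, enc_out, pred_out=None, mode=0):
        # enc_out: (B, T, enc_dim); pred_out: (B, U, pred_dim) from a prediction/decoder network.
        x = self.enc_proj(enc_out)                             # (B, T, V)
        if pred_out is not None:
            x = x.unsqueeze(2) + self.pred_proj(pred_out).unsqueeze(1)  # (B, T, U, V) lattice
        mode_vec = self.mode_emb(torch.tensor(mode, device=enc_out.device))
        return self.out(torch.tanh(x + mode_vec))              # logits shared across modes

joiner = MultiModeJoiner(enc_dim=512, pred_dim=512, vocab_size=1000)
ctc_logits = joiner(torch.randn(2, 100, 512), mode=0)                       # (2, 100, 1000)
rnnt_logits = joiner(torch.randn(2, 100, 512), torch.randn(2, 20, 512), 2)  # (2, 100, 20, 1000)
```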
The experimental evaluation is robust, utilizing well-established datasets such as TED-LIUM and LibriSpeech to demonstrate the effectiveness of the All-in-One ASR framework. The results indicate that the proposed model not only matches but often surpasses the performance of individually optimized models across various ASR tasks. The paper provides detailed comparisons and ablation studies that substantiate the claims of improved performance and reduced model size, showcasing the framework's versatility in both offline and streaming modes.
While the paper outlines the architecture and training procedures in detail, it lacks specific URLs or repositories for code and datasets, which could hinder reproducibility. The absence of a public demo or project page further limits the ability of other researchers to replicate the results. However, the comprehensive description of the methodologies and experimental setups provides a solid foundation for future implementations.
One limitation is the potential complexity introduced by the multi-mode joiner, which may require careful tuning of hyperparameters to achieve optimal performance across different ASR tasks. Additionally, the paper does not address the implications of scaling this framework to more complex or diverse ASR tasks beyond those tested. The reliance on specific datasets may also limit the generalizability of the findings.
The All-in-One ASR framework has significant implications for the deployment of ASR systems in resource-constrained environments, such as mobile devices or embedded systems, where model size and computational efficiency are critical. By unifying multiple ASR paradigms, this approach could streamline the development process and reduce costs, making advanced speech recognition technology more accessible across various applications.
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.
Primary: Tsinghua University
All Institutions: Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua University, Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University
The paper introduces HQ-MPSD, a high-quality multilingual dataset for partial deepfake speech detection, addressing critical gaps in existing datasets and providing a rigorous benchmark for evaluating detection models. The comprehensive methodology and experimental evaluations demonstrate significant contributions to the field, paving the way for advancements in robust detection systems.
The methodology for constructing the HQ-MPSD dataset is robust and innovative. It employs a three-stage process for generating partial deepfake speech that emphasizes linguistic coherence and acoustic fidelity. The use of fine-grained forced alignment for splice points and the normalization of loudness and spectral characteristics are noteworthy techniques that enhance the quality of the dataset. Additionally, the incorporation of background effects to simulate real-world conditions is a significant improvement over existing datasets. The careful design choices made to minimize artifacts and ensure natural transitions contribute to the dataset's overall quality and applicability for training detection models.
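As an illustration of the splicing idea (not the authors' actual pipeline), the sketch below replaces a word-level segment of genuine audio with a synthesized segment at boundaries taken from a forced alignment, RMS-matching the inserted audio and crossfading at both boundaries to suppress clicks; the helper and its assumptions (sample-index alignments, fake segment longer than two fade windows) are ours.

```python
# Illustrative splice (not the authors' pipeline): insert a synthesized segment into genuine
# audio at forced-alignment word boundaries, with RMS matching and short boundary crossfades.
import numpy as np

def crossfade_splice(real, fake_seg, start, end, sr, fade_ms=10):
    """Replace real[start:end] with fake_seg; start/end are sample indices from the alignment."""
    fade = int(sr * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    eps = 1e-8
    # Loudness-match the inserted segment to the region it replaces.
    gain = (np.sqrt(np.mean(real[start:end] ** 2)) + eps) / (np.sqrt(np.mean(fake_seg ** 2)) + eps)
    fake_seg = fake_seg * gain
    # Overlap-add crossfades at both word boundaries.
    left = real[start:start + fade] * (1 - ramp) + fake_seg[:fade] * ramp
    right = fake_seg[-fade:] * (1 - ramp) + real[end - fade:end] * ramp
    return np.concatenate([real[:start], left, fake_seg[fade:-fade], right, real[end:]])

sr = 16000
spliced = crossfade_splice(np.random.randn(5 * sr), np.random.randn(sr // 2), 2 * sr, int(2.4 * sr), sr)
```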
The experiments conducted using HQ-MPSD are comprehensive and well-structured. The cross-language and cross-dataset evaluations provide valuable insights into the generalization capabilities of state-of-the-art detection models. The performance drop observed in existing models when tested on HQ-MPSD highlights the dataset's effectiveness in revealing the limitations of current methodologies. The use of metrics such as Equal Error Rate (EER) and Area Under the Curve (AUC) for evaluation is appropriate and provides a clear understanding of model performance.
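For reference, EER and AUC can be computed directly from raw detection scores as shown below; this is the standard definition, not the paper's evaluation code (here a higher score means "spoofed" and label 1 marks spoofed audio).

```python
# Minimal EER / AUC computation from detection scores (standard definitions).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))          # operating point where FPR ~= FNR
    eer = (fpr[idx] + fnr[idx]) / 2
    return eer, roc_auc_score(labels, scores)

labels = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
print(eer_and_auc(labels, scores))
```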
The paper provides sufficient detail regarding the dataset generation process and experimental setup, which aids in reproducibility. However, the lack of a publicly available code repository limits the ability of others to fully replicate the experiments. The dataset itself is accessible, which is a positive aspect for researchers looking to build upon this work.
While the dataset is a significant advancement, it may still have limitations regarding the diversity of accents and dialects within the eight languages represented. Additionally, the reliance on forced alignment may introduce its own biases, particularly if the alignment tools are not perfectly accurate. The paper does not address potential ethical concerns related to the misuse of deepfake technology, which is an important consideration in this field.
The development of HQ-MPSD has the potential to significantly advance the field of deepfake detection by providing a high-quality, multilingual benchmark that can improve the robustness of detection models. The dataset's design encourages the exploration of genuine manipulation cues rather than superficial artifacts, which can lead to more effective solutions in real-world applications. This work is particularly relevant in the context of misinformation and security, where the ability to detect partial deepfake speech can have substantial societal implications.
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn).
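The abstract names a GRPO-based multi-reward objective; the sketch below shows the group-relative advantage computation at the core of GRPO applied to a weighted sum of several reward signals. The reward names, values, and weights are illustrative assumptions, not GLM-TTS's actual configuration.

```python
# Sketch of GRPO-style group-relative advantages with a weighted multi-reward signal
# (reward names, values, and weights are illustrative, not GLM-TTS's configuration).
import numpy as np

def group_relative_advantages(rewards, weights, eps=1e-8):
    """rewards: dict name -> array of shape (G,) for G sampled candidates of one prompt."""
    total = sum(w * rewards[name] for name, w in weights.items())
    return (total - total.mean()) / (total.std() + eps)   # GRPO: normalize within the group

rewards = {
    "pronunciation": np.array([0.91, 0.85, 0.97, 0.78]),   # e.g. 1 - CER from an ASR judge
    "speaker_sim":   np.array([0.72, 0.80, 0.75, 0.69]),   # cosine similarity to reference speaker
    "prosody":       np.array([0.60, 0.55, 0.70, 0.40]),   # expressiveness score from a judge
}
adv = group_relative_advantages(rewards, {"pronunciation": 1.0, "speaker_sim": 0.5, "prosody": 0.5})
print(adv)  # candidates above the group mean get positive advantage
```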
Primary: Zhipu AI
All Institutions: Zhipu AI, Tsinghua University
GLM-TTS presents a robust framework for efficient and high-quality text-to-speech synthesis, effectively addressing critical challenges in the field. The innovative use of reinforcement learning and hybrid input mechanisms positions it as a significant contribution to advancing TTS technology, particularly for languages with complex phonetic structures.
The methodology of GLM-TTS is well-structured, utilizing a two-stage architecture that effectively combines autoregressive and diffusion models for TTS. The introduction of a multi-reward reinforcement learning framework is particularly innovative, addressing common challenges in TTS systems such as pronunciation accuracy and emotional expressiveness. The use of a hybrid phoneme-text input scheme and optimized speech tokenizer enhances the system's controllability and adaptability, especially for languages with complex phonetic structures like Chinese. The detailed data processing pipeline and the enhancements made to the speech tokenizer demonstrate a thorough understanding of the underlying challenges in TTS.
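As a toy illustration of the hybrid phoneme-text idea mentioned above, the snippet below swaps words found in a user lexicon for phoneme tokens while leaving the rest as plain text; the <phon> markers and the lexicon entry are hypothetical, not GLM-TTS's actual token format.

```python
# Toy illustration of a hybrid phoneme-text input sequence: words present in a user lexicon
# are swapped for phoneme tokens, everything else stays as text. The markers and lexicon
# below are hypothetical placeholders, not GLM-TTS's actual token format.
def hybrid_input(text, lexicon):
    out = []
    for word in text.split():
        if word.lower() in lexicon:
            out.append("<phon> " + " ".join(lexicon[word.lower()]) + " </phon>")
        else:
            out.append(word)
    return " ".join(out)

lexicon = {"glm": ["JH", "IY1", "EH1", "L", "EH1", "M"]}   # forced pronunciation for one word
print(hybrid_input("Welcome to GLM TTS", lexicon))
```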
The experiments conducted are comprehensive, comparing GLM-TTS against state-of-the-art models across various benchmarks. The results indicate that GLM-TTS achieves competitive performance with significantly less training data, which is a notable achievement. The evaluation metrics used, including CER, WER, and SIM, provide a clear picture of the system's capabilities. However, the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper provides a link to the code repository and demo, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific datasets used are somewhat limited. More explicit information on the experimental setup would enhance reproducibility.
One limitation is the reliance on proprietary datasets, which may hinder the generalizability of the results. Additionally, while the system shows promise in emotional expressiveness, the paper acknowledges that the performance may vary across different emotional contexts, indicating potential areas for improvement. The complexity of the model may also pose challenges for deployment in resource-constrained environments.
The GLM-TTS system has significant implications for various applications, including virtual assistants, educational tools, and content creation. Its ability to generate high-fidelity, expressive speech with reduced training data makes it accessible for low-resource scenarios, potentially democratizing TTS technology. The focus on controllability and customization also opens avenues for personalized applications in diverse linguistic contexts.
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations and (ii) audio-audio contrastive learning, via Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
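A compact PyTorch sketch of the two-part objective described above is given below: a CLAP-style symmetric audio-text InfoNCE plus an audio-audio term that pulls embeddings of the same word together. The audio-audio part is written here as a supervised contrastive loss, which should be read as an approximation of the paper's DWD loss rather than its exact formulation.

```python
# Sketch of a joint objective: CLAP-style audio-text InfoNCE plus an audio-audio term
# that pulls same-word embeddings together (an approximation of the DWD loss).
import torch
import torch.nn.functional as F

def audio_text_infonce(a, t, tau=0.07):
    a, t = F.normalize(a, dim=-1), F.normalize(t, dim=-1)
    logits = a @ t.T / tau                                   # (B, B) similarity matrix
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target))

def audio_audio_supcon(a, word_ids, tau=0.07):
    a = F.normalize(a, dim=-1)
    sim = a @ a.T / tau
    self_mask = torch.eye(len(a), dtype=torch.bool, device=a.device)
    mask_pos = (word_ids[:, None] == word_ids[None, :]).float()
    mask_pos.masked_fill_(self_mask, 0)                      # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    denom = mask_pos.sum(1).clamp(min=1)
    return -(mask_pos * log_prob).sum(1).div(denom).mean()

def joint_loss(audio_emb, text_emb, word_ids, lam=0.5):
    return audio_text_infonce(audio_emb, text_emb) + lam * audio_audio_supcon(audio_emb, word_ids)
```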
Primary: Indian Institute of Technology Hyderabad
All Institutions: Indian Institute of Technology Hyderabad
The paper introduces a novel joint multimodal contrastive learning framework for robust spoken term detection and keyword spotting, demonstrating significant improvements over existing methods. The comprehensive methodology and rigorous experimental evaluation highlight its potential impact on the field of audio processing and machine learning.
The proposed joint multimodal contrastive learning framework effectively integrates audio and text modalities into a unified embedding space, addressing significant limitations of existing Acoustic Word Embedding (AWE) methods. The dual optimization of audio-text and audio-audio contrastive learning is innovative, leveraging the strengths of both modalities while enhancing intra-class compactness and inter-class separation. The methodology is well-structured, with clear explanations of the loss functions and training regime, although further details on hyperparameter tuning could enhance clarity.
The experiments are robust, utilizing the LibriSpeech corpus to evaluate the proposed model against multiple baselines. The performance metrics, including Average Precision (AP) and Equal Error Rates (EER), provide a comprehensive view of the model's capabilities in both Spoken Term Detection (STD) and Keyword Spotting (KWS). The results demonstrate consistent improvements over existing methods, particularly in challenging conditions, which underscores the effectiveness of the proposed approach.
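For context, the same/different word-discrimination evaluation typically scores every pair of acoustic embeddings by cosine similarity and reports average precision over the "same word" labels; the snippet below is that standard recipe, not the authors' released evaluation framework.

```python
# Minimal same/different word discrimination evaluation: score every embedding pair by
# cosine similarity and report average precision (standard recipe, not the authors' code).
import numpy as np
from itertools import combinations
from sklearn.metrics import average_precision_score

def word_discrimination_ap(embeddings, word_ids):
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores, labels = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        scores.append(float(embeddings[i] @ embeddings[j]))
        labels.append(int(word_ids[i] == word_ids[j]))       # 1 = same word
    return average_precision_score(labels, scores)
```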
The authors emphasize reproducibility by releasing a standardized evaluation framework and the trial generation recipe alongside their codebase. This commitment to transparency is commendable and facilitates further research in the field. However, more detailed documentation on the training process and hyperparameter settings would be beneficial for full reproducibility.
While the paper presents a significant advancement, it does not extensively discuss the potential computational costs associated with the proposed model, particularly in real-time applications. Additionally, the reliance on the LibriSpeech dataset may limit the generalizability of the findings to other languages or dialects.
The proposed framework has the potential to significantly improve spoken content retrieval systems, making them more robust to variations in speaker and background noise. This advancement could enhance accessibility in various applications, such as voice-activated systems and automated transcription services, thereby contributing to the broader adoption of speech technologies.
Music emotion recognition (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence-arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence-arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross-attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels using curriculum-based temperature scheduling and consistency quantification based on Jensen-Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
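One plausible reading of the 0.25 consistency rule (the exact protocol may differ from this sketch) is to keep a track only if every annotator's valence-arousal point lies within 0.25 Euclidean distance of the annotators' mean for that track, and to use that mean as the label:

```python
# One plausible reading of the 0.25 consistency rule (the exact protocol may differ):
# keep a track only if each annotator's valence-arousal point lies within 0.25 Euclidean
# distance of the annotators' mean for that track.
import numpy as np

def consistent(va_labels, threshold=0.25):
    """va_labels: (num_annotators, 2) array of (valence, arousal) in a normalized space."""
    center = va_labels.mean(axis=0)
    dists = np.linalg.norm(va_labels - center, axis=1)
    return bool(np.all(dists <= threshold)), center

labels = np.array([[0.62, 0.71], [0.58, 0.66], [0.65, 0.74]])
ok, consensus = consistent(labels)
print(ok, consensus)   # True -> the averaged label is used as ground truth
```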
Primary: South China University of Technology
All Institutions: South China University of Technology, Guangdong Provincial Key Laboratory of AI Large Model and Intelligent Cognition, Engineering Research Centre of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human
The paper presents a novel framework for music emotion recognition that combines a large-scale expert-annotated dataset with an innovative dual-view adaptive learning method. This work significantly contributes to addressing the challenges of data scarcity and feature drift in the field, showcasing the potential for improved emotion recognition in music.
The proposed methodology introduces a comprehensive framework for music emotion recognition, addressing critical challenges such as data scarcity and feature drift. The use of a large-scale, expert-annotated dataset (Memo2496) is a significant advancement, ensuring high-quality annotations through rigorous protocols. The Dual-View Adaptive Music Emotion Recogniser (DAMER) employs innovative modules like Dual Stream Attention Fusion (DSAF) for effective feature interaction, Progressive Confidence Labelling (PCL) for reliable pseudo-label generation, and Style Anchored Memory Learning (SAML) to mitigate cross-track feature drift. This multi-faceted approach demonstrates a thoughtful integration of various techniques, enhancing the robustness of the model.
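A minimal sketch of the kind of bidirectional cross-attention fusion DSAF performs between Mel-spectrogram and cochleagram token sequences is shown below; the dimensions, residual/normalization layout, and final concatenation are assumptions rather than the paper's exact architecture.

```python
# Minimal bidirectional cross-attention fusion of Mel-spectrogram and cochleagram tokens,
# in the spirit of DSAF; layout and dimensions are assumptions, not the paper's design.
import torch
import torch.nn as nn

class DualStreamCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.mel_to_coch = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.coch_to_mel = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_mel, self.norm_coch = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, mel_tokens, coch_tokens):
        # Each stream queries the other, then is added back residually.
        mel_ctx, _ = self.mel_to_coch(mel_tokens, coch_tokens, coch_tokens)
        coch_ctx, _ = self.coch_to_mel(coch_tokens, mel_tokens, mel_tokens)
        mel = self.norm_mel(mel_tokens + mel_ctx)
        coch = self.norm_coch(coch_tokens + coch_ctx)
        return torch.cat([mel, coch], dim=-1)      # fused representation for the emotion head

fused = DualStreamCrossAttention()(torch.randn(2, 120, 256), torch.randn(2, 120, 256))
```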
The experiments conducted on multiple datasets (Memo2496, 1000songs, and PMEmo) validate the effectiveness of the DAMER framework. The reported improvements in arousal dimension accuracy across different datasets underscore the model's generalizability and robustness. The ablation studies provide valuable insights into the contributions of each module, reinforcing the significance of the proposed methods. However, the paper could benefit from more extensive comparisons with a wider range of contemporary methods to fully contextualize its contributions.
The paper provides a clear description of the dataset, methodology, and experimental setup, which facilitates reproducibility. The availability of the dataset and source code on Figshare is a positive aspect, promoting transparency and enabling other researchers to replicate the findings. However, the paper lacks detailed hyperparameter settings and training configurations, which could further enhance reproducibility.
While the paper addresses several key challenges in music emotion recognition, it does not thoroughly discuss potential limitations of the proposed methods. For instance, the reliance on expert annotators, while beneficial for quality, may introduce biases that could affect the generalizability of the dataset. Additionally, the performance improvements, although significant, may not be sufficient for real-world applications where diverse and complex emotional expressions in music are encountered.
The advancements in music emotion recognition have potential applications in various fields, including personalized music recommendation systems, mental health interventions, and immersive entertainment experiences. The introduction of a high-quality dataset and a robust recognition framework can significantly enhance the accuracy and reliability of emotion-based applications in music, contributing to the broader field of affective computing.
Music emotion recognition (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence-arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence-arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross-attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels using curriculum-based temperature scheduling and consistency quantification based on Jensen-Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
Primary: South China University of Technology
All Institutions: South China University of Technology, Guangdong Provincial Key Laboratory of AI Large Model and Intelligent Cognition, Engineering Research Centre of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human
This paper presents a significant contribution to the field of music emotion recognition through the introduction of the Memo2496 dataset and the DAMER framework. The innovative methodologies employed address critical challenges in the domain, paving the way for future advancements in affective computing and machine learning applications in music.
The paper introduces a comprehensive framework for music emotion recognition that includes the Memo2496 dataset and the DAMER architecture. The methodology is robust, employing a dual-stream attention mechanism that integrates Mel spectrograms and cochleagrams, enhancing feature fusion through cross-attention. The Progressive Confidence Labelling module effectively addresses the challenges of pseudo-labeling in semi-supervised learning, while the Style-Anchored Memory Learning module mitigates feature drift across different musical styles. The combination of these methods represents a significant advancement in the field, particularly in addressing issues of annotation quality and feature representation.
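A rough sketch of the kind of consistency check PCL could apply when admitting pseudo-labels is shown below: predictions from the two views are sharpened with a temperature that decays over training, and a label is accepted only if the Jensen-Shannon divergence between them is small and the prediction is confident. The schedule, thresholds, and categorical framing are illustrative assumptions, not the paper's settings.

```python
# Sketch of a PCL-style check: accept a pseudo-label only if the two views' predictions
# agree (low Jensen-Shannon divergence) and the sharpened prediction is confident.
# Temperature schedule and thresholds are illustrative assumptions.
import numpy as np

def js_divergence(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def accept_pseudo_label(logits_a, logits_b, epoch, max_epoch):
    temp = max(0.5, 1.0 - 0.5 * epoch / max_epoch)           # sharpen predictions over training
    soft = lambda z: np.exp(z / temp) / np.sum(np.exp(z / temp))
    p, q = soft(np.asarray(logits_a)), soft(np.asarray(logits_b))
    consistent = js_divergence(p, q) < 0.1                   # illustrative threshold
    confident = max(p.max(), q.max()) > 0.8
    return int(np.argmax(0.5 * (p + q))) if consistent and confident else None

print(accept_pseudo_label([2.0, 0.1, -1.0], [1.8, 0.3, -0.8], epoch=10, max_epoch=50))
```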
The experiments conducted on multiple datasets, including Memo2496, 1000songs, and PMEmo, demonstrate the efficacy of the proposed methods. The results indicate that DAMER achieves state-of-the-art performance, with significant improvements in accuracy across various metrics. The ablation studies provide clear evidence of the contributions of each module, reinforcing the validity of the proposed framework. However, the reliance on specific datasets may limit the generalizability of the results.
The authors have made the dataset and source code publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation instructions or a comprehensive guide for reproducing the experiments, which could hinder researchers attempting to replicate the study.
One limitation of the study is the potential bias introduced by the expert annotators, as their interpretations of emotional content may not fully represent the broader population's responses to music. Additionally, the focus on instrumental tracks may overlook the complexities introduced by vocal elements in music emotion recognition. The dataset's reliance on specific genres may also limit its applicability to other musical styles.
The findings of this research have significant implications for various applications, including personalized music recommendation systems, mental health interventions, and the development of more nuanced affective computing technologies. The introduction of a high-quality dataset and advanced recognition framework could foster further research in the field and enhance the emotional intelligence of AI systems.
End-to-end (E2E) spoken dialogue systems are increasingly replacing cascaded pipelines for voice-based human-AI interaction, processing raw audio directly without intermediate transcription. Existing benchmarks primarily evaluate these models on synthetic speech and single-turn tasks, leaving realistic multi-turn conversational ability underexplored. We introduce Audio MultiChallenge, an open-source benchmark to evaluate E2E spoken dialogue systems under natural multi-turn interaction patterns. Building on the text-based MultiChallenge framework, which evaluates Inference Memory, Instruction Retention, and Self Coherence, we introduce a new axis, Voice Editing, which tests robustness to mid-utterance speech repairs and backtracking. We further adapt each axis to the audio modality, for example introducing Audio-Cue challenges for Inference Memory that require recalling ambient sounds and paralinguistic signals beyond semantic content. We curate 452 conversations from 47 speakers with 1,712 instance-specific rubrics through a hybrid audio-native agentic and human-in-the-loop pipeline that exposes model failures at scale while preserving the natural disfluencies found in unscripted human speech. Our evaluation of proprietary and open-source models reveals that even frontier models struggle on our benchmark, with Gemini 3 Pro Preview (Thinking), our highest-performing model, achieving a 54.65% pass rate. Error analysis shows that models fail most often on our new axes and that Self Coherence degrades with longer audio context. These failures reflect the difficulty of tracking edits, audio cues, and long-range context in natural spoken dialogue. Audio MultiChallenge provides a reproducible testbed to quantify these failures and drive improvements in audio-native multi-turn interaction capability.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the Audio MultiChallenge benchmark, which provides a novel framework for evaluating end-to-end spoken dialogue systems in realistic multi-turn interactions. This work significantly advances the field by addressing gaps in existing evaluation methodologies and highlighting critical areas for improvement in dialogue system performance.
The methodology presented in the paper is innovative, particularly in its approach to evaluating end-to-end spoken dialogue systems through the Audio MultiChallenge framework. The introduction of the Voice Editing axis and Audio-Cue challenges is a significant advancement, as it addresses the limitations of existing benchmarks that primarily focus on synthetic speech and single-turn tasks. The hybrid audio-native agentic and human-in-the-loop pipeline for curating conversations is a robust method for exposing model failures while maintaining the natural disfluencies found in unscripted speech. However, the paper could benefit from more detailed descriptions of the implementation of these methodologies and their integration into the evaluation framework.
The experimental evaluation is comprehensive, involving a substantial dataset of 452 conversations from 47 speakers, which enhances the realism of the evaluation. The results indicate that even state-of-the-art models struggle with the new axes introduced, particularly in long-range context tracking and audio cue recognition. The reported pass rate of 54.65% for the highest-performing model highlights the challenges faced by current systems. However, the paper could improve by providing more quantitative metrics and comparisons with existing benchmarks to contextualize the results further.
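As an illustration of how a rubric-based pass rate such as the reported 54.65% could be aggregated, the snippet below counts a conversation as passed only when every instance-specific rubric is satisfied; the all-rubrics-must-pass rule is an assumption about the scoring, not a documented detail of the benchmark.

```python
# Illustration of aggregating a rubric-based pass rate (the all-rubrics-must-pass rule
# is an assumption about the benchmark's scoring, not a documented detail).
def pass_rate(results):
    """results: list of conversations, each a list of booleans (one per instance-specific rubric)."""
    passed = sum(all(rubrics) for rubrics in results)
    return 100.0 * passed / len(results)

results = [
    [True, True, True],        # model satisfied every rubric for this conversation
    [True, False],             # missed one rubric -> the whole conversation fails
    [True, True],
]
print(f"{pass_rate(results):.2f}%")   # 66.67%
```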
The paper does not provide specific implementation details or access to the datasets used, which raises concerns about reproducibility. While the authors mention that Audio MultiChallenge is open-source, the lack of direct links to the code or datasets limits the ability of other researchers to replicate the study. Clearer documentation and access to resources would significantly enhance reproducibility.
One limitation is the focus on a relatively small dataset, which may not fully capture the diversity of natural human interactions. Additionally, the evaluation primarily targets specific axes of dialogue performance, potentially overlooking other important aspects of conversational AI. The paper also does not address how the models can be improved based on the identified failures, which could provide a pathway for future research.
The introduction of Audio MultiChallenge has the potential to significantly impact the field of spoken dialogue systems by providing a more realistic and comprehensive evaluation framework. This could drive advancements in model development, leading to more robust and effective dialogue systems capable of handling complex, multi-turn interactions. The focus on natural speech patterns and disfluencies is particularly relevant for real-world applications, enhancing the usability of AI in everyday communication.
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DisCo-Speech, a novel framework for zero-shot controllable speech generation that achieves independent control over speaker timbre and speaking prosody through a disentangled speech codec. This work represents a significant step forward in the field of text-to-speech synthesis, addressing critical challenges in disentanglement and control, and providing a robust foundation for future research and applications.
The proposed methodology of DisCo-Speech is innovative, focusing on disentangling speech attributes into content, prosody, and timbre through a two-stage training paradigm. The tri-factor disentanglement approach is a significant advancement over existing methods, allowing for independent control over speech generation. The use of hybrid losses and parallel encoders is well-justified, addressing the disentanglement-reconstruction trade-off effectively. The integration of a standard LM for prosodic continuation and a specialized decoder for waveform synthesis is a thoughtful design choice that enhances the flexibility of the system.
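A structural sketch of the tri-factor split is given below: parallel content, prosody, and timbre encoders, fusion of content and prosody into LM-facing tokens, and a decoder that re-injects timbre only at reconstruction time. The layer choices and the nearest-neighbour quantizer are placeholders, not DisCodec's actual design or losses.

```python
# Structural sketch of a tri-factor codec: parallel content / prosody / timbre encoders,
# content+prosody fused into LM-facing tokens, timbre injected only at decoding.
# Layer choices and the nearest-neighbour quantizer are placeholders, not DisCodec's design.
import torch
import torch.nn as nn

class TriFactorCodec(nn.Module):
    def __init__(self, feat_dim=80, dim=256, codebook_size=1024):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, dim, batch_first=True)
        self.prosody_enc = nn.GRU(feat_dim, dim, batch_first=True)
        self.timbre_enc = nn.Sequential(nn.Linear(feat_dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.fuse = nn.Linear(2 * dim, dim)
        self.codebook = nn.Embedding(codebook_size, dim)       # stand-in for a vector quantizer
        self.decoder = nn.GRU(2 * dim, feat_dim, batch_first=True)

    def forward(self, feats):                                  # feats: (B, T, feat_dim), e.g. Mel frames
        content, _ = self.content_enc(feats)
        prosody, _ = self.prosody_enc(feats)
        timbre = self.timbre_enc(feats).mean(dim=1)            # utterance-level timbre vector (B, dim)
        fused = self.fuse(torch.cat([content, prosody], dim=-1))
        dists = torch.cdist(fused, self.codebook.weight.unsqueeze(0).expand(feats.size(0), -1, -1))
        tokens = dists.argmin(dim=-1)                          # content-prosody tokens for the LM
        quantized = self.codebook(tokens)
        timbre_seq = timbre.unsqueeze(1).expand_as(quantized)
        recon, _ = self.decoder(torch.cat([quantized, timbre_seq], dim=-1))
        return tokens, recon

tokens, recon = TriFactorCodec()(torch.randn(2, 200, 80))
```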
The experimental evaluation is thorough, utilizing a diverse dataset and comparing DisCo-Speech against state-of-the-art models. The results demonstrate competitive performance in voice cloning and prosody control, with clear metrics provided for reconstruction quality and controllability. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings. However, more extensive comparisons with a broader range of existing methods could provide deeper insights into its relative performance.
The paper provides sufficient detail regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. The authors also mention plans to release code and weights, which is essential for enabling other researchers to validate the findings and build upon the work. However, the absence of specific details about the training data and preprocessing steps could hinder full reproducibility.
The paper acknowledges limitations, including lower speaker similarity compared to multi-stage systems and potential instability in generating exaggerated prosody. The delicate balance between disentanglement and reconstruction fidelity is also highlighted as an ongoing challenge. These limitations suggest areas for future improvement, particularly in enhancing the expressive range and fidelity of the generated speech.
The advancements presented in DisCo-Speech have significant implications for applications in human-computer interaction, entertainment, and accessibility technologies. The ability to generate speech with controlled prosody and timbre could enhance user experience in virtual assistants, audiobooks, and language learning tools. Furthermore, the framework's potential for zero-shot learning could democratize access to high-quality speech synthesis across diverse languages and dialects.
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses variable-length polyphonic note sequences into compact 64-dimensional phrase-level representations with high reconstruction fidelity, allowing efficient training and a well-structured latent space. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of a novel latent diffusion framework for full-song multitrack symbolic music generation, which addresses significant limitations in existing models. The methodology and results indicate a promising direction for future research in symbolic music generation, although improvements in reproducibility and evaluation metrics are necessary for broader adoption and validation in the field.
The paper introduces PhraseVAE and PhraseLDM, which leverage latent diffusion for symbolic music generation. The methodology is innovative as it compresses polyphonic note sequences into a structured latent space, allowing for efficient training and generation. The use of phrase-level representations instead of note-attribute tokens is a significant shift that addresses limitations in existing models. However, the details on the training process and the specific architecture of the latent diffusion model could be elaborated further to enhance understanding.
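A minimal sketch of the phrase-to-latent step is shown below: a variable-length note-event sequence is encoded, summarized, and reparameterized into a 64-dimensional phrase latent. The note featurization and network sizes are placeholders, not PhraseVAE's actual design.

```python
# Minimal sketch of the phrase-to-latent step: a variable-length note-event sequence is
# encoded, summarized, and reparameterized into a 64-dimensional phrase latent.
# The note featurization and network sizes are placeholders, not PhraseVAE's design.
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    def __init__(self, note_feat=4, hidden=256, latent=64):
        super().__init__()
        self.rnn = nn.GRU(note_feat, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)

    def forward(self, notes, lengths):
        # notes: (B, N_max, 4) padded (pitch, onset, duration, velocity); lengths: (B,)
        packed = nn.utils.rnn.pack_padded_sequence(notes, lengths.cpu(), batch_first=True,
                                                   enforce_sorted=False)
        _, h = self.rnn(packed)                      # h: (1, B, hidden) summary of the phrase
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return z, mu, logvar                         # z: (B, 64) phrase latent for the LDM

z, mu, logvar = PhraseEncoder()(torch.randn(8, 32, 4), torch.full((8,), 32))
```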
The experiments demonstrate the framework's ability to generate full songs with coherent structure and idiomatic instrument patterns. The evaluation metrics used to assess musical quality and generation diversity are not explicitly detailed, which could limit the assessment of the model's performance. The ability to generate 128 bars of music in a single pass is a notable achievement, indicating a strong technical contribution.
The paper does not provide sufficient details on the implementation or datasets used for training and evaluation, which raises concerns about reproducibility. Including a code repository or supplementary materials would greatly enhance the reproducibility of the results.
One limitation is the lack of detailed evaluation metrics and comparisons with existing state-of-the-art models. Additionally, while the model can generate music quickly, the paper does not discuss potential challenges in ensuring the musicality and creativity of the generated pieces over longer sequences.
The proposed framework has the potential to significantly advance the field of symbolic music generation, encouraging researchers to explore phrase-level modeling. This could lead to more sophisticated music generation systems that better capture the nuances of musical composition. The approach may also inspire applications in interactive music systems and automated composition tools.
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses an arbitrary variable-length polyphonic note sequence into a single compact 64-dimensional phrase-level latent representation with high reconstruction fidelity, allowing a well-structured latent space and efficient generative modeling. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PhraseVAE and PhraseLDM, which provide a novel latent diffusion approach for full-song multitrack symbolic music generation. This work represents a significant advancement in addressing the limitations of existing models, particularly in handling long sequences and maintaining musical coherence, thereby paving the way for future research in the domain.
The methodology introduced in this paper is innovative, leveraging a latent diffusion framework specifically tailored for symbolic music generation. The PhraseVAE component effectively compresses polyphonic note sequences into a compact latent representation, addressing the challenges of long sequences and limited context. The PhraseLDM builds on this representation to generate full songs in a single pass, which is a significant departure from traditional autoregressive models. The approach is well-structured, and the authors provide a clear rationale for their design choices, although further details on the training process and hyperparameter tuning would enhance the understanding of the methodology.
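To make the single-pass, non-autoregressive generation concrete, the toy loop below runs DDPM-style denoising over the whole song's phrase-latent grid (tracks x phrases x 64) at once; the noise schedule and the stand-in denoiser are placeholders for PhraseLDM, not its actual configuration.

```python
# Toy DDPM-style sampling loop over the whole song's phrase-latent grid at once
# (tracks x phrases x 64), illustrating single-pass, non-autoregressive generation.
# The schedule and the stand-in denoiser are placeholders for PhraseLDM.
import torch

def sample_song(denoiser, n_tracks=4, n_phrases=32, latent_dim=64, steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_bar = torch.cumprod(1 - betas, dim=0)
    x = torch.randn(1, n_tracks * n_phrases, latent_dim)      # one noisy grid = one whole song
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]))                   # predict noise for every phrase jointly
        a_bar, a = alphas_bar[t], 1 - betas[t]
        x = (x - betas[t] / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a)
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x.view(1, n_tracks, n_phrases, latent_dim)          # each latent is then decoded by the VAE

dummy_denoiser = lambda x, t: torch.zeros_like(x)              # stand-in; the real model is a transformer
song_latents = sample_song(dummy_denoiser)
```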
The experimental section demonstrates the capabilities of the proposed models through various metrics, including musical quality and generation diversity. The authors report that their framework can generate complete songs quickly while maintaining coherence and structure, which is a notable achievement. However, the paper would benefit from a more extensive comparison with existing state-of-the-art models to substantiate the claims of superiority in musical quality and efficiency.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. Key aspects such as the dataset used, specific training procedures, and evaluation metrics are not thoroughly detailed. Providing a code repository or supplementary materials would significantly improve the reproducibility of the results.
One limitation of the study is the lack of a comprehensive evaluation against a broader range of existing symbolic music generation models. Additionally, the reliance on a single latent representation may not capture all the nuances of music composition, potentially limiting the expressiveness of the generated pieces. The authors also do not address how the model performs with different genres or styles of music.
The proposed framework has the potential to significantly influence the field of symbolic music generation, encouraging researchers to explore phrase-level modeling over traditional note-attribute approaches. This could lead to advancements in music composition tools, AI-assisted music creation, and educational applications in music theory. The implications of generating high-quality music efficiently could also extend to various entertainment and media industries.