The rapid advancement of AI music generators highlights the urgent need for reliable Synthetic Song Detection (SSD). Existing SSD methods often rely on low-level artifacts or fixed feature assumptions, struggling to capture generator-agnostic cues. To address this, we propose Sofia (Synthetic-song detection framework via music features), a flexible framework that models music-intrinsic attributes via feature-specific experts and an adaptive Mixture-of-Experts (MoE) module. By configuring Sofia with representative Vocal, Audio-effect, and Global-structure features and their combinations, we present their individual and complementary contributions. To comprehensively evaluate our framework, we further construct MUSIC8K, a challenging benchmark featuring the latest emerging generators and realistic audio perturbations. Experiments show that Sofia learns generator-agnostic representations from music-intrinsic features, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O while maintaining strong robustness.
Primary: Fudan University
All Institutions: Fudan University
The paper presents Sofia, a flexible framework for Synthetic Song Detection that effectively models music-intrinsic features, significantly improving detection accuracy and robustness against audio perturbations. The comprehensive methodology and the introduction of the MUSIC8K benchmark position this work as a meaningful contribution to the field of audio machine learning.
The paper introduces Sofia, a novel framework for Synthetic Song Detection (SSD) that utilizes music-intrinsic features through a Mixture-of-Experts (MoE) architecture. This approach allows for flexible feature combinations and adaptive learning, addressing the limitations of existing SSD methods that rely on low-level artifacts. The methodology is well-structured, with a clear problem setup, feature extraction, and fusion strategies that enhance generalization across different music generators. The use of feature-specific experts to capture distinct musical attributes is innovative and adds depth to the detection process.
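The adaptive fusion of feature-specific experts described above can be illustrated with a minimal sketch of softmax-gated Mixture-of-Experts fusion. This is not the authors' implementation: the expert names mirror the three feature families named in the abstract, but the embedding values, the gate scores, and the `moe_fuse` helper are hypothetical, and a trained model would produce the gate scores per input with a learned gating network.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_fuse(expert_embeddings, gate_logits):
    """Fuse per-expert embeddings with softmax gate weights.

    expert_embeddings: dict mapping expert name -> embedding (list of floats)
    gate_logits: dict mapping expert name -> unnormalised gate score
    Returns (fused_embedding, weights_by_expert).
    """
    names = sorted(expert_embeddings)
    weights = softmax([gate_logits[n] for n in names])
    dim = len(expert_embeddings[names[0]])
    fused = [0.0] * dim
    for name, w in zip(names, weights):
        for i, value in enumerate(expert_embeddings[name]):
            fused[i] += w * value
    return fused, dict(zip(names, weights))

# Hypothetical 2-D embeddings from three feature-specific experts.
embeddings = {
    "vocal": [1.0, 0.0],
    "audio_effect": [0.0, 1.0],
    "global_structure": [1.0, 1.0],
}
# Hypothetical gate scores; equal scores give each expert the same weight.
logits = {"vocal": 0.0, "audio_effect": 0.0, "global_structure": 0.0}

fused, weights = moe_fuse(embeddings, logits)
print(weights)
print(fused)
```

Raising one gate score shifts the fused embedding toward that expert, which is the sense in which the gate lets the detector weight different musical cues differently from song to song.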
The authors construct the MUSIC8K benchmark, which is a significant contribution to the field, providing a dataset that includes the latest music generators and realistic audio perturbations. The experiments demonstrate that Sofia significantly outperforms existing methods, improving the F1 score by 18.5 points over the strongest baseline on MUSIC8K-O. The evaluation metrics used, including F1 score and accuracy, are appropriate for the task and provide a clear picture of the framework's performance.
The paper provides detailed implementation information, including training configurations, audio preprocessing, and network architecture. However, the absence of a publicly accessible project URL or demo limits the reproducibility of the results. Future work should consider making the code and data available to facilitate validation by other researchers.
One limitation is the reliance on specific music features, which may not generalize well as music generation technology evolves. The authors acknowledge that the current features might become outdated, suggesting a need for continuous updates to the framework. Additionally, the framework's performance on less common genres or styles of music is not thoroughly tested.
The implications of this research are significant, as it addresses the growing concern of synthetic music proliferation. The ability to detect synthetic songs reliably can have applications in copyright enforcement, music quality control, and the preservation of artistic integrity. Furthermore, the framework's adaptability could lead to advancements in other areas of audio analysis and detection.
This paper proposes a novel speaker adaptation approach for elderly speech recognition based on cross-utterance audio-textual prompts. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current and a few preceding utterances, then fused in a cross-modal manner to produce compact speaker prompts that are more consistent than i/x-vectors and ECAPA-TDNN features. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed online adaptation outperforms the speaker-independent (SI) model by statistically significant word error rate (WER) or character error rate (CER) reductions of 0.61% and 1.22% absolute (2.99% and 4.48% relative). Real-time factor (RTF) speed-up ratios of up to 9.83 times are obtained over offline batch-mode adaptation.
Primary: Institute of Software, Chinese Academy of Sciences
All Institutions: Institute of Software, Chinese Academy of Sciences, National Research Council Canada, The Chinese University of Hong Kong
The paper presents a novel online speaker adaptation method that leverages audio-textual prompts for elderly speech recognition. The technical contributions are significant, addressing critical challenges in the field and demonstrating substantial improvements in performance metrics, which could lead to enhanced user experiences in real-world applications.
The proposed methodology introduces a novel approach to speaker adaptation in elderly speech recognition by leveraging cross-utterance audio-textual prompts. This dual cross-modality fusion effectively captures both acoustic and language deficiencies, addressing the unique challenges posed by elderly speech. The use of a Q-Former for compressing variable-length history information is innovative and enhances the model's capability to adapt in real-time. However, the paper could benefit from more detailed explanations of the fusion strategies and the Q-Former architecture.
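How a Q-Former-style module can compress variable-length history into a fixed-size prompt may be easier to see in code. The sketch below is a generic single-head cross-attention with a fixed set of query vectors, not the paper's architecture: the `compress_history` helper, the query values, and the 2-D embeddings are all hypothetical, and a real Q-Former adds learned projections, multiple layers, and self-attention among the queries.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def compress_history(queries, history):
    """Single-head cross-attention using history as both keys and values.

    queries: K fixed query vectors (learned in a real model, constants here)
    history: N embeddings, one per past frame or utterance; N may vary
    Returns K output vectors regardless of N.
    """
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, h) / scale for h in history])
        mixed = [sum(w * h[i] for w, h in zip(weights, history))
                 for i in range(len(q))]
        outputs.append(mixed)
    return outputs

# K = 2 hypothetical queries over 2-D embeddings.
queries = [[1.0, 0.0], [0.0, 1.0]]
short_history = [[0.5, 0.1], [0.2, 0.9]]                              # N = 2
long_history = short_history + [[0.3, 0.3], [0.8, 0.4], [0.1, 0.7]]  # N = 5

short_prompt = compress_history(queries, short_history)
long_prompt = compress_history(queries, long_history)
print(len(short_prompt), len(long_prompt))
```

Both calls return two vectors, so the downstream recogniser sees a prompt of constant size however many preceding utterances are available.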
The experiments are well-structured, utilizing two relevant datasets (DementiaBank Pitt and JCCOCC MoCA) that are appropriate for the target demographic. The reported results demonstrate statistically significant improvements in WER and CER, showcasing the effectiveness of the proposed method over traditional speaker-independent models. The inclusion of ablation studies strengthens the findings, although further exploration of the impact of different configurations could provide deeper insights.
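The abstract reports each error-rate reduction in both absolute and relative terms, which is enough to back out the approximate speaker-independent baselines. The short calculation below is a reader's sanity check, not a figure from the paper: it assumes the two pairs of numbers are listed in the same order as the datasets (WER for DementiaBank Pitt, CER for JCCOCC MoCA), and the results are only approximate because the published reductions are rounded. The `implied_baseline` helper is a hypothetical name.

```python
def implied_baseline(absolute_reduction, relative_reduction):
    """Baseline error rate implied by one reduction stated in two forms.

    relative = absolute / baseline, so baseline = absolute / relative.
    Both arguments and the result are in percent.
    """
    return absolute_reduction / (relative_reduction / 100.0)

# Reductions quoted in the abstract (percent).
wer_baseline = implied_baseline(0.61, 2.99)
cer_baseline = implied_baseline(1.22, 4.48)

# Error rates after adaptation, under the same assumptions.
wer_adapted = wer_baseline - 0.61
cer_adapted = cer_baseline - 1.22

print(f"implied SI WER: {wer_baseline:.2f}%, adapted: {wer_adapted:.2f}%")
print(f"implied SI CER: {cer_baseline:.2f}%, adapted: {cer_adapted:.2f}%")
```

Under these assumptions the baselines come out at roughly 20.4% WER and 27.2% CER, which puts the sub-point absolute gains in context.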
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the methodology is described, specific hyperparameters, training procedures, and code availability are not mentioned, which could hinder reproducibility.
One limitation is the reliance on specific datasets, which may not generalize to all elderly populations or languages. Additionally, the method's performance in real-world scenarios with diverse acoustic environments remains untested. The paper also does not address potential ethical considerations related to the use of AI in sensitive applications like elderly care.
This research has significant implications for improving communication and social engagement among the elderly, particularly for those with speech impairments. By enhancing speech recognition systems to better accommodate elderly speakers, the work could contribute to advancements in assistive technologies and healthcare applications.
This work investigates modelling strategies in continuous and discrete latent spaces for vector quantisation (VQ)-based neural audio codec (NAC) speech enhancement (SE), along with the role of VQ regularisation. We propose cNAC-SE and dNAC-SE frameworks that predict continuous representations and discrete tokens in latent space, respectively. Theoretical analysis and visualisations in latent space are performed to exhibit their inherent modelling mechanisms. Experimental results show that the fully fine-tuned cNAC-SE model consistently outperforms all dNAC-SE variants across diverse test conditions and achieves leading performance among established generative approaches in DNS-MOS metrics. Comparison with the discriminative counterpart shows that VQ enhances robustness through an intrinsic effect of clean-prior-constrained regularisation, independent of discrete token processing. This highlights the transferable value of VQ regularisation to other continuous modelling methods.
Primary: Ghent University - imec
All Institutions: Ghent University - imec, IDLab
The paper presents a robust generative framework for speech enhancement using vector quantization-based neural audio codecs, demonstrating significant improvements in performance and robustness. The comprehensive methodology, thorough experimental validation, and potential applications underscore its importance in advancing the field of audio processing.
The paper introduces two novel frameworks, cNAC-SE and dNAC-SE, which utilize vector quantization (VQ) in latent spaces for speech enhancement. The methodology is well-structured, with a clear distinction between continuous and discrete latent representations, and it includes theoretical analysis and empirical validation. The use of VQ regularization is a significant innovation, allowing for improved robustness in generative models. The architecture is based on transformer blocks, which are effectively employed to enhance latent representations. The approach is rigorous, with detailed descriptions of the encoder-decoder architecture and loss functions, enhancing the clarity of the proposed methods.
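The distinction between predicting a continuous latent and predicting a discrete token can be made concrete with a toy vector quantiser. The sketch below is a generic nearest-neighbour codebook lookup under simplifying assumptions, not the paper's pipeline: the single four-entry codebook, the 2-D latents, and the `quantise` helper are hypothetical, whereas many neural codecs use residual VQ with several much larger codebooks.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantise(latent, codebook):
    """Return (index, code vector) of the codebook entry nearest to latent."""
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(latent, codebook[i]))
    return index, codebook[index]

# Hypothetical codebook; in a codec it would be learned from clean speech.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Continuous modelling: a network outputs a latent vector, and quantisation
# snaps it onto the nearest entry of the codebook.
predicted_latent = [0.9, 0.2]
token, snapped = quantise(predicted_latent, codebook)

# Discrete modelling: a network outputs a token index directly, and the
# latent is recovered by table lookup.
predicted_token = 1
looked_up = codebook[predicted_token]

print(token, snapped, looked_up)
```

In this toy case both routes land on the same code vector. The snapping step constrains any continuous prediction to the set of learned codebook entries, which is one way to picture the regularising effect the review attributes to VQ.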
The experiments are comprehensive, utilizing the DNS3 Challenge dataset, which is appropriate for evaluating speech enhancement systems. The authors provide a thorough comparison of their models against various baselines, including both generative and discriminative approaches. The results demonstrate the superiority of the cNAC-SE model in terms of DNS-MOS scores across different test conditions, showcasing its effectiveness. The inclusion of ablation studies further strengthens the evaluation, revealing insights into the impact of fine-tuning and the advantages of clean-prior-constrained VQ.
The paper provides sufficient details regarding the experimental setup, including the dataset, training parameters, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ease of reproduction for other researchers. The authors do provide a demo URL, which is beneficial for practical evaluation of their methods.
One identified limitation is the computational overhead associated with the full codec pipeline, which may hinder deployment in resource-constrained environments. Additionally, while the models show improved performance, the paper does not extensively discuss the potential trade-offs between computational efficiency and enhancement quality.
The proposed frameworks have significant implications for real-world applications in speech enhancement, particularly in scenarios involving noisy environments. The robustness of the cNAC-SE model, as demonstrated in the experiments, suggests potential for deployment in various audio processing applications, including telecommunications, hearing aids, and voice recognition systems. The findings also contribute to the broader understanding of generative models in audio processing, potentially influencing future research directions in the field.
With the growing focus on audio in multimedia applications, numerous advanced works on audio generation have emerged. Existing studies typically treat text-to-audio (TTA) and other related audio generation tasks, such as instruction-based audio editing, as independent challenges, adopting task-specific architectures or modules. This absence of a unified modeling paradigm substantially increases the overhead and complexity of building a system for both audio generation and editing, while also limiting scalability. To address this issue, we introduce AudioWeave, a unified model for TTA and audio editing without additional task-specific components. Specifically, we propose a joint condition modeling approach with a factorized position embedding, enabling the diffusion transformer backbone to operate under the heterogeneous inputs of TTA and audio editing. We further propose a progressive multistage training strategy to mitigate task competition and catastrophic forgetting caused by interference among multiple tasks. This in turn helps maintain the performance of each individual task and may even lead to improvements in certain aspects. Experimental results on the TTA task and six audio editing tasks show that our unified model achieves competitive performance with task-specific models, laying the groundwork for further exploration of unified audio generation models.
Primary: Institute of Artificial Intelligence of China Telecom (TeleAI)
All Institutions: Institute of Artificial Intelligence of China Telecom (TeleAI), Department of Electronic Engineering and Information Science, School of Artificial Intelligence, Tianjin University, Tianjin Key Laboratory of Cognitive Computing and Application
The main contribution of this paper is the introduction of AudioWeave, a unified model for audio generation and editing that leverages joint condition modeling and progressive training strategies to achieve competitive performance across multiple tasks. This work significantly advances the field by providing a comprehensive framework that integrates diverse audio generation tasks, potentially influencing future research and applications in audio processing and multimedia content creation.
The paper presents a unified model, AudioWeave, which integrates text-to-audio generation and audio editing through a joint condition modeling approach and a progressive multistage training strategy. The methodology is well-structured, addressing the challenge of task competition and catastrophic forgetting, and employs a diffusion transformer backbone that allows for effective interaction between different modalities. The introduction of factorized position embedding is a notable innovation that enhances the model's ability to handle heterogeneous inputs.
The experimental setup is robust, utilizing multiple datasets for both TTA and audio editing tasks. The results demonstrate competitive performance against state-of-the-art models, with both objective and subjective metrics being employed. The use of human evaluation (MOS) alongside objective metrics strengthens the credibility of the findings. The ablation studies provide valuable insights into the effectiveness of the proposed methods.
The implementation details are clearly outlined, including the model architecture, training procedures, and datasets used. However, the lack of a publicly available code repository limits full reproducibility. The detailed descriptions of the training strategy and evaluation metrics are beneficial for other researchers aiming to replicate or build upon this work.
One limitation is the reliance on pretrained components for certain parts of the model, which may affect the overall performance if those components are not optimally tuned. Additionally, the paper does not address the scalability of the model to larger datasets or more complex audio generation tasks, which could be a potential area for future work.
The proposed model has significant implications for multimedia applications, particularly in enhancing audio generation and editing capabilities in various domains such as film, gaming, and interactive media. The unified approach could streamline workflows in audio content creation, making it more efficient and accessible for creators.
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
Primary: UKRI Centre for Doctoral Training in Artificial Intelligence and Music
All Institutions: UKRI Centre for Doctoral Training in Artificial Intelligence and Music, Sony AI, Google
The main contribution of this paper is the introduction of TuneJury, a novel pairwise reward model for text-to-music generation that improves preference alignment through innovative calibration techniques. This work represents a meaningful advancement in the intersection of machine learning and music generation, offering both a robust methodology and practical applications that could influence future research and development in the field.
The methodology presented in TuneJury is innovative, particularly in its approach to creating an instance-level pairwise reward model for text-to-music generation. The use of diverse human-preference labels, including arena-style votes and expert aesthetic ratings, enhances the robustness of the model. The introduction of anchor calibration as a post-hoc adjustment method is a significant contribution, as it allows for improved data efficiency without necessitating retraining. This aspect of the methodology is particularly noteworthy as it addresses a common challenge in machine learning applications, which is the need for extensive retraining when adapting to new models or datasets.
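To show how a scalar reward model of this kind is typically used, the sketch below works through the standard Bradley-Terry link between a score margin and a preference probability, together with threshold-based pair filtering and best-of-N selection. It does not call the released TuneJury code and does not implement the paper's anchor calibration: the clip names, scores, threshold, and all function names are hypothetical stand-ins for model outputs.

```python
import math

def preference_probability(score_a, score_b):
    """Bradley-Terry probability that clip A is preferred over clip B,
    treating scores as log-strengths: sigmoid(score_a - score_b)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def filter_pairs(pairs, margin_threshold):
    """Keep only pairs whose absolute score margin reaches the threshold."""
    return [(a, b) for a, b in pairs if abs(a - b) >= margin_threshold]

def best_of_n(candidates, score_fn):
    """Return the candidate with the highest reward score."""
    return max(candidates, key=score_fn)

# Hypothetical reward scores for three clips generated from one prompt.
scores = {"clip_a": 0.3, "clip_b": 1.7, "clip_c": -0.4}

p_b_over_a = preference_probability(scores["clip_b"], scores["clip_a"])
confident_pairs = filter_pairs([(1.7, 0.3), (0.3, 0.2)], margin_threshold=0.5)
winner = best_of_n(list(scores), scores.get)

print(round(p_b_over_a, 3), confident_pairs, winner)
```

If the margin is well calibrated, as the abstract claims for the held-out split, a larger margin corresponds to a more reliable preference, which is why a simple threshold on the margin can serve as a data filter.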
The experiments conducted in this paper are comprehensive, covering both held-out test pairs and out-of-distribution benchmarks. The results indicate that TuneJury remains competitive with prior baselines, showcasing its generalizability and effectiveness. The evaluation metrics used, including calibration of predicted scores, provide a solid foundation for assessing the model's performance. However, the paper could benefit from more detailed comparisons with existing state-of-the-art systems to further validate its claims.
The paper provides a GitHub repository link for the TuneJury project, which is crucial for reproducibility. However, the paper could enhance reproducibility by including more detailed descriptions of the datasets used, the training process, and the specific configurations of the models. Clear documentation in the repository would also aid other researchers in replicating the results.
One limitation of the study is the reliance on publicly available human-preference labels, which may introduce biases based on the demographics of the participants or the specific contexts in which the preferences were gathered. Additionally, while the model shows promise, its performance on more diverse music genres and styles has not been thoroughly evaluated, which could limit its applicability in broader contexts.
The implications of TuneJury are significant for the field of music generation, particularly in enhancing the alignment of generated music with user preferences. This has potential applications in various domains, including entertainment, gaming, and personalized music experiences. As music generation technology continues to evolve, tools like TuneJury could play a crucial role in making AI-generated music more accessible and enjoyable for users.
Codecfakes (CFs) are a type of speech deepfake generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from those of vocoder-based deepfakes, causing detectors trained on vocoder data to generalize poorly to CF detection. Although this has led to the development of CF detection benchmarks, existing resources are largely confined to English (and, to a limited extent, Chinese), leaving South-East Asian (SEA) languages unexplored. To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages, diverse speaker profiles, and a wide range of NAC architectures. SEA-CF is constructed by synthesizing publicly available real speech corpora. Our experiments show that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to SEA speech due to language-specific phonetic structures, tonal variations, and rich prosodic diversity. We further conduct a comprehensive zero-shot and fine-tuned evaluation of recent SOTA ALMs on SEA-CF. Fine-tuning the ALMs improves performance; however, these models are too large to be practical for real-world applications, particularly in low-resource and latency-constrained settings. To address this limitation, we propose a novel small ALM, GARUDA, tailored for CF detection, which delivers strong performance while remaining lightweight. Extensive evaluations demonstrate that the proposed small ALM outperforms strong end-to-end and ALM-based baselines, establishing a new, practical direction for robust CF detection in SEA languages and beyond.
Primary: IIIT-Delhi
All Institutions: IIIT-Delhi
The paper introduces SEA-CF, the first large-scale benchmark for CF detection in SEA languages, and proposes GARUDA, a lightweight model that significantly improves detection performance while addressing practical deployment challenges. The comprehensive methodology and rigorous experimental validation establish a strong foundation for future research in audio deepfake detection, particularly in underrepresented languages.
The methodology presented in this paper is robust and innovative, particularly in the construction of the SEA-CF benchmark, which addresses a significant gap in the literature regarding CF detection in SEA languages. The dual-encoder architecture of GARUDA, which combines semantic and prosodic features, is a novel approach that leverages existing models effectively. The introduction of the JS alignment loss function further enhances the model's performance by ensuring better representation fusion. The authors provide a clear and systematic explanation of their framework, making it accessible for future research.
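As an illustration of what an alignment term of this kind might compute, the sketch below assumes that "JS" refers to the Jensen-Shannon divergence and that each encoder output is turned into a distribution with a softmax. Neither assumption is confirmed by the review, and the vectors `semantic` and `prosodic` are made-up toy embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded above by ln 2."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)

# Toy embeddings from two hypothetical encoders (assumed, not from the paper).
semantic = [0.2, 1.5, -0.3, 0.8]
prosodic = [0.1, 1.1, 0.4, 0.9]
alignment_loss = js_divergence(softmax(semantic), softmax(prosodic))
```

Minimizing a term like `alignment_loss` would pull the two representations toward a shared distribution before fusion; unlike plain KL, it is symmetric, so neither encoder is treated as the reference.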
The experimental evaluation is comprehensive, including zero-shot and fine-tuned assessments of various ALMs on the SEA-CF benchmark. The results demonstrate that existing SOTA models struggle with SEA languages, highlighting the necessity of the proposed SEA-CF dataset. The statistical significance of the improvements achieved by GARUDA over baseline models is well-documented, reinforcing the validity of the findings. The use of multiple evaluation metrics (ACC and EER) provides a thorough understanding of the model's performance.
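For readers less familiar with EER, it is the error rate at the decision threshold where the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of that computation on a finite score set; the function name, the label convention (1 = fake, 0 = real), and the toy scores are assumptions, not the paper's evaluation code.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by sweeping thresholds over the observed scores.

    scores: higher means "more likely fake"; labels: 1 = fake, 0 = real.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    n_fake = sum(labels)
    n_real = len(labels) - n_fake
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(scores)):
        # A clip is flagged as fake when its score is >= thr.
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        misses = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        far = false_alarms / n_real   # real speech flagged as fake
        frr = misses / n_fake         # fake speech accepted as real
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy detector outputs: the second set has one real and one fake clip swapped.
separable = equal_error_rate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
overlapping = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Reporting EER alongside accuracy is useful because accuracy depends on a fixed threshold and on class balance, whereas EER summarizes the trade-off between the two error types.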
The paper provides sufficient implementation details, including model architectures, training configurations, and evaluation protocols, which support reproducibility. The authors have made the SEA-CF dataset and the GARUDA model publicly available, facilitating further research in this area. However, the lack of a demo or interactive tool limits immediate accessibility for practitioners.
The primary limitation is that SEA-CF does not yet cover all SEA languages, which may restrict the applicability of the findings. Additionally, the evaluation is currently limited to available benchmarks, and the authors acknowledge that future work will be needed to incorporate more languages and improve generalization across diverse generators.
This research has significant implications for enhancing the security of communication systems in SEA regions, where the risk of audio deepfakes is increasing. By providing a benchmark and a lightweight detection model, the authors contribute to the development of more inclusive and effective tools for combating misinformation and fraud in low-resource language contexts. The alignment with SDG goals emphasizes the social relevance of the work.
Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.
Primary: Foshan University
All Institutions: Foshan University, South China University of Technology, Tongji University
This paper presents a significant advancement in the assessment of non-verbal vocalizations in speech, introducing the NVMOS model and a novel dataset that addresses a critical gap in existing speech quality evaluation methodologies. The approach not only enhances the reliability of NV quality predictions but also opens avenues for future research in multimodal audio understanding and synthesis.
The paper introduces a novel approach to assessing the perceptual quality of non-verbal vocalizations (NVs) in speech through the development of the NVMOS model. The methodology is well-structured, utilizing a local NV-event focusing module that leverages cross-attention mechanisms to enhance the prediction accuracy of NV quality. The integration of a text-queried approach allows the model to focus specifically on NV events rather than treating the audio as a whole, which is a significant improvement over existing methods. The construction of the NV-MOS dataset, which includes both synthetic and natural NV samples rated by experts, provides a solid foundation for training and evaluating the model.
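The text-queried focusing idea can be sketched as single-head scaled dot-product cross-attention, in which a text embedding of the NV tag acts as the query and the audio frames act as keys and values. This is a simplified illustration under assumed shapes and names (`cross_attention_pool`, `laugh_query`, `audio_frames`), without learned projections; the actual NVMOS module may differ.

```python
import math

def cross_attention_pool(query, frames):
    """Pool audio frames with attention weights driven by a text query.

    query: list of d floats (e.g. an embedding of the NV tag).
    frames: list of T frame vectors, each of d floats, used as keys and values.
    Returns (weights, pooled), where weights sum to 1 over the T frames.
    """
    d = len(query)
    logits = [
        sum(q * f for q, f in zip(query, frame)) / math.sqrt(d)
        for frame in frames
    ]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # stable softmax over frames
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * frame[k] for w, frame in zip(weights, frames))
        for k in range(d)
    ]
    return weights, pooled

# Toy input: only the middle frame resembles the queried NV event.
laugh_query = [1.0, 0.0]
audio_frames = [[0.0, 1.0], [4.0, 0.0], [0.0, -1.0]]
weights, pooled = cross_attention_pool(laugh_query, audio_frames)
```

The pooled vector is dominated by the frames that match the query, which is the sense in which such a module scores the NV event rather than the utterance as a whole.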
The experimental setup is comprehensive, involving a well-defined dataset split and rigorous evaluation metrics including Pearson, Spearman, and Kendall correlations. The results demonstrate that NVMOS achieves a high level of agreement with expert ratings, indicating its effectiveness in predicting NV quality. The ablation studies further substantiate the importance of the text-queried local focusing mechanism, showcasing that the model's design choices are backed by empirical evidence.
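The three agreement metrics named above can be computed as follows. This is a dependency-free sketch with toy ratings; the function names and the example values are assumptions, and the Kendall variant shown is tau-b, which may not be the one the authors used.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson on the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects the denominator for tied pairs."""
    conc = disc = tie_x = tie_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        tie_x += dx == 0
        tie_y += dy == 0
        if dx * dy > 0:
            conc += 1
        elif dx * dy < 0:
            disc += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (conc - disc) / math.sqrt((n_pairs - tie_x) * (n_pairs - tie_y))

# Toy expert MOS versus model predictions (illustrative values only).
expert_mos = [1, 2, 3, 4, 5]
predicted = [1.2, 1.9, 3.4, 3.1, 4.8]
```

Pearson measures linear agreement with the expert scale, while Spearman and Kendall only depend on the ordering of clips, so a model can rank NV samples correctly and still be miscalibrated in absolute MOS.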
The paper provides sufficient details regarding the experimental setup, including the architecture of the NVMOS model, training parameters, and evaluation metrics. However, the absence of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should consider releasing the model and dataset to facilitate further research and validation.
One limitation of the study is the reliance on a relatively small dataset of expert ratings, which may not capture the full variability of human perception across different contexts and populations. Additionally, while the model shows promise, it may still struggle with edge cases or NVs that are less common or more ambiguous, as indicated by the analysis of LLM judges.
The implications of this research extend to various applications in speech synthesis, emotion recognition, and human-computer interaction, where the ability to accurately assess and generate non-verbal vocalizations can enhance the expressiveness and naturalness of synthetic speech systems. By addressing the perceptual quality of NVs, this work contributes to the advancement of more nuanced and emotionally aware AI systems.