Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and formats. Existing biomedical audio-language QA systems are typically monolithic, without any specialization mechanisms for tackling diverse respiratory corpora and query intents. They are also validated only in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world deployment. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.
Primary: Tsinghua University
All Institutions: Tsinghua University, University of Calabria, University of Cambridge
The main contribution of this paper is the introduction of RAMoEA-QA, a hierarchically specialized model for respiratory audio question answering that effectively addresses the challenges posed by heterogeneous audio data and diverse query intents. This work represents a meaningful advancement in the integration of machine learning and healthcare, demonstrating both innovative methodology and impactful results.
The proposed RAMoEA-QA model introduces a novel hierarchical specialization approach that employs a two-stage conditional specialization mechanism, utilizing an Audio Mixture-of-Experts and a Language Mixture-of-Adapters. This design allows the model to effectively handle the diverse nature of respiratory audio data and various query intents, which is a significant advancement over existing monolithic biomedical audio-language QA systems. The use of pre-trained audio encoders and LoRA adapters on a frozen LLM demonstrates a thoughtful integration of state-of-the-art techniques while maintaining a low parameter overhead.
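To make the two-stage conditional specialization concrete, here is a minimal sketch of hierarchical routing: a softmax gate over the recording embedding picks an audio expert, and a second gate over the query embedding picks a LoRA adapter. All names, dimensions, and the top-1 (argmax) routing rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(recording_emb, query_emb, audio_gate_w, adapter_gate_w):
    """Two-stage top-1 routing (illustrative): stage 1 picks an audio
    encoder for the recording, stage 2 picks a LoRA adapter for the query."""
    audio_probs = softmax(audio_gate_w @ recording_emb)
    adapter_probs = softmax(adapter_gate_w @ query_emb)
    return int(audio_probs.argmax()), int(adapter_probs.argmax())

rng = np.random.default_rng(0)
rec_emb = rng.normal(size=16)   # summary embedding of the audio recording
qry_emb = rng.normal(size=8)    # embedding of the question text
enc_idx, lora_idx = route(rec_emb, qry_emb,
                          rng.normal(size=(3, 16)),  # gate over 3 audio experts
                          rng.normal(size=(4, 8)))   # gate over 4 LoRA adapters
```

Because only one expert encoder and one adapter are active per example, the extra cost at inference time is essentially the two small gating projections.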
The paper presents a comprehensive experimental setup, comparing RAMoEA-QA against strong baselines and conducting ablation studies to validate the effectiveness of the routing mechanisms. The reported in-domain test accuracy of 0.72 significantly surpasses the state-of-the-art baselines (0.61 and 0.67), indicating robust performance. The experiments also address generalization across different domains, modalities, and tasks, which is critical for real-world applications in healthcare.
The authors provide a link to their code repository, which is essential for reproducibility. However, the paper could benefit from additional details regarding the implementation specifics, such as hyperparameter settings and training procedures, to facilitate easier replication of results by other researchers.
One limitation noted is the reliance on the RA-QA collection, which may not encompass the full diversity of respiratory audio data encountered in practice. Additionally, while the model shows strong performance in controlled settings, its robustness in highly variable real-world environments remains to be fully validated.
The RAMoEA-QA model has significant potential applications in healthcare, particularly in respiratory care, where it can enhance patient monitoring and screening through scalable audio analysis. Its ability to handle diverse audio inputs and question formats could lead to more effective and personalized patient interactions, ultimately improving healthcare outcomes.
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specialist models lack the capacity to learn from all available data, and one must pay closer attention to the mismatch between training and test conditions. In this work, we study targeted data selection as a strategy to address these challenges, selecting relevant subsets from 100k hours of in-the-wild training data to optimize performance on target domains. We represent speech samples using embeddings that capture complementary characteristics--speaker attributes, phonetic content, and semantic meaning--and analyze how relevance and diversity along these axes affect downstream ASR performance when performing data selection. Our experiments with CTC-based Conformer models show that training on a strategically selected 5% subset can exceed the performance of models trained on the full dataset by up to 36.8% relative WER reduction on target domains.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
This paper contributes to the field by proposing a robust embedding-based data selection method for ASR systems that addresses domain mismatch challenges, demonstrating significant performance improvements across various datasets. The comprehensive methodology and experimental validation provide a strong foundation for future research in data selection and ASR model training.
The paper presents a novel approach to data selection for ASR systems using embedding-based methods that capture speaker, phonetic, and semantic characteristics. The use of Maximal Marginal Relevance (MMR) to balance relevance and diversity in data selection is a significant methodological advancement. The multi-embedding and multi-target strategies enhance the robustness of the approach, allowing for effective training on large-scale, heterogeneous datasets. The methodology is well-structured, with clear definitions and mathematical formulations that enhance clarity and reproducibility.
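The greedy MMR selection the review describes can be sketched as follows: each candidate is scored by its relevance to a target-domain embedding minus a redundancy penalty against already-selected samples. The function name, the cosine-similarity choice, and the lambda value are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def mmr_select(cands, target, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: trade off relevance to a
    target-domain embedding against redundancy with selected samples."""
    selected, remaining = [], list(range(len(cands)))
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for i in remaining:
            rel = cos(cands[i], target)
            red = max((cos(cands[i], cands[j]) for j in selected), default=0.0)
            score = lam * rel - (1 - lam) * red
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

cands = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
picks = mmr_select(cands, np.array([1.0, 0.0]), k=2, lam=0.3)
```

With lam close to 1 the selection favors pure relevance; lowering lam makes near-duplicates of already-chosen samples unattractive, so the subset spreads out along the embedding axes.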
The experiments are comprehensive, utilizing multiple target datasets (LibriSpeech, CommonVoice, TED-LIUM) to validate the effectiveness of the proposed data selection methods. The results demonstrate substantial improvements in word error rate (WER) when using strategically selected subsets compared to random selections and the full dataset. The experiments are well-designed, with appropriate controls and comparisons that provide strong evidence for the claims made.
The paper provides sufficient details regarding the implementation, including model architectures, training procedures, and data selection algorithms. However, the lack of publicly available code or datasets limits reproducibility. The use of specific embeddings and the complexity of the MMR selection process may pose challenges for others attempting to replicate the results without access to the same resources.
The paper acknowledges the computational expense of the greedy MMR procedure and the potential for label noise in the pseudo-labeled Granary dataset. Additionally, the reliance on embedding-based selection may not generalize across all domains or datasets, and the performance may vary based on the characteristics of the target domain.
The findings have significant implications for the deployment of ASR systems in specialized domains, particularly in scenarios where labeled data is scarce. The ability to effectively select relevant training data can enhance the performance of models in real-world applications, making this research highly relevant to both academia and industry. The approach may also inspire further research into data selection strategies in other machine learning domains.
This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increasing device complexity and cost. To enable ML-based OVD without requiring costly transfer-function measurements, we propose a data augmentation strategy based on simulated acoustic transfer functions (ATFs) that exposes the model to a wide range of spatial propagation conditions. A transformer-based classifier is first trained on analytically generated ATFs and then progressively fine-tuned using numerically simulated ATFs, transitioning from a rigid-sphere model to a detailed head-and-torso representation. This hierarchical adaptation enables the model to refine its spatial understanding while maintaining generalization. Experimental results show 95.52% accuracy on simulated head-and-torso test data. Under short-duration conditions, the model maintains 90.02% accuracy with one-second utterances. On real hearing aid recordings, the model achieves 80% accuracy without fine-tuning, aided by lightweight test-time feature compensation. This highlights the model's ability to generalize from simulated to real-world conditions, demonstrating practical viability and pointing toward a promising direction for future hearing aid design.
Primary: Victoria University of Wellington
All Institutions: Victoria University of Wellington, GN ReSound
The main contribution of this work is the introduction of a simulation-based framework for single microphone own voice detection in hearing aids, which effectively utilizes simulated acoustic transfer functions to enhance model training and generalization. This innovative approach not only addresses existing challenges in OVD but also sets a promising direction for future advancements in hearing aid technology.
The paper proposes a novel approach to own voice detection (OVD) in hearing aids using a single microphone by leveraging simulated acoustic transfer functions (ATFs) for data augmentation. The methodology is well-structured, involving a two-stage simulation-based ATF generation pipeline that transitions from a rigid-sphere model to a detailed head-and-torso representation. The use of a transformer-based classifier enhances the model's ability to learn from spatial propagation cues, which is a significant advancement over traditional methods that rely on multiple microphones or complex signal processing techniques. The hierarchical adaptation strategy employed to progressively fine-tune the model is a commendable aspect, allowing for improved generalization from simulated to real-world conditions.
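The hierarchical adaptation schedule (pre-train on analytic data, then gently fine-tune on higher-fidelity simulated data) can be illustrated with a toy two-stage training loop. The linear model, synthetic data, learning rates, and step counts below are all stand-ins for exposition; they are not the paper's classifier or ATF pipeline.

```python
import numpy as np

def sgd_stage(w, X, y, lr, steps):
    """One training stage: plain full-batch gradient descent on MSE."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(X)
        w = w - lr * grad
    return w

rng = np.random.default_rng(1)
w_true = rng.normal(size=4)
# Stage-1 data: cheap "analytic" domain.
X_a = rng.normal(size=(200, 4))
y_a = X_a @ w_true + 0.1 * rng.normal(size=200)
# Stage-2 data: slightly shifted, higher-fidelity "numeric" domain.
X_n = rng.normal(size=(200, 4))
y_n = X_n @ (w_true + 0.05) + 0.1 * rng.normal(size=200)

w = np.zeros(4)
w = sgd_stage(w, X_a, y_a, lr=0.05, steps=100)   # stage 1: broad pre-training
w = sgd_stage(w, X_n, y_n, lr=0.01, steps=100)   # stage 2: gentle fine-tune
final_loss = float(np.mean((X_n @ w - y_n) ** 2))
```

The lower stage-2 learning rate mirrors the idea of refining, rather than overwriting, what the first stage learned; the fine-tuned model ends close to the shifted domain's optimum.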
The experimental results demonstrate high accuracy rates, achieving 95.52% on simulated head-and-torso test data and 80% on real hearing aid recordings without fine-tuning. The use of diverse datasets, including VoxCeleb1 and LibriSpeech, alongside real-world recordings, adds robustness to the evaluation. The model's performance under varying noise conditions was also assessed, showcasing its resilience, which is crucial for practical applications in hearing aids.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific URLs or repositories for code and data, which would enhance reproducibility. The absence of a demo or project URL limits the ability for other researchers to replicate the findings directly. However, the comprehensive description of the data augmentation process and model training strategies offers a solid foundation for future implementations.
One limitation is the reliance on simulated data for training, which may not fully capture the complexities of real-world acoustic environments. The model's performance on real recordings, while promising, may still be affected by factors not accounted for in the simulations. Additionally, the study focuses on offline segment-level detection, leaving out considerations for real-time applications, which are critical in hearing aid technology.
The proposed method has significant implications for the design of hearing aids, particularly in enhancing user comfort and speech intelligibility without increasing device complexity or cost. By enabling effective OVD with a single microphone, this research could lead to more accessible hearing aid solutions for individuals with hearing impairments, potentially improving their quality of life.
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact models understudied. We present RAPTOR (Representation-Aware Pairwise-gated Transformer for Out-of-domain Recognition), a controlled study of compact SSL backbones from the HuBERT and WavLM families within a unified pairwise-gated fusion detector, evaluated across 14 cross-domain benchmarks. We show that multilingual HuBERT pre-training is the primary driver of cross-domain robustness, enabling 100M-parameter models to match larger and commercial systems. Beyond EER, we introduce a test-time augmentation protocol with perturbation-based aleatoric uncertainty to expose calibration differences invisible to standard metrics: WavLM variants exhibit overconfident miscalibration under perturbation, whereas iterative mHuBERT remains stable. These findings indicate that SSL pre-training trajectory, not model scale, drives reliable audio deepfake detection.
Primary: Idiap Research Institute
All Institutions: Idiap Research Institute, Tallinn University of Technology
The main contribution of this paper is the demonstration that compact SSL backbones can achieve competitive performance in audio deepfake detection through careful pre-training strategies, while also introducing a novel method for assessing model calibration under distributional shifts. This work significantly advances the understanding of how SSL pre-training affects model robustness and reliability in practical applications.
The paper introduces RAPTOR, a pairwise-gated hierarchical layer-fusion architecture, to evaluate the performance of compact self-supervised learning (SSL) backbones for audio deepfake detection. The methodology is robust, employing a controlled experimental setup where only the SSL encoder is varied while keeping the downstream detection framework constant. This approach allows for a clear analysis of the impact of different pre-training strategies on model performance. The introduction of test-time augmentation (TTA) for uncertainty estimation is particularly noteworthy, as it provides a novel way to assess model calibration beyond traditional metrics.
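The TTA-based uncertainty idea can be sketched simply: score several perturbed copies of an input and report the spread of the scores as a perturbation-based uncertainty estimate. The `detector` callable and the additive Gaussian noise are illustrative assumptions, not RAPTOR's actual model or augmentations.

```python
import numpy as np

def tta_uncertainty(detector, audio, n_aug=8, noise_std=0.01, seed=0):
    """Run the detector on n_aug perturbed copies of `audio`; return the
    mean score and the score spread as an aleatoric-uncertainty proxy."""
    rng = np.random.default_rng(seed)
    scores = np.array([
        detector(audio + noise_std * rng.normal(size=audio.shape))
        for _ in range(n_aug)
    ])
    return float(scores.mean()), float(scores.std())

toy_detector = lambda x: float(x.mean())   # stand-in for a real classifier
mean_score, spread = tta_uncertainty(toy_detector, np.zeros(1000))
```

A well-calibrated detector should show small spread on inputs it scores confidently; large spread under mild perturbation is the overconfidence signal the paper reports for some backbones.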
The authors conduct extensive experiments across 14 cross-domain benchmarks, which is a significant contribution to the field as it highlights the robustness of the proposed models under varying conditions. The results demonstrate that the compact models can achieve competitive performance compared to larger models, which is an important finding for practical applications. The use of multiple evaluation metrics, including EER and pooled EER, adds depth to the analysis and provides a comprehensive view of model performance.
The paper provides sufficient implementation details, including training protocols, datasets, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the findings.
One limitation of the study is the reliance on specific datasets for training and evaluation, which may not fully capture the diversity of real-world audio deepfake scenarios. Additionally, the paper acknowledges the need for further investigation into the sensitivity-diversity trade-off observed in the final mHuBERT checkpoint.
The findings of this research have significant implications for the field of audio deepfake detection, particularly in enhancing the reliability of detection systems in real-world applications. The emphasis on model calibration and the effectiveness of compact models could lead to more accessible and efficient solutions for combating audio deepfakes.
Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors complicate reliable detection in real-world environments, underscoring the need for representative evaluation benchmarks. To this end, we introduce ML-ITW (Multilingual In-The-Wild), a multilingual dataset covering 14 languages, seven major platforms, and 180 public figures, totaling 28.39 hours of audio. We evaluate three detection paradigms: end-to-end neural models, self-supervised feature-based (SSL) methods, and audio large language models (Audio LLMs). Experimental results reveal significant performance degradation across diverse languages and real-world acoustic conditions, highlighting the limited generalization ability of existing detectors in practical scenarios. The ML-ITW dataset is publicly available.
Primary: Wuhan University
All Institutions: Wuhan University
The main contribution of this work is the introduction of the ML-ITW dataset, which provides a realistic benchmark for evaluating speech deepfake detection systems across multiple languages and platforms. This comprehensive analysis of the technical contribution, methodology, and significance to the field underscores the pressing need for improved detection mechanisms in the face of rapidly advancing speech synthesis technologies.
The paper introduces a novel dataset, ML-ITW, which is a significant advancement in the field of speech deepfake detection. The methodology for dataset construction is robust, utilizing a diverse range of social media platforms and languages, which enhances the realism of the evaluation scenarios. The evaluation of various detection paradigms, including end-to-end models, self-supervised methods, and audio large language models, is comprehensive and well-structured. The approach to validating spoofed samples is methodical, ensuring high-quality data for training and evaluation.
The experiments are thorough, comparing multiple models across different datasets, including ASVspoof2019-LA, ITW, and ML-ITW. The results clearly demonstrate the performance degradation of existing models when faced with real-world conditions, highlighting the limitations of current benchmarks. The use of standard metrics (EER, AUC, ACC, F1) adds rigor to the evaluation, although the paper could benefit from more detailed statistical analysis of the results to strengthen claims about generalization gaps.
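For readers unfamiliar with the headline metric, EER is the operating point where the false-acceptance and false-rejection rates coincide. A minimal threshold-sweep implementation, assuming higher scores mean bona fide speech, looks like:

```python
import numpy as np

def eer(bona_scores, spoof_scores):
    """Equal error rate via a sweep over candidate thresholds;
    fine for small score sets (production code would use ROC interpolation)."""
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    best_gap, best_eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)   # spoof accepted as bona fide
        frr = np.mean(bona_scores < t)     # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return float(best_eer)
```

On perfectly separable scores the sweep finds a threshold with zero error; overlapping score distributions push the EER toward 0.5.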
The paper provides sufficient details regarding the dataset construction, model training, and evaluation protocols, which would allow other researchers to replicate the experiments. However, the absence of a direct implementation link or code repository limits the ease of reproducibility.
One notable limitation is the relatively small sample size for some low-resource languages, which may affect the reliability of language-specific analyses. Additionally, while the dataset is comprehensive, the evolving nature of speech synthesis technologies means that the dataset may quickly become outdated, necessitating continuous updates.
The findings of this research have significant implications for the development of robust deepfake detection systems. By highlighting the importance of realistic evaluation benchmarks, the study encourages future research to focus on generalization across diverse conditions, ultimately contributing to the enhancement of security measures against identity impersonation and misinformation.
Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that self-supervised representation reconstruction (SSRR) loss fundamentally improves codec training and performance. First, SSRR significantly accelerates convergence, enabling competitive results using only a single GPU. Second, it enhances intelligibility by reconstructing distilled self-supervised representations from codec outputs. Third, SSRR enables high intelligibility without additional lookahead in streaming Transformer-based codecs, allowing a zero-lookahead architecture for real-time deployment. As a result, our JHCodec achieves state-of-the-art performance while maintaining minimal latency and reduced training cost. We open-source the full implementation, training pipeline, and demo on Github https://github.com/jhcodec843/jhcodec.
Primary: University of Southern California
All Institutions: University of Southern California, Johns Hopkins University
The main contribution of this paper is the introduction of a self-supervised representation reconstruction loss that significantly enhances the performance of neural audio codecs in terms of intelligibility and latency. This work represents a meaningful advancement in the field of audio processing, providing a practical solution for real-time applications while also contributing to the theoretical understanding of codec training methodologies.
The paper introduces a novel self-supervised representation reconstruction (SSRR) loss that improves the training of neural audio codecs. The methodology is well-articulated, detailing how SSRR enhances convergence speed and intelligibility without requiring additional lookahead in streaming architectures. The approach is innovative in its focus on reconstructing self-supervised representations, which is a departure from traditional methods that prioritize mel-spectrogram reconstruction. The use of a single GPU for competitive results is a significant practical consideration that enhances the method's appeal for real-world applications.
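The core idea — supervising the codec on self-supervised features of its own reconstruction rather than only on mel-spectrograms — can be sketched as follows. This is an illustrative, hedged sketch, not the JHCodec implementation: `ssl_features` is a stand-in for a frozen distilled SSL encoder (here just a fixed random linear projection of 10 ms frames), and the frame length and dimensions are assumptions.

```python
import numpy as np

def ssl_features(wave, dim=8, seed=0):
    """Stand-in for a frozen self-supervised encoder (e.g. a distilled
    HuBERT-style model). Here: frame the waveform into 160-sample hops
    and apply a fixed random linear projection. Purely illustrative."""
    rng = np.random.default_rng(seed)
    frames = wave[: len(wave) // 160 * 160].reshape(-1, 160)
    proj = rng.standard_normal((160, dim))
    return frames @ proj

def ssrr_loss(original, reconstructed):
    """SSRR loss sketch: distance between SSL features of the input
    audio and of the codec's reconstruction. Penalizing feature-space
    mismatch (rather than only spectrogram mismatch) pushes the codec
    to preserve linguistic content, i.e. intelligibility."""
    f_orig = ssl_features(original)
    f_rec = ssl_features(reconstructed)
    return float(np.mean((f_orig - f_rec) ** 2))
```

In training, this term would be added to the usual reconstruction and adversarial losses; because the SSL encoder is frozen, it adds no inference-time cost.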
The experiments conducted are robust, demonstrating the effectiveness of SSRR through comparative analysis with existing methods. The results indicate that the proposed JHCodec achieves state-of-the-art performance, particularly in terms of intelligibility and latency, which are critical metrics in audio codec applications. However, specific details regarding the datasets used and the metrics for evaluation could be elaborated further to strengthen the experimental validation.
The authors have taken steps to ensure reproducibility by open-sourcing the full implementation and training pipeline, which is commendable. The availability of a demo on GitHub allows for practical testing of the proposed system, although the paper could benefit from a more detailed description of the training process and hyperparameters used.
One limitation noted is the reliance on self-supervised representations, which may not generalize well across all types of audio content. Additionally, while the zero-lookahead architecture is advantageous for real-time applications, it may impose constraints on the complexity of the audio being processed. The paper could also discuss potential trade-offs between intelligibility and other audio quality metrics, such as fidelity.
The implications of this work are significant for applications in real-time audio processing, such as telecommunication and streaming services. By achieving high intelligibility with low latency, the proposed codec could enhance user experiences in various audio-related fields. Furthermore, the open-source nature of the project encourages further research and development in neural audio codecs, potentially leading to broader advancements in the field.
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant acoustic patterns rather than preserving paralinguistic attributes. We propose supervised finetuning with neutral-emotion utterance pairs from the same speaker, combined with frame-level emotion distillation on acoustic token hidden states. All modifications are confined to finetuning, which takes less than 2 hours on 4 GPUs and adds zero inference latency overhead, while maintaining a competitive 180ms streaming latency. On the VoicePrivacy 2024 protocol, our approach achieves a 49.2% UAR (emotion preservation) with 5.77% WER (intelligibility), a +24% relative UAR improvement over the baseline (39.7%->49.2%) and +10% over the emotion-prompt variant (44.6% UAR), while maintaining strong privacy (EER 49.0%). Demo and code are available: https://anonymous3842031239.github.io/
Primary: Institute for Infocomm Research (I2R)
All Institutions: Institute for Infocomm Research (I2R), Nanyang Technological University, The Hong Kong Polytechnic University
The main contribution of this paper is the introduction of a novel supervised finetuning approach combined with frame-level emotion distillation for emotion-preserving streaming speaker anonymization, which significantly improves emotion retention while maintaining privacy and intelligibility. The technical contributions and rigorous methodology present a meaningful advancement in the field of audio processing and speaker anonymization.
The methodology presented in this paper is innovative, focusing on supervised finetuning with neutral-emotion utterance pairs and frame-level emotion distillation. This dual approach effectively addresses the limitations of existing neural audio codec models in preserving emotional content during speaker anonymization. The use of neutral-emotion pairs ensures that the model learns to generate emotional outputs without relying on emotional prompts, which can be difficult to obtain. The design choice to apply emotion distillation to the acoustic branch rather than the semantic branch is a significant improvement that allows for cleaner gradient flow and better emotion preservation.
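Frame-level emotion distillation on acoustic hidden states can be illustrated with a minimal sketch. This is not the authors' implementation: the linear head `proj`, the shapes, and the MSE objective are assumptions; the idea is only that per-frame hidden states of the acoustic branch are regressed onto frozen teacher SER embeddings.

```python
import numpy as np

def emotion_distill_loss(hidden, teacher_emb, proj):
    """Frame-level emotion distillation (hedged sketch). `hidden` holds
    acoustic-token hidden states (T, d_h); `proj` is a learned linear
    head mapping them into the teacher SER embedding space (d_h, d_e);
    `teacher_emb` (T, d_e) are frozen teacher features. The loss is the
    per-frame mean squared error, encouraging the acoustic branch to
    carry the source emotion without touching the semantic branch."""
    pred = hidden @ proj
    return float(np.mean((pred - teacher_emb) ** 2))
```

In finetuning this term would be weighted against the usual token-prediction loss; since only the head and existing weights are tuned, inference latency is unchanged.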
The experiments are well-structured, adhering to the VoicePrivacy 2024 protocol, which allows for direct comparison with prior works. The results show a substantial improvement in emotion preservation (UAR) and privacy (EER) while maintaining competitive intelligibility (WER). The ablation studies provide clear evidence of the contributions of each component of the proposed method, reinforcing the claims made in the paper. The dataset used for training and evaluation is appropriate, although the reliance on acted speech corpora may limit generalizability.
The paper provides sufficient details regarding the implementation, including the training setup, data preprocessing, and evaluation metrics. However, the absence of a public code repository limits reproducibility. The authors mention that the demo is available, which is a positive aspect, but a comprehensive project URL would enhance reproducibility further.
The paper acknowledges several limitations, including the reliance on a single SER evaluator, the lack of subjective listening tests, and the evaluation being restricted to acted speech corpora. These factors could affect the generalizability and real-world applicability of the findings. Additionally, the gap in performance compared to offline methods suggests that further improvements are needed for practical deployment.
The proposed method has significant implications for privacy-preserving applications in various domains, including teleconferencing, call centers, and online mental health counseling. By effectively anonymizing speaker identity while preserving emotional content, this research addresses a critical need for maintaining communication effectiveness in sensitive contexts. The approach could pave the way for more sophisticated anonymization techniques that balance privacy and emotional expressiveness.
Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, adapting a pretrained LLM-based TTS model using weakly time-aligned data. Specifically, the model is adapted to learn early stopping at specified content boundaries when provided with limited future text. During inference, a sliding-window prompt carries forward previous text and speech tokens, ensuring bounded context and seamless concatenation. Evaluations show our method outperforms a CosyVoice-style interleaved baseline in both short and long-form scenarios. In long-text synthesis, especially, it achieves a 66.2-percentage-point absolute reduction in word error rate (from 71.0% to 4.8%) and increases speaker and emotion similarity by 16.1% and 1.5% relatively, offering a robust solution for streaming TTS with incremental text.
Primary: Southeast University
All Institutions: Southeast University, Nanyang Technological University, Tianjin University
This paper presents a boundary-aware post-training strategy for streaming LLM-based text-to-speech with streaming text input. The proposed methodology effectively addresses the challenges of prosody and long-form stability in TTS systems, making a meaningful contribution to the field of audio machine learning.
The proposed methodology introduces a novel prosodic-boundary-aware adaptation strategy that leverages weakly time-aligned data to enhance streaming TTS systems. The bifurcated sequence input with a prosodic-boundary marker allows for improved prosody while maintaining contextual integrity. The sliding-window prompt mechanism effectively manages the context length, preventing unbounded growth and ensuring seamless audio generation. The approach avoids complex architectural modifications, which is a significant advantage. However, the reliance on weakly aligned data raises questions about the generalizability of the method across different datasets and languages.
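The sliding-window mechanism can be made concrete with a toy synthesis loop. This is a hedged sketch, not the paper's system: `gen` stands in for the LLM-based TTS call (its signature is an assumption), the window sizes are arbitrary, and text is tokenized as whole words for simplicity.

```python
def stream_tts(text_chunks, gen, max_text=32, max_speech=64):
    """Sliding-window streaming synthesis loop (illustrative sketch).
    `gen(text_ctx, speech_ctx, chunk)` is an assumed model call that
    returns speech tokens for the incoming text chunk. After each step
    only the most recent text and speech tokens are carried forward, so
    the prompt stays bounded no matter how long the input stream runs,
    while the carried-over tokens keep concatenation seamless."""
    text_ctx, speech_ctx, audio = [], [], []
    for chunk in text_chunks:
        tokens = gen(text_ctx, speech_ctx, chunk)
        audio.extend(tokens)
        text_ctx = (text_ctx + list(chunk))[-max_text:]
        speech_ctx = (speech_ctx + tokens)[-max_speech:]
        # Invariant: context never grows past the window bounds.
        assert len(text_ctx) <= max_text and len(speech_ctx) <= max_speech
    return audio
```

The bounded-context invariant is exactly what prevents the long-form collapse the paper describes: generation quality no longer depends on total utterance length.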
The experiments are well-structured, utilizing both objective and subjective metrics to evaluate performance. The use of the Seed-TTS-Eval benchmark for both standard and long-form evaluations provides a comprehensive assessment of the proposed method's effectiveness. The significant improvements in WER, speaker similarity, and emotional consistency demonstrate the robustness of the approach. However, the paper could benefit from a more extensive comparison with additional state-of-the-art methods to further validate its superiority.
The paper provides sufficient details on the experimental setup, including dataset descriptions, evaluation metrics, and baseline comparisons. However, the lack of a publicly available code repository limits reproducibility. Future work should consider sharing the implementation to facilitate validation by the research community.
One limitation is the dependency on weakly time-aligned data, which may not be available for all languages or datasets. Additionally, while the results are promising, the method's performance in highly variable or noisy environments has not been tested. The paper also does not address the potential computational costs associated with the proposed adaptations.
The advancements in streaming TTS systems have significant implications for interactive applications such as virtual assistants, real-time translation, and accessibility tools. The ability to generate natural-sounding speech with minimal latency can enhance user experience and broaden the applicability of TTS technologies in various domains.
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We propose Whisper-CD, a training-free contrastive decoding framework that contrasts clean-audio logits against negative logits computed from three acoustically motivated perturbations: Gaussian noise injection, silence signal, and audio temporal shift. We aggregate these negatives via the log-sum-exp operator, building a unified multi-negative objective for token-by-token decoding. Across five English long-form benchmarks, Whisper-CD reduces WER by up to 24.3pp on CORAAL and shows 48% faster token generation throughput than beam search. Because Whisper-CD operates purely at inference time, it can be applied as a drop-in replacement to already-deployed Whisper systems without retraining.
Primary: Sungkyunkwan University
All Institutions: Sungkyunkwan University
The main contribution of this paper is the introduction of Whisper-CD, a contrastive decoding framework that significantly improves long-form speech recognition accuracy without the need for retraining. This work is a meaningful advancement in the field, addressing critical issues in existing models and offering a practical solution that can be readily adopted in deployed systems.
The proposed Whisper-CD framework introduces a novel approach to long-form speech recognition by employing a training-free contrastive decoding method. This method contrasts clean-audio logits with negative logits derived from various perturbations, which is innovative in the context of improving the robustness of speech recognition systems. The use of the log-sum-exp operator to aggregate negative samples is a thoughtful choice that enhances the decoding process. However, the paper could benefit from a more detailed explanation of the perturbations chosen and their specific impact on the model's performance.
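The log-sum-exp aggregation admits a compact sketch. This illustrates the general multi-negative contrastive-decoding idea rather than the authors' exact formulation: `clean` is the per-token logit vector from unperturbed audio, each negative comes from a perturbed copy (noise, silence, temporal shift), and the penalty weight `alpha` is an assumption.

```python
import numpy as np

def contrastive_logits(clean, negatives, alpha=1.0):
    """Multi-negative contrastive decoding (sketch). Negative logit
    vectors (one per perturbation, each shape (V,)) are aggregated with
    log-sum-exp and subtracted from the clean logits. Tokens the model
    would emit even with degraded audio (hallucinations, repetition
    loops) score high under the negatives and are down-weighted."""
    neg = np.stack(negatives)                   # (K, V)
    neg_lse = np.logaddexp.reduce(neg, axis=0)  # (V,) log-sum-exp
    return clean - alpha * neg_lse

def greedy_step(clean, negatives):
    """One token-by-token decoding step under the contrastive objective."""
    return int(np.argmax(contrastive_logits(clean, negatives)))
```

Because only logits are touched, this slots into an existing decoding loop without retraining, matching the paper's drop-in claim.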
The paper presents a comprehensive evaluation across five English long-form benchmarks, demonstrating a significant reduction in word error rate (WER) and improved token generation throughput. The results are compelling, particularly the 24.3 percentage point reduction in WER on the CORAAL dataset and the 48% increase in throughput compared to traditional beam search methods. However, additional details on the datasets used, including their characteristics and the specific metrics employed, would strengthen the experimental section.
The paper lacks sufficient implementation details, such as hyperparameters, specific configurations of the Whisper model, and the exact nature of the perturbations applied. Without these details, it may be challenging for other researchers to reproduce the results. Including a link to a code repository or supplementary material would greatly enhance reproducibility.
One limitation of the Whisper-CD approach is that it operates purely at inference time, which, while advantageous for deployment, may limit its adaptability to different audio conditions or languages without retraining. Additionally, the reliance on specific perturbations may not generalize across all types of audio inputs, potentially affecting performance in diverse real-world scenarios.
The proposed method has significant implications for the field of speech recognition, particularly in applications requiring high accuracy over long-form audio, such as transcription services, media content creation, and accessibility technologies. By improving the reliability of long-form speech recognition systems, Whisper-CD can enhance user experience and broaden the adoption of such technologies.
Accent variability remains a major source of error in automatic speech recognition, yet most adaptation methods rely on parameter fine-tuning without understanding where accent information is encoded. We treat accent variation as an interpretable subspace in hidden representations and investigate whether it can be identified and controlled directly in activation space. We extract layer-wise encoder activations and estimate mean-shift directions capturing accent-induced representation shifts. By injecting these directions into individual layers and measuring how they align accented and standard embeddings, we derive a layer-wise accent sensitivity profile, revealing that accent information concentrates in a narrow band of middle encoder layers. Leveraging this structure, we further introduce parameter-free accent steering that modifies representations during inference without updating model weights. Experiments across eight accents show consistent word error rate reductions.
Primary: The University of Melbourne
All Institutions: Wuhan University, The University of Melbourne
The main contribution of this paper is the introduction of a novel method for accent adaptation in speech recognition models that operates in activation space, providing a deeper understanding of accent representation in neural networks. The approach is innovative and has the potential to significantly improve the performance of speech recognition systems across diverse accents, marking a meaningful advancement in the field.
The methodology presented in this paper is innovative as it shifts the focus from traditional parameter fine-tuning to a more interpretable approach that directly manipulates activation space for accent adaptation. The authors successfully extract layer-wise encoder activations and compute mean-shift directions to capture accent-induced shifts, which is a novel contribution to the understanding of how accents are encoded in neural networks. The introduction of parameter-free accent steering is particularly noteworthy, as it allows for real-time adjustments during inference without the need for retraining, which could have significant practical implications.
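The mean-shift estimation and steering step reduce to a few lines. A hedged sketch under assumed shapes: each array holds pooled hidden states from one encoder layer, (N, d) over N utterances, and the steering strength `alpha` would in practice be tuned per layer.

```python
import numpy as np

def accent_direction(accented, standard):
    """Mean-shift direction at one encoder layer: the vector pointing
    from the accented activation centroid to the standard one.
    Both arrays: (N, d) pooled hidden states."""
    return standard.mean(axis=0) - accented.mean(axis=0)

def steer(hidden, direction, alpha=1.0):
    """Parameter-free accent steering at inference: translate the
    layer's hidden states along the accent direction. No model weights
    are updated; this is a pure activation-space intervention."""
    return hidden + alpha * direction
```

Applying `steer` only in the narrow band of middle layers where the sensitivity profile peaks is what makes the intervention both cheap and targeted.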
The experiments conducted across eight different accents provide a robust evaluation of the proposed method. The consistent reductions in word error rates across these accents demonstrate the effectiveness of the approach. However, the paper could benefit from a more detailed description of the datasets used, including their size, diversity, and how they were selected. Additionally, comparisons with existing state-of-the-art methods would strengthen the validation of the proposed technique.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the methodology is described, specific hyperparameters, training procedures, and the architecture of the models used in experiments are not adequately detailed. Providing a link to a code repository or supplementary materials would enhance reproducibility.
One limitation of the study is the potential overfitting to the specific accents tested, as the results may not generalize to other accents or dialects not included in the experiments. Additionally, the approach relies on the assumption that accent information is concentrated in specific layers, which may not hold true for all architectures or tasks. The paper also does not address the computational efficiency of the proposed method during inference.
The findings of this research could have significant implications for the development of more inclusive and accurate speech recognition systems, particularly in multilingual and multicultural contexts. By improving accent adaptation, the proposed method could enhance user experience and accessibility in various applications, from virtual assistants to automated transcription services.
Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances.
Primary: University of Southern California
All Institutions: University of Southern California
This paper presents a pioneering approach to simultaneously capture real-time MRI, EEG, and surface EMG data during speech production, offering valuable insights into the neurophysiological processes underlying speech. The innovative artifact suppression techniques and the potential applications in BCI and speech science highlight its significance in advancing the field.
The methodology presented in this paper is innovative, combining real-time MRI, EEG, and surface EMG to capture the complex dynamics of speech production. The authors have developed a multi-stage denoising pipeline to address significant technical challenges, including electromagnetic interference and myogenic artifacts. The use of canonical correlation analysis (CCA) for artifact removal is particularly noteworthy, as it allows for the effective suppression of non-neural signals while preserving the underlying neural activity. However, the methodology could benefit from further validation across a larger cohort to establish its robustness and generalizability.
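For intuition, artifact suppression of this kind can be approximated by a regression step that projects reference-correlated activity out of the EEG. This is a deliberately simplified stand-in for the CCA-based pipeline described above, not the authors' method: shapes, centering, and the use of plain least squares are all assumptions.

```python
import numpy as np

def regress_out(eeg, ref):
    """Regression-based artifact suppression (simplified sketch; the
    paper uses CCA). eeg: (channels, T) recorded signal; ref: (n_ref, T)
    artifact references, e.g. EMG channels or a gradient-artifact
    template. Fits each EEG channel on the references by least squares
    and subtracts the fitted artifact, leaving the residual orthogonal
    to the reference signals."""
    eeg = eeg - eeg.mean(axis=1, keepdims=True)
    ref = ref - ref.mean(axis=1, keepdims=True)
    # Regression coefficients B = eeg ref^T (ref ref^T)^+
    beta = eeg @ ref.T @ np.linalg.pinv(ref @ ref.T)
    return eeg - beta @ ref
```

CCA generalizes this by finding maximally correlated component pairs between the two recordings before removal, which is better suited to the broadband, spatially structured MRI and myogenic artifacts in the tri-modal setup.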
The experimental design is well-structured, focusing on a single subject to explore the feasibility of simultaneous data acquisition. The tasks are clearly defined, and the results demonstrate the effectiveness of the artifact removal techniques. However, the reliance on a single participant limits the generalizability of findings. The authors provide thorough comparisons of EEG signals before and after denoising, showcasing significant improvements in signal quality, which is a strong point of the experimental evaluation.
The paper provides detailed descriptions of the experimental setup, data acquisition methods, and artifact correction techniques, which are essential for reproducibility. However, the lack of a publicly available dataset or code repository hinders full reproducibility of the results. Future work should include sharing data and methodologies to allow other researchers to validate and build upon these findings.
The primary limitations include the small sample size (single-subject study), which restricts the ability to generalize findings. Additionally, the use of passive electrodes may introduce higher noise levels compared to active electrodes, potentially affecting data quality. The EEG cap's design may not be optimal for capturing speech-specific brain activity, and residual artifacts from the EMG setup could still influence results. Lastly, the potential impact of scanner noise and visual stimuli on neural activity remains a concern.
This research has significant implications for both speech neuroscience and brain-computer interface (BCI) technologies. By providing a comprehensive view of the neural, muscular, and articulatory components of speech production, the findings could lead to advancements in silent speech interfaces and improved understanding of speech disorders. The methodology could pave the way for future studies exploring the intricacies of speech planning and execution, potentially transforming approaches to speech rehabilitation and communication technologies.
Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. We use TinyVox to train BabAR, a cross-linguistic phoneme recognition system for child speech. We find that pretraining the system on multilingual child-centered daylong recordings substantially outperforms alternatives, and that providing 20 seconds of surrounding audio context during fine-tuning further improves performance. Error analyses show that substitutions predominantly fall within the same broad phonetic categories, suggesting suitability for coarse-grained developmental analyses. We validate BabAR by showing that its automatic measures of speech maturity align with developmental estimates from the literature.
Primary: Harvard University
All Institutions: Harvard University, Massachusetts Institute of Technology
The paper presents BabAR, a pioneering phoneme recognition system for child speech, demonstrating significant advancements in automatic phonetic analysis through innovative methodology and extensive experimental validation.
The paper introduces a novel phoneme recognition system, BabAR, tailored for child speech, leveraging a large-scale dataset, TinyVox, which encompasses diverse languages and extensive child vocalizations. The methodology includes pretraining on multilingual data and context-aware fine-tuning, which are innovative approaches in the domain of child speech recognition. The use of Connectionist Temporal Classification (CTC) for sequence-to-sequence tasks is appropriate given the challenges of variable-length outputs in phoneme recognition. The systematic evaluation of different self-supervised models and the exploration of context duration for improving recognition accuracy are well-structured and contribute significantly to the methodology.
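CTC's fit for variable-length phoneme output is easiest to see in its standard greedy decoding rule, which is generic to any CTC model (not BabAR-specific): merge repeated frame labels, then drop blanks.

```python
def ctc_collapse(frame_labels, blank="-"):
    """Greedy CTC decoding (standard collapse rule). Takes per-frame
    argmax labels and produces a variable-length phoneme sequence:
    consecutive repeats are merged, blank frames are removed. A blank
    between two identical labels is what lets CTC emit true repeats."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out
```

This is why CTC suits child vocalizations, whose phoneme sequences vary widely in length relative to the audio frames.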
The experiments are robust, comparing BabAR against state-of-the-art phoneme recognition systems and demonstrating significant performance improvements. The paper provides detailed error analysis, illustrating that BabAR's substitutions tend to remain within phonetic categories, which is crucial for developmental analyses. The validation of BabAR's performance against a longitudinal dataset supports its practical applicability in developmental research. However, the paper could benefit from more extensive comparisons with existing systems and additional metrics beyond phoneme error rates to fully capture the model's effectiveness.
The authors provide sufficient implementation details, including model architecture, training procedures, and evaluation metrics, which enhance reproducibility. The availability of the dataset and code on GitHub is a significant step towards ensuring that other researchers can replicate the study and build upon it. However, the paper could improve by including more explicit instructions for setting up the environment and dependencies.
The study acknowledges challenges in phonetic transcription, particularly the subjective nature of human annotation and the presence of competing signals in naturalistic recordings. While BabAR shows promise, the reliance on coarse-grained measures for validation may not guarantee accuracy at the individual level, which is critical for clinical applications. Additionally, the dataset's diversity in terms of language and age could introduce variability that may affect generalization.
The development of BabAR and TinyVox has the potential to revolutionize the study of early speech development by enabling large-scale, automated phonetic analysis. This could facilitate early detection of speech and language delays, enhance cross-linguistic studies, and improve educational tools for language learning. The integration of advanced machine learning techniques with developmental science opens up new avenues for research and practical applications in child language acquisition.
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction ($\sim$13s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT $\rightarrow$ LLM $\rightarrow$ TTS, where each component streams its output to the next; and (3) the key to ``realtime'' is not any single fast model but rather \textit{streaming and pipelining} across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947ms (best case 729ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
Primary: Salesforce AI Research
All Institutions: Salesforce AI Research
The paper provides a comprehensive tutorial for building enterprise-grade realtime voice agents from scratch, emphasizing the importance of streaming and pipelining in achieving low latency. The technical contributions and methodology are significant, offering valuable insights and practical tools for researchers and practitioners in the field of audio machine learning.
The paper presents a systematic approach to building enterprise-grade realtime voice agents by dissecting the components of speech-to-text (STT), language model (LLM), and text-to-speech (TTS) into a cascaded streaming pipeline. The authors emphasize the importance of streaming and pipelining rather than relying on a single fast model, which is a critical insight for achieving low latency in voice interactions. The tutorial format is effective, providing a step-by-step guide that includes empirical evaluations of various models, thus making the methodology accessible and practical for developers.
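The streaming-and-pipelining insight can be demonstrated with toy generator stages standing in for the real services (Deepgram, a vLLM-served LLM, ElevenLabs); the stage functions below are illustrative placeholders, not those APIs.

```python
def stt(audio_chunks):
    """Streaming STT stand-in: emits a partial transcript per chunk."""
    for chunk in audio_chunks:
        yield f"text({chunk})"

def llm(text_stream):
    """Streaming LLM stand-in: emits a response piece per input piece."""
    for text in text_stream:
        yield f"reply({text})"

def tts(token_stream):
    """Streaming TTS stand-in: emits an audio chunk per token."""
    for tok in token_stream:
        yield f"audio({tok})"

def pipeline(audio_chunks):
    """Cascaded streaming pipeline STT -> LLM -> TTS. Because every
    stage consumes the previous stage's stream lazily, the first audio
    chunk is produced after only the *first* chunk has traversed all
    three stages, not after the whole utterance is processed. This,
    rather than any single fast model, is what bounds time-to-first-audio."""
    return tts(llm(stt(audio_chunks)))

# Pulling one item drives exactly one chunk through all three stages.
first_audio = next(pipeline(iter(["c1", "c2", "c3"])))
```

With real services, each stage would additionally run concurrently (async tasks or threads) so later chunks overlap with earlier synthesis; the lazy-composition structure stays the same.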
The experiments conducted are robust, comparing the performance of native speech-to-speech models against a cascaded pipeline approach. The authors provide detailed latency measurements for each component, demonstrating the effectiveness of their proposed architecture in achieving sub-1-second time-to-first-audio (TTFA). The empirical results are well-documented, showcasing the advantages of their approach in real-world scenarios, which adds credibility to their findings.
The paper includes a comprehensive codebase released as open-source, which is a significant advantage for reproducibility. The detailed tutorial format, along with the release of tested code for each component, allows other researchers and practitioners to replicate the results and build upon the work. However, the paper could benefit from clearer documentation on the specific environments and dependencies required to run the code effectively.
One limitation noted is the reliance on cloud APIs for some components, which may introduce variability in performance due to network latency. Additionally, the findings are based on specific models and configurations, which may not generalize across all potential implementations. The authors also acknowledge that native speech-to-speech models are not yet viable for real-time applications, which highlights the current constraints in the field.
This work has significant implications for the development of voice-based AI agents in enterprise settings, particularly in applications such as customer service, healthcare, and task management. By providing a clear framework and practical guidance, the paper can facilitate the adoption of real-time voice agents, potentially transforming user interactions across various industries.
While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the imperceptible waveform variations used in prior watermarking methods. To address this limitation, we propose Latent-Mark, the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the codec's invariant latent space. We achieve this by optimizing the audio waveform to induce a detectable directional shift in its encoded latent representation, while constraining perturbations to align with the natural audio manifold to ensure imperceptibility. To prevent overfitting to a single codec's quantization rules, we introduce Cross-Codec Optimization, jointly optimizing the waveform across multiple surrogate codecs to target shared latent invariants. Extensive evaluations demonstrate robust zero-shot transferability to unseen neural codecs, achieving state-of-the-art resilience against traditional DSP attacks while preserving perceptual imperceptibility. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
Primary: National Taiwan University
All Institutions: National Taiwan University, CyCraft AI Lab, MoonShine Animation Studio, RIKEN Center for Computational Science
The main contribution of this paper is the introduction of Latent-Mark, a novel zero-bit audio watermarking framework that effectively survives neural resynthesis by embedding watermarks within the latent space of audio codecs. This work represents a meaningful advancement in audio watermarking, addressing vulnerabilities posed by modern neural codecs and providing a foundation for future research in universal watermarking techniques.
The methodology presented in Latent-Mark is innovative, leveraging the concept of embedding watermarks within the invariant latent space of neural audio codecs. The approach of optimizing the audio waveform to induce a detectable shift while ensuring imperceptibility is a significant advancement over traditional methods. The introduction of Cross-Codec Optimization is particularly noteworthy, as it addresses the challenge of overfitting to specific codec characteristics, enhancing the generalizability of the watermarking technique across different audio codecs.
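The paper does not give detection equations, but a zero-bit scheme based on a directional latent shift could plausibly be detected as follows. Everything here — the cosine test, the threshold `tau`, the 64-dim latent — is an invented illustration, not Latent-Mark's actual detector:

```python
import math, random

# Hypothetical sketch of zero-bit detection via a directional latent shift:
# the detector checks whether the codec latent of a clip has moved along a
# secret direction g. Names, dimensions, and threshold are illustrative only.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def detect(latent, reference_latent, secret_direction, tau=0.5):
    # shift = latent of the (possibly watermarked) clip minus the clean reference
    shift = [a - b for a, b in zip(latent, reference_latent)]
    return cosine(shift, secret_direction) > tau

rng = random.Random(0)
g = [rng.gauss(0, 1) for _ in range(64)]        # secret watermark direction
ref = [rng.gauss(0, 1) for _ in range(64)]      # latent of the clean clip
marked = [r + 0.3 * d for r, d in zip(ref, g)]  # latent nudged along g
other = [r - 0.3 * d for r, d in zip(ref, g)]   # nudged the opposite way

print(detect(marked, ref, g), detect(other, ref, g))  # True False
```

The Cross-Codec Optimization idea then amounts to optimizing the waveform so this shift survives the encode-decode path of several surrogate codecs at once.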
The paper provides extensive evaluations demonstrating the robustness of the proposed method against both traditional DSP attacks and neural resynthesis. The experiments are well-structured, showcasing the performance of Latent-Mark in various scenarios, including zero-shot transferability to unseen codecs. The results indicate a strong resilience to attacks while maintaining perceptual quality, which is crucial for practical applications.
The paper lacks detailed implementation specifics, such as code availability or datasets used for training and evaluation, which could hinder reproducibility. Providing a GitHub repository or links to datasets would significantly enhance the reproducibility of the results.
One limitation of the study is the potential dependency on the specific codecs chosen for Cross-Codec Optimization. While the method shows promise, its performance on a broader range of codecs, especially those not included in the training phase, remains to be fully explored. Additionally, the paper does not address the computational complexity of the optimization process, which could impact real-time applications.
The implications of this research are significant, as it opens avenues for secure audio transmission and copyright protection in an era where neural codecs are becoming prevalent. The ability to maintain watermark integrity against advanced generative models could have far-reaching applications in media, entertainment, and digital rights management.
Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.
Primary: University of Melbourne
All Institutions: University of Melbourne, Republic of Korea, School of Electrical Engineering
The paper presents the first ESDD challenge, providing a foundational framework for advancing the detection of environmental sound deepfakes. Its comprehensive methodology, extensive experimental results, and insights into future research directions mark a significant contribution to the field of audio deepfake detection.
The paper introduces a structured approach to environmental sound deepfake detection (ESDD) through the formulation of a challenge that includes a well-defined dataset (EnvSDD) and evaluation protocols. The methodology is robust, focusing on two distinct tracks that assess generalization across unseen generators and black-box scenarios, which are critical for real-world applications. The use of diverse audio generation models and the emphasis on cross-generator generalization are notable strengths. However, the paper could benefit from a more detailed explanation of the architectural choices made by the top-performing systems.
The experimental evaluation is comprehensive, with a large number of submissions (1,748) from 97 teams, indicating significant interest and engagement in the challenge. The results are systematically presented, showcasing the performance of baseline systems and top submissions across different tracks. The use of the Equal Error Rate (EER) as a metric is appropriate for the task, and the analysis of system design trends provides valuable insights into effective strategies for ESDD.
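For readers unfamiliar with the challenge's metric, the Equal Error Rate is the operating point where the false acceptance rate equals the false rejection rate. A from-scratch computation over toy scores (not challenge data) looks like this:

```python
# Illustrative EER computation: sweep thresholds and report the point
# where false acceptance rate (FAR) and false rejection rate (FRR) meet.

def eer(scores, labels):
    # labels: 1 = bonafide, 0 = fake; higher score = more bonafide-like
    best = None
    for t in sorted(set(scores)):
        fa = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t)
        fr = sum(1 for s, l in zip(scores, labels) if l == 1 and s < t)
        far = fa / labels.count(0)
        frr = fr / labels.count(1)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2)   # EER approximated as the midpoint
    return best[1]

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(eer(scores, labels))  # 1/3: at threshold 0.7, FAR = FRR = 1/3
```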
While the paper mentions the availability of the EnvSDD dataset and the challenge results, it lacks detailed implementation specifics that would facilitate reproducibility. The inclusion of code repositories or links to the actual implementations of the top-performing systems would enhance reproducibility and allow other researchers to build upon this work.
One limitation is the potential overfitting of models to specific generators, as indicated by performance degradation on unseen generators. Additionally, the challenge does not address the potential for adversarial attacks on detection systems, which could be a significant concern in practical applications. The reliance on a specific evaluation metric (EER) may also limit the understanding of model performance across different contexts.
The implications of this work are significant, as it addresses a growing concern in the realm of audio deepfakes, which can have serious consequences for public safety and misinformation. The establishment of a benchmark for ESDD could catalyze further research and development in this area, leading to more robust detection systems that can be applied in various real-world scenarios, including security and media verification.
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
Primary: Shanghai Jiao Tong University
All Institutions: National Taiwan University, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of TW-Sound580K, a specialized audio-text dataset, and the innovative methodologies for its curation and model adaptation, which significantly enhance the performance of audio-language models in localized contexts. The comprehensive approach to dataset construction and inference optimization represents a meaningful advancement in the field of machine learning for audio processing.
The paper introduces a robust methodology for constructing a large-scale audio-text dataset, TW-Sound580K, specifically targeting the unique acoustic characteristics of Taiwanese dialects. The Verify-Generate-Critique (VGC) protocol is a notable innovation, effectively addressing the challenges of data curation in a linguistically diverse context. The integration of Dual-ASR validation to filter and enhance the dataset quality is commendable, as it mitigates the risks of hallucinations in audio transcription. The dynamic Dual-ASR Arbitration mechanism further strengthens the inference process by selecting the most accurate transcription based on acoustic-conditioned perplexity, showcasing a thoughtful approach to model adaptation.
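The arbitration step can be sketched as a perplexity comparison between the two ASR hypotheses. Note the paper's score is acoustic-conditioned; this simplified sketch uses plain perplexity over made-up token log-probabilities:

```python
import math

# Hedged sketch of Dual-ASR arbitration: given two candidate transcriptions
# and per-token log-probabilities from a scoring model, keep the hypothesis
# with the lower perplexity. Log-probs below are invented for illustration.

def perplexity(token_logprobs):
    # PPL = exp(-mean log-probability); lower means the model finds it likelier
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def arbitrate(hyp_a, logps_a, hyp_b, logps_b):
    return hyp_a if perplexity(logps_a) <= perplexity(logps_b) else hyp_b

chosen = arbitrate("hyp from ASR A", [-0.2, -0.4, -0.1],
                   "hyp from ASR B", [-1.0, -0.9, -1.2])
print(chosen)  # hyp from ASR A (lower perplexity wins)
```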
The experimental validation of the Tai-LALM model on the TAU Benchmark demonstrates a significant performance improvement over the baseline, achieving 49.1% accuracy. This empirical evidence supports the effectiveness of the proposed dataset and methodology. The paper includes a comprehensive ablation study that isolates the contributions of various components, reinforcing the robustness of the findings. However, the reliance on a single benchmark may limit the generalizability of the results.
The authors provide a clear outline of their methodology, including the dataset construction process and the training setup for Tai-LALM. However, the lack of direct access to the raw audio data due to copyright constraints poses challenges for full reproducibility. The mention of providing source URLs and metadata upon de-anonymization is a positive step towards enabling future research.
The paper acknowledges several limitations, including the empirical nature of the VGC curation threshold, which may require recalibration for different regions. Additionally, the latency and VRAM overhead introduced by the Dual-ASR arbitration could hinder deployment in resource-constrained environments. The evaluation primarily focuses on the TAU Benchmark, which may not capture the full spectrum of performance across diverse acoustic scenarios.
This work has significant implications for the development of localized audio-language models, particularly in under-resourced linguistic regions. By addressing the localization gap, the proposed dataset and methodologies can enhance the performance of LALMs in understanding regional dialects and acoustic features. The framework established in this paper could serve as a model for similar efforts in other culturally rich but underrepresented areas.
Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to unseen real-world data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential in improving model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method achieves higher accuracy while maintaining lower latency than existing TTA baselines. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of ASR-TRA, a novel test-time reinforcement adaptation framework that enhances the robustness of automatic speech recognition systems through causal interventions and semantic reward modeling. This work represents a significant step forward in addressing the challenges of deploying ASR systems in real-world conditions, providing a practical solution that balances accuracy and efficiency.
The proposed ASR-TRA framework introduces a novel approach to test-time adaptation (TTA) in automatic speech recognition (ASR) by leveraging reinforcement learning (RL) and causal interventions. The methodology is well-structured, utilizing a learnable decoder prompt and temperature-controlled stochastic decoding to generate diverse transcription candidates. The integration of a reward model based on audio-text semantic alignment is a significant innovation that addresses the limitations of existing TTA methods, which often rely on pseudo-labeling or entropy minimization. The use of a Structural Causal Model (SCM) to formalize the adaptation process adds rigor to the approach, although the paper could benefit from a more detailed explanation of the causal relationships involved.
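The reward-guided selection over diverse decodes reduces, at its simplest, to best-of-N reranking. In the sketch below the candidates stand in for temperature-sampled decodes and `toy_reward` is a placeholder for a CLAP-style audio-text alignment scorer, both invented here:

```python
# Minimal best-of-N reranking sketch: keep the candidate transcription
# with the highest reward. The keyword-overlap reward is a stand-in for
# an audio-text semantic alignment model, not the paper's actual scorer.

def rerank(candidates, reward_fn):
    return max(candidates, key=reward_fn)

candidates = ["the cat sat", "the cat sad", "a cat sat"]   # sampled decodes
reference_keywords = {"the", "cat", "sat"}                 # proxy for audio content

def toy_reward(text):
    # count words the hypothesis shares with the (proxy) audio content
    return len(set(text.split()) & reference_keywords)

print(rerank(candidates, toy_reward))  # the cat sat
```

In the full framework this reward is not only used for selection but also fed back as a reinforcement signal to update the prompt and model parameters.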
The experiments conducted on the LibriSpeech and L2 Arctic datasets demonstrate the effectiveness of ASR-TRA in improving ASR robustness against noise and accent variations. The results indicate a significant reduction in word error rates (WER) compared to existing TTA methods, showcasing the practical applicability of the proposed framework. The ablation studies provide valuable insights into the contributions of different components, confirming the importance of both prompt tuning and reward modeling. However, the paper could enhance its experimental evaluation by including more diverse datasets and real-world scenarios to further validate the robustness of the method.
The paper provides sufficient details regarding the implementation of ASR-TRA, including the architecture, datasets, and evaluation metrics. The inclusion of hyperparameters and specific configurations aids in reproducibility. However, the lack of a comprehensive description of the training process and the absence of a public demo could hinder full reproducibility for other researchers.
One limitation of the proposed method is its reliance on the CLAP reward model, which may not generalize well across all types of audio inputs. Additionally, while the method shows improvements in accuracy and latency, the computational cost associated with generating multiple candidates and evaluating them could be a concern in resource-constrained environments. The paper also does not address potential scalability issues when deploying the model in real-time applications.
The ASR-TRA framework has the potential to significantly enhance the robustness of ASR systems in real-world applications, particularly in environments with high noise levels or diverse accents. This could lead to improved accessibility and user experience in various domains, including voice-activated assistants, transcription services, and communication aids for individuals with speech impairments. The focus on test-time adaptation without requiring ground-truth labels is particularly relevant for applications where labeled data is scarce or unavailable.
Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of the Segment Anything Model Audio (SAM-Audio), a recent foundation-scale speech enhancement model from Meta AI, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate (WER) and Character Error Rate (CER) compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio (PSNR) analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. We therefore conduct a detailed utterance-level analysis to understand this counterintuitive result. We find that the recognition degradation is a systematic issue affecting the majority of the audio, not just isolated outliers, and that the errors worsen as the Whisper model size increases. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for machine recognition. This highlights the risk of blindly applying state-of-the-art denoising as a preprocessing step in zero-shot ASR pipelines.
Primary: University of Rajshahi
All Institutions: University of Rajshahi, Anan National College of Technology
The main contribution of this paper is the critical examination of the assumption that improving perceptual audio quality through denoising enhances ASR performance, revealing that such enhancements can actually degrade recognition accuracy in zero-shot ASR contexts. This comprehensive analysis challenges prevailing notions and underscores the need for ASR-aware approaches to speech preprocessing, thereby advancing the understanding of the interplay between audio quality and machine recognition.
The methodology is robust, employing a systematic empirical study to evaluate the impact of SAM-Audio on zero-shot ASR performance across two distinct datasets. The authors clearly outline their preprocessing pipeline, ASR models, and evaluation metrics, ensuring that the study is well-structured and reproducible. However, the reliance on a single variant of SAM-Audio due to computational constraints may limit the generalizability of the findings.
The experiments are comprehensive, covering multiple Whisper model variants and two linguistically diverse datasets. The use of WER and CER as primary metrics is appropriate for assessing ASR performance. The results consistently demonstrate that SAM-Audio preprocessing degrades ASR performance, which is a significant finding that challenges existing assumptions in the field.
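The WER metric used throughout the study is simply word-level Levenshtein distance divided by reference length; a from-scratch computation:

```python
# Word Error Rate: minimum substitutions + insertions + deletions needed
# to turn the hypothesis into the reference, divided by reference length.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

# one substitution (quick -> quack) + one insertion (jumps) over 4 words
print(wer("the quick brown fox", "the quack brown fox jumps"))  # 0.5
```

CER is the same computation applied to characters instead of words.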
The paper provides sufficient detail regarding the experimental setup, including datasets and evaluation protocols, which facilitates reproducibility. However, the lack of access to the SAM-Audio model variants used in the experiments may hinder full reproducibility for other researchers.
The study is limited by the use of only the SAM-Audio Small variant and the focus on zero-shot ASR, which may not capture the full potential of the enhancement model. Additionally, the analysis is based on two datasets, which may not encompass the full range of real-world acoustic conditions.
This research has significant implications for the field of ASR and speech enhancement, as it highlights the risks of applying denoising techniques without considering their impact on recognition accuracy. The findings encourage a reevaluation of preprocessing strategies in ASR systems, particularly in zero-shot settings.
Voice timbre attribute detection (vTAD) is the task of determining the relative intensity of timbre attributes between speech utterances. Voice timbre is a crucial yet inherently complex component of speech perception. While deep neural network (DNN) embeddings perform well in speaker modelling, they often act as black-box representations with limited physical interpretability and high computational cost. In this work, a compact acoustic parameter set is investigated for vTAD. The set captures important acoustic measures and their temporal dynamics, which are found to be crucial for the task. Despite its simplicity, the acoustic parameter set is competitive, outperforming conventional cepstral features and supervised DNN embeddings, and approaching state-of-the-art self-supervised models. Importantly, the studied set requires no trainable parameters, incurs negligible computation, and offers explicit interpretability for analysing physical traits behind human timbre perception.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of a compact and interpretable acoustic parameter set for voice timbre attribute detection, which effectively competes with complex DNN-based approaches while offering significant advantages in interpretability and computational efficiency. The research addresses a critical gap in the field by providing a practical solution that balances performance with the need for understanding the underlying acoustic features relevant to human speech perception.
The paper proposes a novel approach to voice timbre attribute detection (vTAD) using a compact set of acoustic parameters that captures essential features without requiring training. This method contrasts with traditional deep neural networks (DNNs), which are often computationally intensive and lack interpretability. The methodology is well-structured, focusing on the extraction of 13 acoustic features and their temporal dynamics, leading to a 26-dimensional representation. The use of a simple Diff-Net for classification is appropriate, although the paper could benefit from more detailed descriptions of the feature extraction process and the rationale behind the choice of acoustic parameters.
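One plausible way to assemble the 26-dimensional representation (13 measures plus their temporal dynamics) is utterance-level pooling of per-frame values and frame-to-frame deltas. The mean pooling below is an assumption for illustration; the paper's exact aggregation is not specified here:

```python
import statistics

# Illustrative construction of a 26-dim utterance vector: means of 13
# per-frame acoustic measures plus means of their frame-to-frame deltas.
# The mean-pooling choice is an assumption, not the paper's exact recipe.

def utterance_vector(frames):
    # frames: list of per-frame 13-dim acoustic parameter vectors
    n_feat = len(frames[0])
    means = [statistics.fmean(f[k] for f in frames) for k in range(n_feat)]
    deltas = [[b[k] - a[k] for k in range(n_feat)]
              for a, b in zip(frames, frames[1:])]
    delta_means = [statistics.fmean(d[k] for d in deltas) for k in range(n_feat)]
    return means + delta_means                       # 13 + 13 = 26 dims

# toy frames whose every measure rises by 1 per frame
frames = [[float(t + k) for k in range(13)] for t in range(4)]
vec = utterance_vector(frames)
print(len(vec))  # 26
```

A Diff-Net classifier then operates on the difference (or concatenation) of two such utterance vectors to judge relative attribute intensity.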
The experiments are robust, utilizing a well-defined dataset (VCTK-RVA) with expert annotations, which enhances the reliability of the results. The performance metrics (Accuracy and EER) are clearly presented, showing that the proposed method competes well against established DNN-based models. However, the paper could improve by providing more comparative analysis with other state-of-the-art methods and discussing the implications of the results in greater detail.
The paper lacks sufficient implementation details that would facilitate reproducibility. While the methodology is described, specific parameters, configurations, and code availability are not mentioned, which could hinder other researchers from replicating the results.
One limitation is the reliance on a single dataset, which may affect the generalizability of the findings. Additionally, while the proposed method is interpretable, the paper does not fully explore the implications of this interpretability in practical applications. The absence of a demo or project URL also limits accessibility for further exploration of the work.
The study has significant implications for fields requiring voice analysis, such as forensics, healthcare, and human-computer interaction. The focus on interpretability and computational efficiency can lead to more accessible and user-friendly applications in speech technology. The findings could influence future research directions in audio processing and speech perception, particularly in developing systems that prioritize interpretability alongside performance.
Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven acoustic features remain important for modeling fine-grained structural properties of speech. Most existing acoustic front ends are based on time-frequency representations, which do not fully exploit higher-order spectral dependencies inherent in speech signals. We introduce a cyclostationarity-inspired acoustic feature extraction framework for SDD based on spectral correlation density (SCD). The proposed features model periodic statistical structures in speech by capturing spectral correlations between frequency components. In particular, we propose temporally structured SCD features that characterize the evolution of spectral and cyclic-frequency components over time. The effectiveness and complementarity of the proposed features are evaluated using multiple countermeasure architectures, including convolutional neural networks, SSL-based embedding systems, and hybrid fusion models. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5 demonstrate that SCD-based features provide complementary discriminative information to SSL embeddings and conventional acoustic representations. In particular, fusion of SSL and SCD embeddings reduces the equal error rate on ASVspoof 2019 LA from $8.28\%$ to $0.98\%$, and yields consistent improvements on the challenging ASVspoof 5 dataset. The results highlight cyclostationary signal analysis as a theoretically grounded and effective front end for speech deepfake detection.
Primary: Bursa Technical University
All Institutions: Bursa Technical University, TCG CREST, University of Eastern Finland
The main contribution of this paper is the introduction of a cyclostationarity-based feature extraction framework for speech deepfake detection, which significantly enhances the detection capabilities by capturing spectral correlations that are often overlooked by conventional methods. This work represents a meaningful advancement in the field of audio signal processing and machine learning, particularly in the context of combating the growing threat of synthetic audio content.
The paper introduces a novel cyclostationarity-inspired feature extraction framework for speech deepfake detection (SDD) that leverages spectral correlation density (SCD) to capture periodic statistical structures in speech. The methodology is well-grounded in signal processing theory, addressing the limitations of conventional time-frequency representations. The proposed two-dimensional SCD features are designed to incorporate temporal dynamics, which enhances their discriminative power. The use of multiple countermeasure architectures, including convolutional neural networks and self-supervised learning embeddings, demonstrates a comprehensive approach to evaluating the effectiveness of the proposed features.
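The underlying quantity, the spectral correlation density, is roughly S_x(f, a) ~ E[X(f + a) conj(X(f - a))]: for a cyclostationary signal, energy concentrates at cyclic frequencies tied to its periodic structure. The bin-indexed toy estimator below illustrates the idea only; the paper's temporally structured SCD features are considerably richer:

```python
import cmath, math

# Toy averaged-periodogram estimator of the spectral correlation density.
# For a real sinusoid at DFT bin k0, the conjugate spectral product peaks
# at f = 0, cyclic index a = k0, and vanishes at mismatched cyclic indices.

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def scd(segments, f, a):
    # average X[f + a] * conj(X[f - a]) over segment spectra (bin indices)
    n = len(segments[0])
    specs = [dft(s) for s in segments]
    prods = [X[(f + a) % n] * X[(f - a) % n].conjugate() for X in specs]
    return sum(prods) / len(prods)

# two length-16 frames of a cosine sitting exactly at bin 3
segments = [[math.cos(2 * math.pi * 3 * t / 16) for t in range(16)]
            for _ in range(2)]
peak = scd(segments, 0, 3)   # cyclic index matching the tone
off = scd(segments, 0, 5)    # mismatched cyclic index
print(abs(peak) > abs(off))  # True: energy concentrates at a = 3
```

The paper's temporally structured variant additionally tracks how these spectral and cyclic-frequency components evolve across successive frames.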
The experiments are robust, utilizing three challenging datasets (ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5) to validate the proposed features. The results indicate significant improvements in equal error rates when combining SCD features with SSL embeddings, showcasing the complementarity of the approaches. The experimental setup is thorough, with clear metrics for performance evaluation (EER and minDCF), and the results are presented in a manner that highlights the advantages of the proposed methods over existing baselines.
The paper provides sufficient detail regarding the experimental setup, including datasets, feature extraction methods, and model architectures. However, the absence of a publicly available code repository limits the reproducibility of the results. The authors do provide a demo URL for synthesized speech, which is beneficial but does not fully compensate for the lack of code.
One limitation is the reliance on specific datasets, which may not capture the full diversity of speech deepfake scenarios. Additionally, while the results are promising, the paper does not address potential overfitting issues or the generalizability of the models to unseen spoofing techniques. The computational complexity of the SCD feature extraction process may also pose challenges for real-time applications.
The proposed methodology has significant implications for enhancing the security and trustworthiness of voice-driven technologies, particularly in applications like audio forensics and telecommunication security. By improving the detection of speech deepfakes, the research contributes to the broader field of audio signal processing and machine learning, addressing a critical need in the era of advanced synthetic media.
Generative audio applications call for fine-grained controllable outputs, yet most existing methods either require retraining the model for each specific control or rely on inference-time controls (e.g., guidance) that can be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to decoder backpropagation, we introduce a guidance-based approach combining selective training-free guidance (TFG) with Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step and requiring minimal training resources (7M parameters and roughly 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations thereof) while maintaining generation quality. Our method balances precision and audio fidelity at far lower computational cost than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
Primary: UC San Diego
All Institutions: UC San Diego
The main contribution of this paper is the introduction of a low-resource, inference-time control framework for latent audio diffusion models, which effectively balances control precision, audio fidelity, and runtime performance. The methodology and results presented are significant advancements in the field of controllable audio generation, showcasing the potential for efficient and high-quality audio synthesis.
The paper introduces a novel approach to controllable audio generation through the use of Latent-Control Heads (LatCHs) and selective Training-Free Guidance (TFG). By operating directly in latent space, the proposed method significantly reduces computational overhead associated with traditional end-to-end guidance methods. The methodology is well-structured, with clear explanations of how LatCHs function and the rationale behind selective TFG. The authors provide a solid theoretical foundation, linking their work to existing literature while clearly delineating their contributions.
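To make the latent-space guidance idea concrete, here is a deliberately minimal sketch: a frozen linear "control head" predicts a scalar attribute from a latent, and each guidance step moves the latent down the head's gradient, with no decoder in the loop. The linear head and all names are illustrative assumptions, not the paper's LatCH architecture.

```python
import numpy as np

def latent_guidance_step(z, W, target, lr=0.1):
    """One guidance step on L(z) = 0.5 * (W @ z - target)**2, taken
    entirely in latent space (no decoder backpropagation)."""
    grad = (W @ z - target) * W          # dL/dz for a linear head
    return z - lr * grad

rng = np.random.default_rng(0)
z = rng.normal(size=16)                  # latent at the current step
W = rng.normal(size=16) / 4.0            # frozen "control head" weights
for _ in range(500):
    z = latent_guidance_step(z, W, target=1.0)
```

Because the head reads latents directly, each step costs one small forward/backward pass instead of a full decode, which is the cost-per-step saving the paper targets.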
The experiments are comprehensive, utilizing the Stable Audio Open (SAO) dataset and comparing the proposed methods against established baselines, including end-to-end guidance and readouts. The evaluation metrics are well-defined, including both qualitative assessments (mean opinion scores) and quantitative metrics (FDopenl3, KLpasst, and CLAP). The results demonstrate that LatCHs outperform traditional methods in terms of both audio quality and computational efficiency, which is a significant achievement in the field of audio generation.
The paper provides sufficient details regarding the experimental setup, including hyperparameters and training procedures for LatCHs. However, the lack of a publicly available code repository may hinder full reproducibility. The authors do mention the datasets used, which aids in replicating the experiments, but the absence of a project URL limits access to the implementation.
One limitation is the potential challenge in generalizing the method to more complex audio generation tasks beyond the evaluated controls (intensity, pitch, and beats). Additionally, the reliance on specific feature extractors may limit the applicability of the approach to other audio domains. The authors also note that controls with greater variability, such as pitch, pose challenges, indicating room for improvement in handling such cases.
The proposed framework has significant implications for the field of generative audio, particularly in applications requiring real-time audio manipulation and control. The ability to generate high-quality audio with low computational costs can benefit various industries, including music production, gaming, and virtual reality. Furthermore, the approach could pave the way for more accessible audio generation tools for creators without extensive computational resources.
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based audio-visual speech recognition framework built on sparse modality alignment and visual unit-guided refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, it achieves a 37% relative improvement over the baseline system.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, The Chinese University of Hong Kong, Wuhan University
The main contribution of this paper is the introduction of AVUR-LLM, a novel framework for audio-visual speech recognition that leverages sparse modality alignment and visual unit-guided refinement to achieve state-of-the-art performance in challenging acoustic conditions. This work significantly advances the field of AVSR by addressing key limitations of existing methods and demonstrating the potential for improved robustness and accuracy in speech recognition tasks.
The proposed methodology introduces several innovative components such as Sparse Modality Alignment (SMA), Adaptive Modulated Fusion (AMF), and Visual Unit-Guided Refinement (VUR). SMA allows for a more controlled interaction between audio and visual modalities by inserting alignment blocks into the audio encoder, which is a significant improvement over existing methods that typically rely on shallow fusion. The AMF component intelligently modulates visual feature injection based on acoustic reliability, enhancing the model's adaptability to varying input conditions. The VUR approach effectively transforms visual representations into discrete tokens for LLM rescoring, which is a novel strategy that leverages the strengths of both visual and language models. Overall, the methodology is well-structured and addresses key limitations in prior AVSR systems.
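The reliability-gated injection behind AMF can be sketched in a few lines; the sigmoid gate below is one plausible reading of "modulating visual feature injection based on acoustic reliability", not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(audio_feats, visual_feats, reliability_logit):
    """Gate visual injection by estimated acoustic reliability: a high
    logit (clean audio) closes the gate so fused features stay mostly
    acoustic; a low logit (noisy audio) opens it."""
    gate = sigmoid(-reliability_logit)
    return audio_feats + gate * visual_feats
```

In a trained system the reliability logit would itself be predicted from the audio (e.g., from frame-level SNR cues); here it is passed in directly to keep the sketch self-contained.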
The experiments conducted on the LRS3 dataset demonstrate the effectiveness of the proposed model, achieving state-of-the-art results in various noise conditions. The reported 37% relative improvement in Word Error Rate (WER) under 0 dB SNR conditions is particularly noteworthy, showcasing the robustness of the model in challenging scenarios. The ablation studies provide additional insights into the contributions of each component, reinforcing the validity of the proposed framework. However, the paper could benefit from a more detailed discussion on the statistical significance of the results and comparisons with a broader range of existing methods.
The paper provides a comprehensive overview of the experimental setup, including details on the dataset, model architecture, training procedures, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider making the implementation accessible to facilitate validation by the research community.
One limitation of the study is the reliance on a single dataset (LRS3) for evaluation, which may not fully capture the generalizability of the model across different domains or languages. Additionally, while the method shows improvements in noise robustness, the paper does not explore the performance in extremely adverse conditions or with diverse accents and speech patterns. The computational efficiency of the proposed model, particularly in real-time applications, is also not thoroughly addressed.
The advancements in AVSR presented in this paper have significant implications for various applications, including assistive technologies for the hearing impaired, video conferencing systems, and automated transcription services. By enhancing the robustness of speech recognition in noisy environments, this research contributes to making communication technologies more accessible and effective.
Training-free anomalous sound detection (ASD) based on pre-trained audio embedding models has recently garnered significant attention, as it enables the detection of anomalous sounds using only normal reference data while offering improved robustness under domain shifts. However, existing embedding-based approaches almost exclusively rely on temporal mean pooling, while alternative pooling strategies have so far only been explored for spectrogram-based representations. Consequently, the role of temporal pooling in training-free ASD with pre-trained embeddings remains insufficiently understood. In this paper, we present a systematic evaluation of temporal pooling strategies across multiple state-of-the-art audio embedding models. We propose relative deviation pooling (RDP), an adaptive pooling method that emphasizes informative temporal deviations, and introduce a hybrid pooling strategy that combines RDP with generalized mean pooling. Experiments on five benchmark datasets demonstrate that the proposed methods consistently outperform mean pooling and achieve state-of-the-art performance for training-free ASD, including results that surpass all previously reported trained systems and ensembles on the DCASE2025 ASD dataset.
Primary: Aalborg University
All Institutions: Aalborg University, Pioneer Centre for Artificial Intelligence
The paper presents a novel exploration of temporal pooling strategies in training-free anomalous sound detection, significantly advancing the understanding of this critical component in audio processing pipelines. The systematic evaluation and introduction of innovative pooling methods contribute valuable insights and methodologies that can influence future research and applications in the field.
The paper introduces relative deviation pooling (RDP) and a hybrid pooling strategy that combines RDP with generalized mean pooling (GEM). This approach emphasizes informative temporal deviations, addressing a significant gap in the current understanding of temporal pooling in training-free anomalous sound detection (ASD). The systematic evaluation of various pooling strategies across multiple state-of-the-art audio embedding models is a strong methodological contribution, as it not only highlights the importance of pooling mechanisms but also provides a framework for future research in this area.
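For concreteness, generalized mean pooling has a standard closed form, and a deviation-weighted pooling in the spirit of RDP might look as follows. The `rdp_pool` formula is a hypothetical reading of "emphasizing informative temporal deviations", not the paper's exact definition.

```python
import numpy as np

def gem_pool(X, p=3.0, eps=1e-6):
    """Generalized mean pooling over time: reduces to mean pooling at
    p = 1 and approaches max pooling as p grows. X has shape (T, D)."""
    Xc = np.clip(X, eps, None)            # GEM assumes non-negative inputs
    return (Xc ** p).mean(axis=0) ** (1.0 / p)

def rdp_pool(X):
    """Sketch of relative-deviation pooling: weight each frame by how
    far it deviates from the temporal mean, so atypical (informative)
    frames dominate the pooled embedding."""
    mu = X.mean(axis=0)
    dev = np.linalg.norm(X - mu, axis=1)  # per-frame deviation, shape (T,)
    w = dev / (dev.sum() + 1e-12)
    return (w[:, None] * X).sum(axis=0)
```

A hybrid strategy along the lines of the paper would concatenate or average the two pooled vectors before computing anomaly scores against the normal reference set.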
The experiments are conducted on five benchmark datasets, demonstrating that the proposed methods consistently outperform traditional mean pooling and achieve state-of-the-art performance for training-free ASD. The results are rigorously analyzed, showing significant improvements over existing methods, including previously reported trained systems. The paper includes comprehensive comparisons and ablation studies, validating the effectiveness of the proposed pooling strategies.
The paper provides detailed descriptions of the datasets, experimental setup, and evaluation metrics, which enhances reproducibility. However, the absence of publicly available code or demo URLs limits the ability for others to directly replicate the findings. The authors mention the use of specific hyperparameters but do not provide a repository for the implementation, which could be a barrier for reproducibility.
One limitation is the reliance on pre-trained audio embedding models, which may not generalize well to all types of anomalous sounds. Additionally, while the proposed pooling strategies show significant improvements, the paper does not explore the potential of integrating these methods into supervised or semi-supervised frameworks, which could further enhance performance. The focus on training-free methods may also limit applicability in scenarios where labeled data is available.
The findings have significant implications for real-world applications in anomaly detection, particularly in industrial settings where rapid deployment and robustness to domain shifts are critical. The proposed methods could lead to more effective monitoring systems for machinery and environmental sounds, potentially reducing downtime and improving safety. The emphasis on training-free approaches also opens avenues for applications in resource-constrained environments.
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whispered and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion optimizing content informativeness and cross-domain invariance. Our method achieves SOTA intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work while using only 10 steps at inference and requiring no real paired data.
Primary: Samsung R&D Institute UK (SRUK)
All Institutions: Samsung R&D Institute UK (SRUK), Mobile eXperience Business, Republic of Korea
The main contribution of this paper is the introduction of FlowW2N, a novel approach for whispered-to-normal speech conversion that achieves state-of-the-art performance by leveraging synthetic data and domain-invariant features. This work represents a meaningful advancement in the field of audio processing and speech synthesis, addressing critical challenges in speech intelligibility and quality.
The proposed FlowW2N method introduces a novel conditional flow matching approach that effectively addresses the challenges of whispered-to-normal speech conversion, particularly the temporal misalignment and lack of paired data. By leveraging synthetic data and domain-invariant ASR embeddings, the authors successfully sidestep traditional alignment issues, which is a significant advancement in the field. The architecture employs a Diffusion Transformer and utilizes a Gaussian prior for generation, which is innovative and well-justified. The methodology is clearly articulated, with a systematic exploration of different conditioning mechanisms and layer selection criteria that enhance the model's performance.
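The conditional flow matching recipe underlying this kind of training can be summarized in a few lines: interpolate between a Gaussian prior sample and the data, and regress the model toward the constant velocity of that path. This is the generic CFM objective, not the paper's full conditioned Diffusion Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(x1):
    """Return (t, x_t, v_target) for one flow-matching training example:
    x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I), and the regression
    target is the constant path velocity v = x1 - x0."""
    x0 = rng.normal(size=x1.shape)       # Gaussian prior sample
    t = rng.uniform()                    # random time along the path
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation
    v_target = x1 - x0
    return t, x_t, v_target
```

At inference, a learned velocity field is integrated from t = 0 to t = 1 in a handful of steps, which is why the paper can report only 10 inference steps.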
The experiments are comprehensive, utilizing two well-established datasets (CHAINS and wTIMIT) to evaluate the model's performance. The results demonstrate a significant reduction in Word Error Rate (WER) compared to prior methods, achieving state-of-the-art intelligibility. The paper includes ablation studies that provide insights into the contributions of various components of the model, reinforcing the robustness of the findings. The evaluation metrics are appropriate and well-defined, ensuring that the results are credible and reproducible.
While the paper provides a detailed description of the methodology and experimental setup, it lacks a publicly available code repository or demo URL, which hinders reproducibility. The authors mention using internal generative AI tools for language refinement, but there is no indication of whether the model or data will be made available for further research.
One limitation is the reliance on synthetic data for training, which may not fully capture the complexities of real-world whispered speech. Additionally, while the model shows impressive performance on the evaluated datasets, its generalizability to other languages or dialects is not addressed. The absence of a demo or code repository also limits the accessibility of the research for further validation by the community.
The implications of this research are significant, particularly in applications involving speech recognition and synthesis for individuals with speech impairments or in noisy environments. The ability to convert whispered speech to normal speech could enhance communication for those who rely on whispering due to various reasons, thus broadening accessibility in technology.
Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion--perception tradeoff, and data curation remain unresolved. In this work, we systematically address these three overlooked problems. First, we revisit the conventional practice of using early-reflected speech as the dereverberation target and show that it can degrade perceptual quality and downstream ASR performance. We instead demonstrate that time-shifted anechoic clean speech provides a superior learning target. Second, guided by the distortion--perception tradeoff theory, we propose a simple two-stage framework that achieves minimal distortion under a given level of perceptual quality. Third, we analyze the trade-off between training data scale and quality for USE, revealing that training on large uncurated corpora imposes a performance ceiling, as models struggle to remove subtle artifacts. Our method achieves state-of-the-art performance on the URGENT 2025 non-blind test set and exhibits strong language-agnostic generalization, making it effective for improving TTS training data. Code and models will be released upon acceptance.
Primary: NVIDIA
All Institutions: NVIDIA, Academia Sinica (Taipei, Taiwan)
The main contribution of this paper is the introduction of a novel approach to Universal Speech Enhancement that significantly improves speech quality and ASR performance by rethinking training targets and leveraging a two-stage model framework. This work addresses critical gaps in the field and provides a solid foundation for future research and applications in speech processing.
The paper presents a systematic approach to Universal Speech Enhancement (USE) by addressing three critical challenges: training target selection, the distortion-perception tradeoff, and data quality. The authors propose using time-shifted anechoic clean speech as a learning target, which is shown to outperform conventional early-reflected speech. They also introduce a two-stage framework that combines regression and generative models to balance fidelity and perceptual quality effectively. This methodology is well-grounded in theoretical principles and is supported by empirical evidence.
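The "time-shifted anechoic clean speech" target can be illustrated with an assumed alignment step: estimate the direct-path delay between the clean and degraded signals via cross-correlation, then shift the clean target accordingly. This is a sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def time_shift_target(clean, degraded):
    """Align the anechoic clean target to the degraded input's direct
    path: find the lag maximizing cross-correlation, then shift."""
    corr = np.correlate(degraded, clean, mode="full")
    lag = int(np.argmax(corr)) - (len(clean) - 1)   # direct-path delay
    return np.roll(clean, lag)   # naive circular shift for brevity
```

A production pipeline would zero-pad rather than wrap and would estimate the lag per utterance, but the core idea is the same: the learning target carries the direct-path delay while staying free of reflections.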
The experiments are comprehensive, utilizing the URGENT 2025 Challenge dataset, which includes diverse speech distortions and languages. The authors provide detailed results that demonstrate significant improvements in both perceptual quality and automatic speech recognition (ASR) performance. The evaluation metrics are robust, covering both intrusive and non-intrusive measures, which strengthens the validity of their findings.
The authors commit to releasing their code and models upon acceptance, which is a positive step towards reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, such as hyperparameters and specific training procedures, to facilitate replication.
One notable limitation is the reliance on the URGENT 2025 Challenge dataset, which may not fully represent real-world conditions. Additionally, while the proposed method shows improvements, the paper does not extensively discuss scenarios where the model may fail or the potential for overfitting to the training data.
The advancements in speech enhancement have significant implications for various applications, including telecommunications, assistive technologies, and improving the quality of training data for text-to-speech systems. The language-agnostic nature of the proposed method could also benefit low-resource languages, enhancing accessibility and communication.
This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increasing device complexity and cost. To enable ML-based OVD without requiring costly transfer-function measurements, we propose a data augmentation strategy based on simulated acoustic transfer functions (ATFs) that expose the model to a wide range of spatial propagation conditions. A transformer-based classifier is first trained on analytically generated ATFs and then progressively fine-tuned using numerically simulated ATFs, transitioning from a rigid-sphere model to a detailed head-and-torso representation. This hierarchical adaptation enables the model to refine its spatial understanding while maintaining generalization. Experimental results show 95.52% accuracy on simulated head-and-torso test data. Under short-duration conditions, the model maintained 90.02% accuracy with one-second utterances. On real hearing aid recordings, the model achieved 80% accuracy without fine-tuning, aided by lightweight test-time feature compensation. This highlights the model's ability to generalize from simulated to real-world conditions, demonstrating practical viability and pointing toward a promising direction for future hearing aid design.
Primary: Victoria University of Wellington
All Institutions: Victoria University of Wellington, GN ReSound
The main contribution of this work is the introduction of a simulation-based framework for single microphone own voice detection in hearing aids, which effectively utilizes simulated acoustic transfer functions to enhance model training and generalization. This innovative approach not only addresses existing challenges in OVD but also sets a promising direction for future advancements in hearing aid technology.
The paper proposes a novel approach to own voice detection (OVD) in hearing aids using a single microphone by leveraging simulated acoustic transfer functions (ATFs) for data augmentation. The methodology is well-structured, involving a two-stage simulation-based ATF generation pipeline that transitions from a rigid-sphere model to a detailed head-and-torso representation. The use of a transformer-based classifier enhances the model's ability to learn from spatial propagation cues, which is a significant advancement over traditional methods that rely on multiple microphones or complex signal processing techniques. The hierarchical adaptation strategy employed to progressively fine-tune the model is a commendable aspect, allowing for improved generalization from simulated to real-world conditions.
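The augmentation strategy boils down to filtering clean speech through simulated ATFs. A toy version, with an impulse response made of a direct path plus one attenuated reflection (the IR values are illustrative, not from the paper):

```python
import numpy as np

def augment_with_atf(speech, atf_ir):
    """Apply a simulated ATF, given as an impulse response, to speech;
    truncate the convolution to keep the input length."""
    return np.convolve(speech, atf_ir)[: len(speech)]

toy_ir = np.zeros(64)
toy_ir[0] = 1.0          # direct path
toy_ir[40] = 0.3         # single reflection, 40 samples later
```

Sampling many such IRs, from analytic rigid-sphere models up to numerically simulated head-and-torso responses, is what exposes the classifier to varied spatial propagation conditions.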
The experimental results demonstrate high accuracy rates, achieving 95.52% on simulated head-and-torso test data and 80% on real hearing aid recordings without fine-tuning. The use of diverse datasets, including VoxCeleb1 and LibriSpeech, alongside real-world recordings, adds robustness to the evaluation. The model's performance under varying noise conditions was also assessed, showcasing its resilience, which is crucial for practical applications in hearing aids.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific URLs or repositories for code and data, which would enhance reproducibility. The absence of a demo or project URL limits the ability for other researchers to replicate the findings directly. However, the comprehensive description of the data augmentation process and model training strategies offers a solid foundation for future implementations.
One limitation is the reliance on simulated data for training, which may not fully capture the complexities of real-world acoustic environments. The model's performance on real recordings, while promising, may still be affected by factors not accounted for in the simulations. Additionally, the study focuses on offline segment-level detection, leaving out considerations for real-time applications, which are critical in hearing aid technology.
The proposed method has significant implications for the design of hearing aids, particularly in enhancing user comfort and speech intelligibility without increasing device complexity or cost. By enabling effective OVD with a single microphone, this research could lead to more accessible hearing aid solutions for individuals with hearing impairments, potentially improving their quality of life.
Early and accessible detection of Alzheimer's disease (AD) remains a major challenge, as current diagnostic methods often rely on costly and invasive biomarkers. Speech and language analysis has emerged as a promising non-invasive and scalable approach to detecting cognitive impairment, but research in this area is hindered by the lack of publicly available datasets, especially for languages other than English. This paper introduces the PARLO Dementia Corpus (PDC), a new multi-center, clinically validated German resource for AD collected across nine academic memory clinics in Germany. The dataset comprises speech recordings from individuals with AD-related mild cognitive impairment and mild to moderate dementia, as well as cognitively healthy controls. Speech was elicited using a standardized test battery of eight neuropsychological tasks, including confrontation naming, verbal fluency, word repetition, picture description, story reading, and recall tasks. In addition to audio recordings, the dataset includes manually verified transcriptions and detailed demographic, clinical, and biomarker metadata. Baseline experiments on ASR benchmarking, automated test evaluation, and LLM-based classification illustrate the feasibility of automatic, speech-based cognitive assessment and highlight the diagnostic value of recall-driven speech production. The PDC thus establishes the first publicly available German benchmark for multi-modal and cross-lingual research on neurodegenerative diseases.
Primary: PARLO Institute for Research and Teaching in Speech Therapy
All Institutions: PARLO Institute for Research and Teaching in Speech Therapy
The main contribution of this paper is the introduction of the PARLO Dementia Corpus, a clinically validated German resource for Alzheimer's disease research, which addresses a critical gap in available datasets for non-English languages. This work provides a comprehensive framework for future studies in speech-based cognitive assessment and establishes a benchmark for multi-modal research in neurodegenerative diseases.
The methodology presented in the paper is robust, involving the collection of a diverse dataset from multiple centers, which enhances the generalizability of the findings. The use of a standardized test battery for eliciting speech data is a significant strength, as it allows for systematic comparisons across different cognitive tasks. The detailed transcription process and the inclusion of demographic and clinical metadata further enrich the dataset, making it a valuable resource for future research. The integration of automatic speech recognition (ASR) systems and large language models (LLMs) for cognitive assessment demonstrates a forward-thinking approach, leveraging current advancements in AI.
The experiments conducted provide a solid foundation for evaluating the utility of the PARLO Dementia Corpus. The ASR benchmarking results are particularly noteworthy, showing a clear correlation between cognitive status and transcription accuracy. The automatic test evaluation results validate the effectiveness of the proposed scoring methods, achieving high correlation coefficients with human evaluations. The LLM-based classification experiments illustrate the potential for automated cognitive assessment, although the zero-shot classification approach may benefit from further refinement and validation.
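The ASR-benchmarking idea discussed above, comparing transcription accuracy across cognitive-status groups, can be made concrete with a small word-error-rate sketch. The transcripts below are invented for illustration (they are not from the PDC), and a real evaluation would use a production WER toolkit rather than this minimal implementation:

```python
from statistics import mean

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical (group, reference transcript, ASR hypothesis) triples.
recordings = [
    ("control",  "the boy climbs on the stool",   "the boy climbs on the stool"),
    ("control",  "she dries the dishes",          "she dries the dishes"),
    ("mci",      "the water runs over the sink",  "the water runs over sink"),
    ("dementia", "cookies in the jar",            "cookie the jar"),
]

by_group: dict[str, list[float]] = {}
for group, ref, hyp in recordings:
    by_group.setdefault(group, []).append(wer(ref, hyp))

# Mean WER per diagnostic group -- the quantity one would correlate
# with cognitive status in a benchmark like the one described above.
group_wer = {g: mean(ws) for g, ws in by_group.items()}
```

In this toy data the mean WER rises from controls to MCI to dementia, mirroring the correlation between cognitive status and transcription accuracy reported in the paper.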
The paper outlines a clear methodology for data collection, transcription, and experimental setup, which supports reproducibility. However, the lack of publicly available code or a project URL limits the ease with which other researchers can replicate the experiments. Providing access to the dataset and a detailed description of the experimental setup would enhance reproducibility.
One limitation of the study is the relatively small sample size of 208 participants, which may affect the statistical power of the findings. Additionally, while the dataset is a significant step forward for German-language research, it may not fully capture the diversity of speech patterns across different demographics or regions within Germany. The reliance on ASR systems also introduces potential biases, as these systems may struggle with disordered speech typical of dementia patients.
The PARLO Dementia Corpus has the potential to significantly impact the field of cognitive impairment research, particularly in non-English speaking populations. It opens avenues for the development of automated screening tools that could facilitate early detection of Alzheimer's disease, ultimately improving patient outcomes. The dataset's compatibility with existing English-language resources enhances its utility for cross-lingual research, promoting a more inclusive approach to cognitive health studies.
Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents a systematic bias analysis of acoustic-based CI and depression classification using the DementiaBank Pitt Corpus. We compare traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0 (W2V2), and evaluate classification performance across gender, age, and depression-status subgroups. For CI detection, higher-layer W2V2 embeddings outperform baseline features (UAR up to 80.6%), but exhibit performance disparities; specifically, females and younger participants demonstrate lower discriminative power (AUC: 0.769 and 0.746, respectively) and substantial specificity disparities (Δ_spec up to 18% and 15%, respectively), leading to a higher risk of misclassification than their counterparts. These disparities reflect representational biases, defined as systematic differences in model performance across demographic or clinical subgroups. Depression detection within CI subjects yields lower overall performance, with mild improvements from low- and mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited, indicating that each task depends on distinct representations. These findings emphasize the need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, particularly in light of demographic and clinical heterogeneity in real-world deployments.
Primary: University of Antioquia
All Institutions: University of Antioquia, Technische Hochschule Nürnberg, Friedrich-Alexander Universität Erlangen-Nürnberg
This study systematically investigates bias in self-supervised acoustic representations for cognitive impairment detection, revealing significant performance disparities across demographic and clinical subgroups. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights into the fairness and reliability of machine learning models in clinical applications.
The methodology is robust, employing a systematic bias analysis of acoustic representations for cognitive impairment detection. The comparison of traditional acoustic features with self-supervised embeddings from Wav2Vec 2.0 is well-structured, and the evaluation across demographic and clinical subgroups adds significant depth to the analysis. The use of multiple classifiers and detailed bias metrics enhances the rigor of the methodology.
The experiments are comprehensive, utilizing a well-defined dataset (DementiaBank Pitt Corpus) and addressing class imbalance through various balancing strategies. The results are clearly presented, demonstrating the performance of different acoustic features and classifiers, with a focus on subgroup-specific metrics. However, the performance for depression detection is notably lower, which raises questions about the model's effectiveness in this area.
The paper provides a GitHub repository with source code and audio filenames, which supports reproducibility. However, the raw audio recordings cannot be shared due to restrictions, which may limit the ability of other researchers to fully replicate the study.
The study is limited by its reliance on a single dataset, which may not capture the full demographic and linguistic diversity of clinical populations. Additionally, the small number of depression-labeled samples in the dataset may affect the robustness of the conclusions regarding depression classification.
The findings highlight critical issues related to bias and fairness in machine learning applications in healthcare, particularly in speech-based diagnostics for cognitive impairment. The work underscores the importance of fairness-aware evaluation protocols, which could influence future research and clinical practices in AI-driven healthcare solutions.
The paper introduces the PARLO Dementia Corpus, a pioneering resource for Alzheimer's disease research in German, enabling innovative approaches to cognitive assessment through speech analysis. The comprehensive dataset and its validation through rigorous experiments position it as a valuable contribution to the fields of speech technology and clinical neuroscience.
The methodology is robust, involving a multi-center design that ensures diversity in the dataset. The use of standardized neuropsychological tasks for speech elicitation is a significant strength, as it allows for a comprehensive assessment of cognitive function. The detailed transcription process enhances the dataset's utility for various analyses. However, the reliance on a specific demographic (German-speaking individuals) may limit generalizability to other languages and cultures.
The experiments conducted, including ASR benchmarking and LLM-based classification, are well-structured and demonstrate the dataset's applicability. The results indicate a clear correlation between automatic evaluations and human assessments, validating the dataset's potential for clinical applications. The choice of models and the evaluation metrics used are appropriate, though further exploration of different ASR systems could enhance understanding of performance across varied conditions.
The paper provides sufficient detail regarding the data collection, transcription, and experimental setup, which supports reproducibility. However, the absence of publicly accessible code or a demo for the models used limits the ease with which other researchers can replicate the findings.
The study's limitations include the potential biases inherent in a multi-center study, such as variability in participant recruitment and testing conditions. Additionally, while the dataset is comprehensive, it may not capture the full spectrum of cognitive impairment across different languages or cultural contexts. The focus on German may restrict broader applicability.
The PARLO Dementia Corpus has significant implications for both clinical and research applications in the field of cognitive impairment. By providing a publicly available dataset, it facilitates advancements in automatic speech recognition and cognitive assessment tools, potentially leading to earlier detection and better management of Alzheimer's disease. The corpus also sets a precedent for future multilingual studies in speech analysis related to cognitive health.
Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (SFT). Inspired by the field of large language models, wherein reinforcement learning (RL) is used for model fine-tuning, we investigate the impact of RL, specifically Group Relative Policy Optimization (GRPO). The results from experiments using multiple detectors and test sets indicate that pure GRPO-based fine-tuning improves performance on out-of-domain test sets while maintaining performance on target-domain test data. This approach outperforms both SFT-only and hybrid setups. Our ablation studies further suggest that the negative reward in GRPO may be a key factor in this improvement.
Primary: National Institute of Informatics
All Institutions: National Institute of Informatics
The main contribution of this paper is the introduction of a reinforcement learning-based fine-tuning approach for speech deepfake detection, demonstrating improved generalization capabilities compared to traditional supervised methods. This work significantly advances the understanding of model training in the context of speech deepfake detection and opens avenues for future research in applying reinforcement learning techniques to other domains within machine learning.
The methodology employed in this paper is robust, leveraging a novel approach of applying Group Relative Policy Optimization (GRPO) for fine-tuning speech deepfake detection models. The authors effectively draw parallels between the fine-tuning processes in speech and large language models, providing a clear rationale for their choice of GRPO over traditional supervised fine-tuning (SFT). The detailed description of the training paradigm, including the formulation of the loss functions and the experimental setup, demonstrates a comprehensive understanding of the problem space. However, the paper could benefit from clearer visual aids and more explicit definitions of the terms used in the equations, which may enhance reader comprehension.
The experimental evaluation is thorough, utilizing multiple detectors and diverse test sets to assess the performance of GRPO against SFT and hybrid setups. The results are well-presented, showing clear improvements in out-of-domain generalization without sacrificing in-domain performance. The use of ablation studies to isolate the effects of the negative reward and regularization term adds depth to the analysis. However, the paper lacks a comparative analysis with other state-of-the-art methods in speech deepfake detection, which could contextualize the results more effectively.
The paper outlines the experimental setup and model configurations in detail, which is essential for reproducibility. However, the lack of publicly available code or a clear project URL limits the ability for other researchers to replicate the findings. Providing a GitHub repository with the implementation would significantly enhance reproducibility and foster further research in this area.
One limitation is the absence of a comparative analysis with other advanced techniques in the field of speech deepfake detection, which could provide a more comprehensive understanding of the GRPO approach's relative performance. Additionally, the paper does not address the potential computational overhead introduced by the GRPO fine-tuning process compared to SFT, which could be a consideration for practical applications.
The findings of this research have significant implications for the field of speech deepfake detection, particularly in enhancing the robustness of models against unseen attacks. As deepfake technology continues to evolve, improving detection methods is crucial for maintaining the integrity of audio content across various applications, including media, security, and communication. The insights gained from this study could inspire further innovations in model training paradigms and contribute to the development of more resilient AI systems.