Audio ML Papers

Week of March 01 - March 08, 2026

Subcategories: All (44) | Speech Synthesis (7) | Music Synthesis (3) | Ambient Synthesis (0) | Quality Assessment (0) | Enhancement (2) | ASR (12) | Other (20)

🏆 Top Papers This Week

#1 TOP PAPER (Score: 84)
Gaia A. Bertolino, Yuwei Zhang, Tong Xia ... · arXiv
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings...
#2 TOP PAPER (Score: 84)
Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel ... · arXiv
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specia...
#3 TOP PAPER (Score: 83)
Siminfar Samakoush Galougah, Pranav Pulijala, Ramani Duraiswami · arXiv
A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level a...
Saturday, March 07, 2026
Zahra Mansour, Verena Uslar, Dirk Weyhe ... · arXiv
Bowel sounds (BS) are typically momentary and have low amplitude, making them difficult to detect accurately through manual auscultation. This leads to significant variability in clinical assessment. Digital acoustic sensors allow the acquisition of high-quality BS and enable aut...
Wenjie Tian, Mingchen Shao, Bingshen Mu ... · arXiv
Audio-visual speech recognition (AVSR) is an extension of ASR that incorporates visual signals. Current AVSR approaches primarily focus on lip motion, largely overlooking the rich context present in the video, such as the speaking scene and on-screen text. To tackle such CAVSR (AVSR inclu...
Friday, March 06, 2026
Gaia A. Bertolino, Yuwei Zhang, Tong Xia ... · arXiv
Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings...
Zakaria Aldeneh, Skyler Seto, Maureen de Seyssel ... · arXiv
Modern ASR systems are typically trained on large-scale pseudo-labeled, in-the-wild data spanning multiple domains. While such heterogeneous data benefit generalist models designed for broad deployment, they pose challenges for specialist models targeting specific domains: specia...
Ajinkya Kulkarni, Sandipana Dowerah, Atharva Kulkarni ... · arXiv
Self-supervised learning (SSL) underpins modern audio deepfake detection, yet most prior work centers on a single large wav2vec2-XLSR backbone, leaving compact models understudied. We present RAPTOR, Representation Aware Pairwise-gated Transformer for Out-of-domain Recognition, a contro...
Daixian Li, Jun Xue, Yanzhen Ren ... · arXiv
Recent advances in speech synthesis and voice conversion have greatly improved the naturalness and authenticity of generated audio. Meanwhile, evolving encoding, compression, and transmission mechanisms on social media platforms further obscure deepfake artifacts. These factors c...
Junhyeok Lee, Xiluo He, Jihwan Lee ... · arXiv
Neural audio codecs optimized for mel-spectrogram reconstruction often fail to preserve intelligibility. While semantic encoder distillation improves encoded representations, it does not guarantee content preservation in reconstructed speech. In this work, we demonstrate that sel...
Nikita Kuzmin, Kong Aik Lee, Eng Siong Chng · arXiv
We address the challenge of preserving emotional content in streaming speaker anonymization (SA). Neural audio codec language models trained for audio continuation tend to degrade source emotion: content tokens discard emotional information, and the model defaults to dominant aco...
Changsong Liu, Tianrui Wang, Ye Ni ... · arXiv
Streaming TTS that receives streaming text is essential for interactive systems, yet this scheme faces two major challenges: unnatural prosody due to missing lookahead and long-form collapse due to unbounded context. We propose a prosodic-boundary-aware post-training strategy, ad...
Hoseong Ahn, Jeongyun Chae, Yoonji Park ... · arXiv
Long-form speech recognition with large encoder-decoder models such as Whisper often exhibits hallucinations, repetition loops, and content omissions. These errors can accumulate and be further amplified when the previous segment's transcription is used as decoding context. We pro...
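One of the failure modes named above, repetition loops, can be screened for with a simple word n-gram counter. This is an illustrative sketch, not the paper's method; the function name and thresholds are hypothetical:

```python
def has_repetition_loop(text: str, n: int = 2, max_repeats: int = 2) -> bool:
    """Flag a transcript if any word n-gram occurs more than max_repeats times."""
    words = text.split()
    counts: dict[tuple, int] = {}
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        counts[gram] = counts.get(gram, 0) + 1
    return any(c > max_repeats for c in counts.values())
```

A looping decode like "thank you thank you thank you thank you" trips the check, while ordinary sentences pass.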
Jinuo Sun, Yang Xiao, Sung Kyun Chung ... · arXiv
Accent variability remains a major source of errors in automatic speech recognition, yet most adaptation methods rely on parameter fine-tuning without understanding where accent information is encoded. We treat accent variation as an interpretable subspace in hidden representations and inv...
Thursday, March 05, 2026
Jihwan Lee, Parsa Razmara, Kevin Huang ... · arXiv
Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological ...
Marvin Lavechin, Elika Bergelson, Roger Levy · arXiv
Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribe...
Jielin Qiu, Zixiang Chen, Liangwei Yang ... · arXiv
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to ...
Yen-Shan Chen, Shih-Yu Lai, Ying-Jung Tsou ... · arXiv
While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the impercept...
Han Yin, Yang Xiao, Rohan Kumar Das ... · arXiv
Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake...
Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin ... · arXiv
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline lever...
Linghan Fang, Tianxin Xie, Li Liu · arXiv
Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue...
Akif Islam, Raufun Nahar, Md. Ekramul Hamid · IEEE Conference Paper
Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero...
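The "recognition accuracy" at stake in the entry above is conventionally measured as word error rate (WER), the word-level edit distance normalized by reference length. A minimal self-contained sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: Levenshtein distance over words, divided by reference length."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))  # DP row for the empty reference prefix
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution (0 if match)
        prev = cur
    return prev[-1] / len(r)
```

For example, one substitution against a three-word reference yields a WER of 1/3.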
Aemon Yat Fei Chiu, Yujia Xiao, Qiuqiang Kong ... · arXiv
Voice timbre attribute detection (vTAD) is the task of determining the relative intensity of timbre attributes between speech utterances. Voice timbre is a crucial yet inherently complex component of speech perception. While deep neural network (DNN) embeddings perform well in sp...
Wednesday, March 04, 2026
Cemal Hanilçi, Md Sahidullah, Tomi Kinnunen · arXiv
Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven...
Zachary Novack, Zack Zukowski, CJ Carr ... · ICASSP 2026
Generative audio requires fine-grained controllable outputs, yet most existing methods require model retraining on specific controls or inference-time controls (e.g., guidance) that can also be computationally demanding. By examining the bottlenecks of existing guidance-...
Fei Su, Cancan Li, Juan Liu ... · arXiv
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AV...
Kevin Wilkinghoff, Sarthak Yadav, Zheng-Hua Tan · arXiv
Training-free anomalous sound detection (ASD) based on pre-trained audio embedding models has recently garnered significant attention, as it enables the detection of anomalous sounds using only normal reference data while offering improved robustness under domain shifts. However,...
Fabian Ritter-Gutierrez, Md Asif Jalal, Pablo Peso Parada ... · arXiv
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whisper and voiced recordings and lack of paired data. We propose Fl...
Tuesday, March 03, 2026
Szu-Wei Fu, Rong Chao, Xuesong Yang ... · arXiv
Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion--perception tradeoff, and data curation remain unresolved....
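The signal-fidelity side of the distortion-perception tradeoff mentioned above is often quantified with scale-invariant SDR (SI-SDR). A toy implementation on plain Python lists (SI-SDR is a standard metric, not necessarily the paper's choice):

```python
import math

def si_sdr(ref: list[float], est: list[float]) -> float:
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / ref_energy                      # project estimate onto reference
    target = [alpha * r for r in ref]             # scaled target component
    noise = [e - t for e, t in zip(est, target)]  # residual distortion
    return 10 * math.log10(sum(t * t for t in target) / sum(n * n for n in noise))
```

Because the target is rescaled by the projection, multiplying the estimate by any constant gain leaves the score unchanged, which is the point of the "scale-invariant" part.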
Mathuranathan Mayuravaani, W. Bastiaan Kleijn, Andrew Lensen ... · arXiv
This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increa...
Franziska Braun, Christopher Witzl, Florian Hönig ... · LREC 2026
Early and accessible detection of Alzheimer's disease (AD) remains a major challenge, as current diagnostic methods often rely on costly and invasive biomarkers. Speech and language analysis has emerged as a promising non-invasive and scalable approach to detecting cognitive impa...
Kashaf Gulzar, Korbinian Riedhammer, Elmar Nöth ... · arXiv
Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents...
Xin Wang, Ge Wanying, Junichi Yamagishi · arXiv
Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (S...
Monday, March 02, 2026
Hashim Ali, Nithin Sai Adupa, Surya Subramani ... · ICASSP
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we...
Siminfar Samakoush Galougah, Pranav Pulijala, Ramani Duraiswami · arXiv
A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level a...
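The interaural level differences (ILDs) the abstract refers to can be illustrated with a toy computation over two ear channels (function name and framing are hypothetical, not the paper's code):

```python
import math

def ild_db(left: list[float], right: list[float]) -> float:
    """Interaural level difference: ratio of left/right ear energies, in dB."""
    e_left = sum(x * x for x in left)
    e_right = sum(x * x for x in right)
    return 10 * math.log10(e_left / e_right)
```

A source closer to the left ear yields higher left-channel energy and hence a positive ILD; doubling the left amplitude relative to the right gives about +6 dB.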
Minghui Wu, Xueling Liu, Jiahuan Fan ... · 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Singapore, 2025, pp. 1104-1109
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pa...
Minghui Wu, Haitao Tang, Jiahuan Fan ... · 2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Singapore, 2025, pp. 1092-1097
Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly...
Ya Jiang, Ruoyu Wang, Jingxuan Zhang ... · arXiv
This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialo...
Loan Do, Thanh Ngoc Nguyen, Thanh Pham ... · arXiv
We introduce VietSuperSpeech, a large-scale Vietnamese automatic speech recognition (ASR) dataset of 52,023 audio-text pairs totaling 267.39 hours, with a distinctive focus on casual conversational speech. Unlike existing Vietnamese ASR corpora that predominantly feature read spe...
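As a quick sanity check on the corpus statistics quoted above, the average clip length follows directly from the stated totals:

```python
# VietSuperSpeech totals as stated in the abstract
hours = 267.39
pairs = 52023

# average utterance duration in seconds
avg_seconds = hours * 3600 / pairs
```

This works out to roughly 18.5 seconds per audio-text pair, consistent with long-form conversational clips rather than short read sentences.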
Kirill Borodin, Vasiliy Kudryavtsev, Maxim Maslov ... · arXiv
We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To eva...
Lixing He, Zhouxuan Chen, Mingshuai Liu ... · arXiv
We propose TQCodec, a neural audio codec designed for high-bitrate, high-fidelity music streaming. Unlike existing neural codecs that primarily target ultra-low bitrates (<= 16 kbps), TQCodec operates at 44.1 kHz and supports bitrates from 32 kbps to 128 kbps, aligning with the st...
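For context on how codec bitrates like those above arise: in a typical residual-vector-quantized neural codec, the bitrate is frames per second times codebooks per frame times bits per codebook index. The numbers below are illustrative, not TQCodec's actual configuration:

```python
import math

def codec_bitrate_kbps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bitrate = frames/s * codebooks/frame * bits per codebook index, in kbps."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * n_codebooks * bits_per_index / 1000
```

For example, 75 frames/s with 8 codebooks of 1024 entries (10 bits each) gives 6 kbps; reaching the 32-128 kbps range quoted above requires proportionally more frames or codebooks.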
Sunday, March 01, 2026
Pengfei Zhang, Tianxin Xie, Minghao Yang ... · arXiv
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is ...
Hongrui Wang, Fan Zhang, Zhiyuan Yu ... · ICLR 2026
Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between trac...
Yanir Marmor, Arad Zulti, David Krongauz ... · arXiv
Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 s...