A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural time and level differences (ITD/ILD). While Head-Related Transfer Function (HRTF) models based on rigid scattering suffice for terrestrial humans, they fail in underwater environments due to the near-impedance match between water and soft tissue. Motivated by the acoustic anatomy of underwater animals, we introduce a novel, analytically derived, closed-form forward model for scattering from a semi-transparent sphere containing two rigid spherical scatterers. This model accurately maps source direction, frequency, and material properties to the pressure field, capturing the complex physics of layered, penetrable structures. Critically, our model is implemented in a fully differentiable setting, enabling its integration with a machine learning algorithm to optimize a cost function for active localization. We demonstrate enhanced convergence for localization under noise using a physics-informed frequency weighting scheme, and present accurate moving-source tracking via an Extended Kalman Filter (EKF) with analytically computed Jacobians. Our work suggests that differentiable models of scattering from layered rigid and transparent geometries offer a promising new foundation for microphone arrays that leverage scattering-based spatial cues over conventional beamforming, in both terrestrial and underwater settings. Our model will be made open source.
Primary: University of Maryland
All Institutions: University of Maryland, SDU, Reality Lab
The paper introduces a novel differentiable multi-sphere scattering model for underwater spatial audio cues, bridging the gap between biological principles and machine learning applications. The comprehensive methodology and robust experimental validation underscore its potential impact on acoustic sensing technologies.
The paper presents a novel analytical framework for modeling sound scattering in underwater environments, utilizing a differentiable multi-sphere scattering model. The approach is grounded in biological principles of spatial hearing and employs multipole expansions to derive a closed-form solution. The implementation in a differentiable programming framework (JAX) allows for efficient gradient-based optimization, which is a significant advancement over traditional methods that do not provide gradients. This differentiability enables the integration of the model with machine learning algorithms for active localization, showcasing a well-thought-out methodology that bridges physics-based modeling and machine learning.
The experiments conducted validate the proposed model through simulations that demonstrate its ability to accurately capture interaural level differences (ILD) and interaural time differences (ITD) under various conditions. The results show that the model effectively generates realistic binaural cues and performs robustly in source localization tasks, even under noise. The use of an Extended Kalman Filter (EKF) for tracking moving sources further emphasizes the practical applicability of the model. The experiments are comprehensive, covering various source directions and noise levels, which strengthens the findings.
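To make the tracking setup concrete, here is a minimal sketch (not the authors' code) of one EKF step in JAX, where the placeholder measurement model h stands in for the closed-form scattering solution and the Jacobian comes from jax.jacfwd rather than the paper's analytic expressions; the function names, toy state layout, and noise covariances are assumptions for illustration.

```python
import jax
import jax.numpy as jnp

def h(x):
    # Placeholder measurement model: maps a source state (azimuth, elevation) to a
    # hypothetical 4-channel pressure magnitude; the paper's model would evaluate the
    # closed-form multi-sphere scattering solution here instead.
    az, el = x[0], x[1]
    return jnp.array([jnp.cos(az) * jnp.cos(el),
                      jnp.sin(az) * jnp.cos(el),
                      jnp.sin(el),
                      jnp.cos(az - el)])

def ekf_step(x, P, z, Q, R):
    # Predict with an identity motion model (for brevity), then update.
    x_pred, P_pred = x, P + Q
    H = jax.jacfwd(h)(x_pred)                 # measurement Jacobian via autodiff
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ jnp.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (jnp.eye(x.shape[0]) - K @ H) @ P_pred
    return x_new, P_new

x0, P0 = jnp.array([0.3, 0.1]), jnp.eye(2) * 0.1
z = h(jnp.array([0.35, 0.12])) + 0.01         # noisy observation of a nearby state
print(ekf_step(x0, P0, z, Q=jnp.eye(2) * 1e-3, R=jnp.eye(4) * 1e-2))
```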
While the paper mentions that the model will be made open source, specific details regarding the implementation and access to the code are not provided. This lack of direct access to the code and data limits the reproducibility of the results. However, the detailed methodology and equations presented allow for potential replication by researchers with sufficient expertise in the field.
The primary limitation of the study is the reliance on a simplified geometric model that may not capture all complexities of real-world underwater environments. Additionally, the experiments are conducted in a controlled simulation setting, which may not fully represent the challenges faced in practical applications, such as reverberation and multi-source scenarios. The model's performance in more complex acoustic environments remains to be tested.
This research has significant implications for the development of advanced acoustic sensing systems, particularly in underwater environments where traditional methods struggle. The ability to accurately model sound scattering and utilize spatial cues for localization can enhance various applications, including marine biology research, underwater navigation, and surveillance. The open-source nature of the model could foster further research and development in this area, promoting collaboration and innovation.
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
Primary: University of Michigan
All Institutions: University of Michigan
The paper presents Spoof-SUPERB, a benchmark for evaluating SSL models in audio deepfake detection, filling a critical gap in the literature. The technical contributions are significant, providing a systematic framework for assessing model performance and robustness, which is essential for advancing the field of speech processing in the context of security.
The paper introduces a novel benchmarking framework, Spoof-SUPERB, specifically designed for evaluating self-supervised learning (SSL) models in the context of audio deepfake detection. The methodology is well-structured, utilizing a unified protocol for training and evaluation across multiple datasets, which enhances comparability. The choice of models and the systematic evaluation of their performance under various conditions, including acoustic degradations, is a significant strength. However, the paper could benefit from a more detailed discussion of the specific training and evaluation protocols used, as well as the rationale behind the selection of datasets.
The experiments are comprehensive, involving 20 different SSL models evaluated on multiple datasets, which provides a robust analysis of model performance. The results clearly demonstrate the superiority of large-scale discriminative models over generative ones, particularly in terms of resilience to noise and other acoustic degradations. The use of Equal Error Rate (EER) as a performance metric is appropriate for the task, although additional metrics could provide a more nuanced view of model performance.
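Since EER is the headline metric throughout, a small illustrative computation may help; this is not the benchmark's official scoring script, and the toy labels and scores are made up. It assumes higher detector scores indicate bona fide speech.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    # labels: 1 for bona fide, 0 for spoof; scores: detector outputs
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # operating point where FPR is closest to FNR
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.35, 0.1, 0.7, 0.6])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```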
The paper emphasizes reproducibility by establishing a fixed training setup and evaluation protocol, which is crucial for benchmarking in machine learning. However, the absence of a publicly accessible code repository or detailed implementation guidelines limits the ability of other researchers to reproduce the results. Providing such resources would significantly enhance the paper's impact.
One limitation is the potential overlap between the pretraining data of some models and the evaluation datasets, which could bias the results. Additionally, while the paper addresses robustness under various acoustic conditions, it does not explore the implications of different synthesis methods for audio deepfakes, which could be a critical area for future research.
The introduction of a standardized benchmark for audio deepfake detection is a timely contribution, given the increasing prevalence of deepfake technologies and their implications for security and trust in audio communications. This work could pave the way for further advancements in antispoofing techniques and the development of more secure speech processing systems.
Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it enhances the error tolerance of TTS to ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on Tesla A100, with an average real-time factor (RTF) of 0.71. On the UASpeech dataset, it attains a mean opinion score (MOS) of 4.67 and demonstrates a 54.25% relative reduction in word error rate (WER) compared to the state-of-the-art. Our demo is available at: https://wflrz123.github.io/
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFlytek Co., Ltd.
The paper presents a novel end-to-end simultaneous dysarthric speech reconstruction system that effectively addresses the challenges of intelligibility and latency through innovative methodologies. The technical contributions are significant, with promising experimental results that indicate a meaningful advancement in the field of speech processing for individuals with speech impairments.
The proposed end-to-end simultaneous dysarthric speech reconstruction (E2E-SDSR) system introduces innovative components such as a frame-level adaptor module and a multiple wait-k autoregressive TTS module. The frame-level adaptor effectively bridges the gap between ASR and TTS, enhancing the robustness of the system against ASR errors through explicit-implicit semantic information fusion. The multiple wait-k strategy in the TTS module allows for flexibility in processing, balancing latency and prosody quality. The methodology is well-structured, with a clear focus on addressing the unique challenges posed by dysarthric speech, particularly in terms of intelligibility and naturalness.
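To illustrate the latency/quality trade-off that a wait-k policy encodes, the following is a generic scheduling sketch rather than the paper's implementation; wait_k_schedule, decode_step, and the stream contents are hypothetical placeholders for the adaptor-to-TTS interface.

```python
def wait_k_schedule(source_stream, decode_step, k=3):
    """source_stream: iterable of incoming ASR/adaptor frames.
    decode_step: callable taking the frames read so far and returning one output chunk."""
    read, outputs = [], []
    for frame in source_stream:
        read.append(frame)
        if len(read) >= k:                      # after an initial wait of k frames...
            outputs.append(decode_step(read))   # ...emit one output chunk per new frame
    # in practice decoding continues after the stream closes until an end-of-utterance token
    return outputs

# toy usage with a stub decoder: larger k delays output but gives each step more context
print(wait_k_schedule(range(6), decode_step=lambda ctx: f"chunk@{len(ctx)}", k=3))
```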
The experiments are comprehensive, utilizing both a commercial dysarthric speech dataset and the UASpeech dataset. The reported results, including a mean opinion score (MOS) of 4.67 and a 54.25% reduction in word error rate (WER), demonstrate significant improvements over existing methods. The ablation studies provide valuable insights into the contributions of each component of the proposed system, reinforcing the effectiveness of the adaptor and wait-k strategies.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as hyperparameter settings, training duration, and the exact architecture configurations used. The absence of a publicly available code repository limits the reproducibility of the results.
The study primarily focuses on dysarthric speech and may not generalize well to other speech disorders or languages. Additionally, the reliance on a limited dataset for training and testing could affect the robustness of the model in real-world applications. The paper does not address potential biases in the dataset or the implications of using commercial data.
The proposed system has the potential to significantly improve communication for individuals with dysarthria, enhancing their quality of life and social interactions. By providing a more efficient and intelligible speech reconstruction method, it could be applied in various assistive technologies and communication devices.
Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances.
Primary: University of Southern California
All Institutions: University of Southern California
This paper presents a pioneering approach to simultaneously capture real-time MRI, EEG, and surface EMG data during speech production, offering valuable insights into the neurophysiological processes underlying speech. The innovative artifact suppression techniques and the potential applications in BCI and speech science highlight its significance in advancing the field.
The methodology presented in this paper is innovative, combining real-time MRI, EEG, and surface EMG to capture the complex dynamics of speech production. The authors have developed a multi-stage denoising pipeline to address significant technical challenges, including electromagnetic interference and myogenic artifacts. The use of canonical correlation analysis (CCA) for artifact removal is particularly noteworthy, as it allows for the effective suppression of non-neural signals while preserving the underlying neural activity. However, the methodology could benefit from further validation across a larger cohort to establish its robustness and generalizability.
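As a rough illustration of CCA-based artifact suppression (not the paper's pipeline), the sketch below computes canonical components between the EEG and a one-sample-delayed copy and zeroes the least autocorrelated ones, which tend to be broadband myogenic activity; the function name, the number of retained components, and the reconstruction via the pseudo-inverse of the rotation matrix are simplifying assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_denoise(eeg, n_keep):
    """eeg: (n_samples, n_channels) array; keep the n_keep most autocorrelated components."""
    mean = eeg.mean(axis=0)
    x, y = eeg[:-1] - mean, eeg[1:] - mean        # signal and a one-sample-delayed copy
    n_comp = eeg.shape[1]
    cca = CCA(n_components=n_comp, scale=False).fit(x, y)
    sources = cca.transform(x)                    # canonical components, ordered by correlation
    sources[:, n_keep:] = 0.0                     # zero the least autocorrelated (artifact-like) ones
    return sources @ np.linalg.pinv(cca.x_rotations_) + mean   # back to channel space

rng = np.random.default_rng(0)
eeg = rng.standard_normal((1000, 8))              # toy data standing in for recorded EEG
print(cca_denoise(eeg, n_keep=4).shape)           # (999, 8)
```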
The experimental design is well-structured, focusing on a single subject to explore the feasibility of simultaneous data acquisition. The tasks are clearly defined, and the results demonstrate the effectiveness of the artifact removal techniques. However, the reliance on a single participant limits the generalizability of findings. The authors provide thorough comparisons of EEG signals before and after denoising, showcasing significant improvements in signal quality, which is a strong point of the experimental evaluation.
The paper provides detailed descriptions of the experimental setup, data acquisition methods, and artifact correction techniques, which are essential for reproducibility. However, the lack of a publicly available dataset or code repository hinders full reproducibility of the results. Future work should include sharing data and methodologies to allow other researchers to validate and build upon these findings.
The primary limitations include the small sample size (single-subject study), which restricts the ability to generalize findings. Additionally, the use of passive electrodes may introduce higher noise levels compared to active electrodes, potentially affecting data quality. The EEG cap's design may not be optimal for capturing speech-specific brain activity, and residual artifacts from the EMG setup could still influence results. Lastly, the potential impact of scanner noise and visual stimuli on neural activity remains a concern.
This research has significant implications for both speech neuroscience and brain-computer interface (BCI) technologies. By providing a comprehensive view of the neural, muscular, and articulatory components of speech production, the findings could lead to advancements in silent speech interfaces and improved understanding of speech disorders. The methodology could pave the way for future studies exploring the intricacies of speech planning and execution, potentially transforming approaches to speech rehabilitation and communication technologies.
Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. We use TinyVox to train BabAR, a cross-linguistic phoneme recognition system for child speech. We find that pretraining the system on multilingual child-centered daylong recordings substantially outperforms alternatives, and that providing 20 seconds of surrounding audio context during fine-tuning further improves performance. Error analyses show that substitutions predominantly fall within the same broad phonetic categories, suggesting suitability for coarse-grained developmental analyses. We validate BabAR by showing that its automatic measures of speech maturity align with developmental estimates from the literature.
Primary: Harvard University
All Institutions: Harvard University, Massachusetts Institute of Technology
The paper presents BabAR, a pioneering phoneme recognition system for child speech, demonstrating significant advancements in automatic phonetic analysis through innovative methodology and extensive experimental validation.
The paper introduces a novel phoneme recognition system, BabAR, tailored for child speech, leveraging a large-scale dataset, TinyVox, which encompasses diverse languages and extensive child vocalizations. The methodology includes pretraining on multilingual data and context-aware fine-tuning, which are innovative approaches in the domain of child speech recognition. The use of Connectionist Temporal Classification (CTC) for sequence-to-sequence tasks is appropriate given the challenges of variable-length outputs in phoneme recognition. The systematic evaluation of different self-supervised models and the exploration of context duration for improving recognition accuracy are well-structured and contribute significantly to the methodology.
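As context for the CTC setup, here is a minimal fine-tuning skeleton under assumed interfaces; the encoder stub, hidden size, and class layout are illustrative and not taken from the released BabAR code. A pretrained speech encoder feeds a linear head over the phoneme inventory plus a blank symbol, trained with CTC loss on frame-level log-probabilities.

```python
import torch
import torch.nn as nn

class PhonemeRecognizer(nn.Module):
    def __init__(self, encoder, hidden_dim, n_phonemes):
        super().__init__()
        self.encoder = encoder                              # e.g., a self-supervised speech model
        self.head = nn.Linear(hidden_dim, n_phonemes + 1)   # +1 class for the CTC blank

    def forward(self, waveforms):
        feats = self.encoder(waveforms)                     # (batch, frames, hidden_dim)
        return self.head(feats).log_softmax(dim=-1)         # frame-level phoneme log-probs

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
# nn.CTCLoss expects (frames, batch, classes):
# loss = ctc(log_probs.transpose(0, 1), phoneme_targets, input_lengths, target_lengths)
```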
The experiments are robust, comparing BabAR against state-of-the-art phoneme recognition systems and demonstrating significant performance improvements. The paper provides detailed error analysis, illustrating that BabAR's substitutions tend to remain within phonetic categories, which is crucial for developmental analyses. The validation of BabAR's performance against a longitudinal dataset supports its practical applicability in developmental research. However, the paper could benefit from more extensive comparisons with existing systems and additional metrics beyond phoneme error rates to fully capture the model's effectiveness.
The authors provide sufficient implementation details, including model architecture, training procedures, and evaluation metrics, which enhance reproducibility. The availability of the dataset and code on GitHub is a significant step towards ensuring that other researchers can replicate the study and build upon it. However, the paper could improve by including more explicit instructions for setting up the environment and dependencies.
The study acknowledges challenges in phonetic transcription, particularly the subjective nature of human annotation and the presence of competing signals in naturalistic recordings. While BabAR shows promise, the reliance on coarse-grained measures for validation may not guarantee accuracy at the individual level, which is critical for clinical applications. Additionally, the dataset's diversity in terms of language and age could introduce variability that may affect generalization.
The development of BabAR and TinyVox has the potential to revolutionize the study of early speech development by enabling large-scale, automated phonetic analysis. This could facilitate early detection of speech and language delays, enhance cross-linguistic studies, and improve educational tools for language learning. The integration of advanced machine learning techniques with developmental science opens up new avenues for research and practical applications in child language acquisition.
We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction (~13 s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT → LLM → TTS, where each component streams its output to the next; and (3) the key to "realtime" is not any single fast model but rather streaming and pipelining across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947 ms (best case 729 ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on an NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
Primary: Salesforce AI Research
All Institutions: Salesforce AI Research
The paper provides a comprehensive tutorial for building enterprise-grade realtime voice agents from scratch, emphasizing the importance of streaming and pipelining in achieving low latency. The technical contributions and methodology are significant, offering valuable insights and practical tools for researchers and practitioners in the field of audio machine learning.
The paper presents a systematic approach to building enterprise-grade realtime voice agents by dissecting the components of speech-to-text (STT), language model (LLM), and text-to-speech (TTS) into a cascaded streaming pipeline. The authors emphasize the importance of streaming and pipelining rather than relying on a single fast model, which is a critical insight for achieving low latency in voice interactions. The tutorial format is effective, providing a step-by-step guide that includes empirical evaluations of various models, thus making the methodology accessible and practical for developers.
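The streaming-and-pipelining point can be shown with a toy asyncio pipeline; the stubbed stt/llm/tts stages below are placeholders, not Deepgram, vLLM, or ElevenLabs clients, and only illustrate how each stage pushes partial output downstream so synthesis can begin before generation finishes.

```python
import asyncio

async def stt(audio_chunks, text_q):
    async for chunk in audio_chunks:
        await text_q.put(f"word_from({chunk})")      # streaming partial transcripts
    await text_q.put(None)

async def llm(text_q, token_q):
    while (word := await text_q.get()) is not None:
        await token_q.put(f"reply_token({word})")    # streaming generated tokens
    await token_q.put(None)

async def tts(token_q):
    while (tok := await token_q.get()) is not None:
        print("play audio for", tok)                 # audio starts before the LLM finishes

async def main():
    async def mic():                                 # stand-in for a microphone stream
        for i in range(3):
            yield i
    text_q, token_q = asyncio.Queue(), asyncio.Queue()
    await asyncio.gather(stt(mic(), text_q), llm(text_q, token_q), tts(token_q))

asyncio.run(main())
```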
The experiments conducted are robust, comparing the performance of native speech-to-speech models against a cascaded pipeline approach. The authors provide detailed latency measurements for each component, demonstrating the effectiveness of their proposed architecture in achieving sub-1-second time-to-first-audio (TTFA). The empirical results are well-documented, showcasing the advantages of their approach in real-world scenarios, which adds credibility to their findings.
The paper includes a comprehensive codebase released as open-source, which is a significant advantage for reproducibility. The detailed tutorial format, along with the release of tested code for each component, allows other researchers and practitioners to replicate the results and build upon the work. However, the paper could benefit from clearer documentation on the specific environments and dependencies required to run the code effectively.
One limitation noted is the reliance on cloud APIs for some components, which may introduce variability in performance due to network latency. Additionally, the findings are based on specific models and configurations, which may not generalize across all potential implementations. The authors also acknowledge that native speech-to-speech models are not yet viable for real-time applications, which highlights the current constraints in the field.
This work has significant implications for the development of voice-based AI agents in enterprise settings, particularly in applications such as customer service, healthcare, and task management. By providing a clear framework and practical guidance, the paper can facilitate the adoption of real-time voice agents, potentially transforming user interactions across various industries.
While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the imperceptible waveform variations used in prior watermarking methods. To address this limitation, we propose Latent-Mark, the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the codec's invariant latent space. We achieve this by optimizing the audio waveform to induce a detectable directional shift in its encoded latent representation, while constraining perturbations to align with the natural audio manifold to ensure imperceptibility. To prevent overfitting to a single codec's quantization rules, we introduce Cross-Codec Optimization, jointly optimizing the waveform across multiple surrogate codecs to target shared latent invariants. Extensive evaluations demonstrate robust zero-shot transferability to unseen neural codecs, achieving state-of-the-art resilience against traditional DSP attacks while preserving perceptual imperceptibility. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
Primary: National Taiwan University
All Institutions: National Taiwan University, CyCraft AI Lab, MoonShine Animation Studio, RIKEN Center for Computational Science
The main contribution of this paper is the introduction of Latent-Mark, a novel zero-bit audio watermarking framework that effectively survives neural resynthesis by embedding watermarks within the latent space of audio codecs. This work represents a meaningful advancement in audio watermarking, addressing vulnerabilities posed by modern neural codecs and providing a foundation for future research in universal watermarking techniques.
The methodology presented in Latent-Mark is innovative, leveraging the concept of embedding watermarks within the invariant latent space of neural audio codecs. The approach of optimizing the audio waveform to induce a detectable shift while ensuring imperceptibility is a significant advancement over traditional methods. The introduction of Cross-Codec Optimization is particularly noteworthy, as it addresses the challenge of overfitting to specific codec characteristics, enhancing the generalizability of the watermarking technique across different audio codecs.
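A conceptual sketch of the latent-space objective follows (hypothetical codec stubs and key directions, not the Latent-Mark implementation): the waveform perturbation is optimized so that each surrogate codec's latent moves toward a secret direction, while an L2 term keeps the perturbation small for imperceptibility.

```python
import torch
import torch.nn.functional as F

def watermark_loss(delta, audio, codecs, keys, lam=1.0):
    """delta: learnable waveform perturbation; codecs: list of encoders waveform -> latent;
    keys: matching list of secret direction tensors, one per codec latent space."""
    loss = lam * delta.pow(2).mean()                        # imperceptibility budget
    for encode, key in zip(codecs, keys):
        z = encode(audio + delta)                           # latent of the watermarked audio
        loss = loss - F.cosine_similarity(z.flatten(1), key.flatten(1)).mean()
    return loss

# toy usage with stand-in "codecs" (linear projections) and random key directions
torch.manual_seed(0)
audio = torch.randn(1, 16000)
delta = torch.zeros_like(audio, requires_grad=True)
codecs = [torch.nn.Linear(16000, 64), torch.nn.Linear(16000, 128)]
keys = [torch.randn(1, 64), torch.randn(1, 128)]
opt = torch.optim.Adam([delta], lr=1e-3)
for _ in range(10):
    opt.zero_grad()
    watermark_loss(delta, audio, codecs, keys).backward()
    opt.step()
```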
The paper provides extensive evaluations demonstrating the robustness of the proposed method against both traditional DSP attacks and neural resynthesis. The experiments are well-structured, showcasing the performance of Latent-Mark in various scenarios, including zero-shot transferability to unseen codecs. The results indicate a strong resilience to attacks while maintaining perceptual quality, which is crucial for practical applications.
The paper lacks detailed implementation specifics, such as code availability or datasets used for training and evaluation, which could hinder reproducibility. Providing a GitHub repository or links to datasets would significantly enhance the reproducibility of the results.
One limitation of the study is the potential dependency on the specific codecs chosen for Cross-Codec Optimization. While the method shows promise, its performance on a broader range of codecs, especially those not included in the training phase, remains to be fully explored. Additionally, the paper does not address the computational complexity of the optimization process, which could impact real-time applications.
The implications of this research are significant, as it opens avenues for secure audio transmission and copyright protection in an era where neural codecs are becoming prevalent. The ability to maintain watermark integrity against advanced generative models could have far-reaching applications in media, entertainment, and digital rights management.
Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.
Primary: University of Melbourne
All Institutions: University of Melbourne; School of Electrical Engineering, Republic of Korea
The paper presents the first ESDD challenge, providing a foundational framework for advancing the detection of environmental sound deepfakes. Its comprehensive methodology, extensive experimental results, and insights into future research directions mark a significant contribution to the field of audio deepfake detection.
The paper introduces a structured approach to environmental sound deepfake detection (ESDD) through the formulation of a challenge that includes a well-defined dataset (EnvSDD) and evaluation protocols. The methodology is robust, focusing on two distinct tracks that assess generalization across unseen generators and black-box scenarios, which are critical for real-world applications. The use of diverse audio generation models and the emphasis on cross-generator generalization are notable strengths. However, the paper could benefit from a more detailed explanation of the architectural choices made by the top-performing systems.
The experimental evaluation is comprehensive, with a large number of submissions (1,748) from 97 teams, indicating significant interest and engagement in the challenge. The results are systematically presented, showcasing the performance of baseline systems and top submissions across different tracks. The use of the Equal Error Rate (EER) as a metric is appropriate for the task, and the analysis of system design trends provides valuable insights into effective strategies for ESDD.
While the paper mentions the availability of the EnvSDD dataset and the challenge results, it lacks detailed implementation specifics that would facilitate reproducibility. The inclusion of code repositories or links to the actual implementations of the top-performing systems would enhance reproducibility and allow other researchers to build upon this work.
One limitation is the potential overfitting of models to specific generators, as indicated by performance degradation on unseen generators. Additionally, the challenge does not address the potential for adversarial attacks on detection systems, which could be a significant concern in practical applications. The reliance on a specific evaluation metric (EER) may also limit the understanding of model performance across different contexts.
The implications of this work are significant, as it addresses a growing concern in the realm of audio deepfakes, which can have serious consequences for public safety and misinformation. The establishment of a benchmark for ESDD could catalyze further research and development in this area, leading to more robust detection systems that can be applied in various real-world scenarios, including security and media verification.
Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
Primary: Shanghai Jiao Tong University
All Institutions: National Taiwan University, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of TW-Sound580K, a specialized audio-text dataset, and the innovative methodologies for its curation and model adaptation, which significantly enhance the performance of audio-language models in localized contexts. The comprehensive approach to dataset construction and inference optimization represents a meaningful advancement in the field of machine learning for audio processing.
The paper introduces a robust methodology for constructing a large-scale audio-text dataset, TW-Sound580K, specifically targeting the unique acoustic characteristics of Taiwanese dialects. The Verify-Generate-Critique (VGC) protocol is a notable innovation, effectively addressing the challenges of data curation in a linguistically diverse context. The integration of Dual-ASR validation to filter and enhance the dataset quality is commendable, as it mitigates the risks of hallucinations in audio transcription. The dynamic Dual-ASR Arbitration mechanism further strengthens the inference process by selecting the most accurate transcription based on acoustic-conditioned perplexity, showcasing a thoughtful approach to model adaptation.
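As a hedged sketch of a dual-hypothesis arbitration rule: the paper's criterion is acoustic-conditioned perplexity, whereas below a plain perplexity over supplied token log-probabilities stands in as the selection score; the function names and toy numbers are illustrative only.

```python
import math

def pick_transcript(hypotheses, token_logprobs):
    """hypotheses: candidate transcripts from the two ASR systems.
    token_logprobs: per-token log-probabilities of each hypothesis under the scoring model."""
    def perplexity(logps):
        return math.exp(-sum(logps) / max(len(logps), 1))
    scores = [perplexity(lp) for lp in token_logprobs]
    best = min(range(len(hypotheses)), key=lambda i: scores[i])   # lower perplexity wins
    return hypotheses[best], scores

hyps = ["night market near the river", "knife market near the river"]
print(pick_transcript(hyps, [[-0.2, -0.1, -0.3, -0.2, -0.1],
                             [-0.9, -1.2, -0.3, -0.2, -0.1]]))
```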
The experimental validation of the Tai-LALM model on the TAU Benchmark demonstrates a significant performance improvement over the baseline, achieving 49.1% accuracy. This empirical evidence supports the effectiveness of the proposed dataset and methodology. The paper includes a comprehensive ablation study that isolates the contributions of various components, reinforcing the robustness of the findings. However, the reliance on a single benchmark may limit the generalizability of the results.
The authors provide a clear outline of their methodology, including the dataset construction process and the training setup for Tai-LALM. However, the lack of direct access to the raw audio data due to copyright constraints poses challenges for full reproducibility. The mention of providing source URLs and metadata upon de-anonymization is a positive step towards enabling future research.
The paper acknowledges several limitations, including the empirical nature of the VGC curation threshold, which may require recalibration for different regions. Additionally, the latency and VRAM overhead introduced by the Dual-ASR arbitration could hinder deployment in resource-constrained environments. The evaluation primarily focuses on the TAU Benchmark, which may not capture the full spectrum of performance across diverse acoustic scenarios.
This work has significant implications for the development of localized audio-language models, particularly in under-resourced linguistic regions. By addressing the localization gap, the proposed dataset and methodologies can enhance the performance of LALMs in understanding regional dialects and acoustic features. The framework established in this paper could serve as a model for similar efforts in other culturally rich but underrepresented areas.
Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential for improving model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method achieves higher accuracy while maintaining lower latency than existing TTA baselines. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of ASR-TRA, a novel test-time reinforcement adaptation framework that enhances the robustness of automatic speech recognition systems through causal interventions and semantic reward modeling. This work represents a significant step forward in addressing the challenges of deploying ASR systems in real-world conditions, providing a practical solution that balances accuracy and efficiency.
The proposed ASR-TRA framework introduces a novel approach to test-time adaptation (TTA) in automatic speech recognition (ASR) by leveraging reinforcement learning (RL) and causal interventions. The methodology is well-structured, utilizing a learnable decoder prompt and temperature-controlled stochastic decoding to generate diverse transcription candidates. The integration of a reward model based on audio-text semantic alignment is a significant innovation that addresses the limitations of existing TTA methods, which often rely on pseudo-labeling or entropy minimization. The use of a Structural Causal Model (SCM) to formalize the adaptation process adds rigor to the approach, although the paper could benefit from a more detailed explanation of the causal relationships involved.
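An abstract sketch of the reward-weighted update loop may clarify the idea (stubbed sampler and reward, not the ASR-TRA code): several transcripts are sampled at a decoding temperature, scored by an audio-text alignment reward such as a CLAP-style score, and a REINFORCE-style objective pushes the model toward higher-reward candidates. The function names, argument conventions, and hyperparameters are assumptions.

```python
import torch

def tta_step(model, optimizer, audio, sample_fn, reward_fn, n_candidates=4, temperature=1.2):
    """sample_fn(model, audio, T) -> (transcript, summed log-prob tensor with grad);
    reward_fn(audio, transcript) -> float, e.g. an audio-text alignment score."""
    candidates = [sample_fn(model, audio, temperature) for _ in range(n_candidates)]
    rewards = torch.tensor([reward_fn(audio, text) for text, _ in candidates])
    baseline = rewards.mean()                          # variance-reduction baseline
    loss = torch.zeros(())
    for (_, logprob), r in zip(candidates, rewards):
        loss = loss - (r - baseline) * logprob         # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```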
The experiments conducted on the LibriSpeech and L2 Arctic datasets demonstrate the effectiveness of ASR-TRA in improving ASR robustness against noise and accent variations. The results indicate a significant reduction in word error rates (WER) compared to existing TTA methods, showcasing the practical applicability of the proposed framework. The ablation studies provide valuable insights into the contributions of different components, confirming the importance of both prompt tuning and reward modeling. However, the paper could enhance its experimental evaluation by including more diverse datasets and real-world scenarios to further validate the robustness of the method.
The paper provides sufficient details regarding the implementation of ASR-TRA, including the architecture, datasets, and evaluation metrics. The inclusion of hyperparameters and specific configurations aids in reproducibility. However, the lack of a comprehensive description of the training process and the absence of a public demo could hinder full reproducibility for other researchers.
One limitation of the proposed method is its reliance on the CLAP reward model, which may not generalize well across all types of audio inputs. Additionally, while the method shows improvements in accuracy and latency, the computational cost associated with generating multiple candidates and evaluating them could be a concern in resource-constrained environments. The paper also does not address potential scalability issues when deploying the model in real-time applications.
The ASR-TRA framework has the potential to significantly enhance the robustness of ASR systems in real-world applications, particularly in environments with high noise levels or diverse accents. This could lead to improved accessibility and user experience in various domains, including voice-activated assistants, transcription services, and communication aids for individuals with speech impairments. The focus on test-time adaptation without requiring ground-truth labels is particularly relevant for applications where labeled data is scarce or unavailable.
Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of the Segment Anything Model Audio (SAM-Audio), a recent foundation-scale speech enhancement model from Meta AI, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate (WER) and Character Error Rate (CER) compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. We therefore conducted a detailed utterance-level analysis to understand this counterintuitive result. We found that the recognition degradation is a systematic issue affecting the majority of the audio, not just isolated outliers, and that the errors worsen as the Whisper model size increases. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for machine recognition. This highlights the risk of blindly applying state-of-the-art denoising as a preprocessing step in zero-shot ASR pipelines.
Primary: University of Rajshahi
All Institutions: University of Rajshahi, Anan National College of Technology
The main contribution of this paper is the critical examination of the assumption that improving perceptual audio quality through denoising enhances ASR performance, revealing that such enhancements can actually degrade recognition accuracy in zero-shot ASR contexts. This comprehensive analysis challenges prevailing notions and underscores the need for ASR-aware approaches to speech preprocessing, thereby advancing the understanding of the interplay between audio quality and machine recognition.
The methodology is robust, employing a systematic empirical study to evaluate the impact of SAM-Audio on zero-shot ASR performance across two distinct datasets. The authors clearly outline their preprocessing pipeline, ASR models, and evaluation metrics, ensuring that the study is well-structured and reproducible. However, the reliance on a single variant of SAM-Audio due to computational constraints may limit the generalizability of the findings.
The experiments are comprehensive, covering multiple Whisper model variants and two linguistically diverse datasets. The use of WER and CER as primary metrics is appropriate for assessing ASR performance. The results consistently demonstrate that SAM-Audio preprocessing degrades ASR performance, which is a significant finding that challenges existing assumptions in the field.
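The core comparison can be reproduced in a few lines once transcripts are in hand; the sketch below uses jiwer for WER/CER with made-up example strings, and says nothing about the actual datasets, enhancement model, or ASR systems in the study.

```python
import jiwer

def compare_pipelines(references, raw_hyps, enhanced_hyps):
    # Error rates for transcripts of raw noisy audio vs. denoised audio, against the same references.
    return {
        "raw":      {"wer": jiwer.wer(references, raw_hyps),
                     "cer": jiwer.cer(references, raw_hyps)},
        "enhanced": {"wer": jiwer.wer(references, enhanced_hyps),
                     "cer": jiwer.cer(references, enhanced_hyps)},
    }

refs = ["the meeting starts at noon"]
print(compare_pipelines(refs, ["the meeting starts at noon"], ["the meetings start at noon"]))
```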
The paper provides sufficient detail regarding the experimental setup, including datasets and evaluation protocols, which facilitates reproducibility. However, the lack of access to the SAM-Audio model variants used in the experiments may hinder full reproducibility for other researchers.
The study is limited by the use of only the SAM-Audio Small variant and the focus on zero-shot ASR, which may not capture the full potential of the enhancement model. Additionally, the analysis is based on two datasets, which may not encompass the full range of real-world acoustic conditions.
This research has significant implications for the field of ASR and speech enhancement, as it highlights the risks of applying denoising techniques without considering their impact on recognition accuracy. The findings encourage a reevaluation of preprocessing strategies in ASR systems, particularly in zero-shot settings.
Speech deepfake detection (SDD) is essential for maintaining trust in voice-driven technologies and digital media. Although recent SDD systems increasingly rely on self-supervised learning (SSL) representations that capture rich contextual information, complementary signal-driven acoustic features remain important for modeling fine-grained structural properties of speech. Most existing acoustic front ends are based on time-frequency representations, which do not fully exploit higher-order spectral dependencies inherent in speech signals. We introduce a cyclostationarity-inspired acoustic feature extraction framework for SDD based on spectral correlation density (SCD). The proposed features model periodic statistical structures in speech by capturing spectral correlations between frequency components. In particular, we propose temporally structured SCD features that characterize the evolution of spectral and cyclic-frequency components over time. The effectiveness and complementarity of the proposed features are evaluated using multiple countermeasure architectures, including convolutional neural networks, SSL-based embedding systems, and hybrid fusion models. Experiments on ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5 demonstrate that SCD-based features provide complementary discriminative information to SSL embeddings and conventional acoustic representations. In particular, fusion of SSL and SCD embeddings reduces the equal error rate on ASVspoof 2019 LA from 8.28% to 0.98%, and yields consistent improvements on the challenging ASVspoof 5 dataset. The results highlight cyclostationary signal analysis as a theoretically grounded and effective front end for speech deepfake detection.
Primary: Bursa Technical University
All Institutions: Bursa Technical University, TCG CREST, University of Eastern Finland
The main contribution of this paper is the introduction of a cyclostationarity-based feature extraction framework for speech deepfake detection, which significantly enhances the detection capabilities by capturing spectral correlations that are often overlooked by conventional methods. This work represents a meaningful advancement in the field of audio signal processing and machine learning, particularly in the context of combating the growing threat of synthetic audio content.
The paper introduces a novel cyclostationarity-inspired feature extraction framework for speech deepfake detection (SDD) that leverages spectral correlation density (SCD) to capture periodic statistical structures in speech. The methodology is well-grounded in signal processing theory, addressing the limitations of conventional time-frequency representations. The proposed two-dimensional SCD features are designed to incorporate temporal dynamics, which enhances their discriminative power. The use of multiple countermeasure architectures, including convolutional neural networks and self-supervised learning embeddings, demonstrates a comprehensive approach to evaluating the effectiveness of the proposed features.
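For intuition about the front end, here is a simplified spectral-correlation estimate (an averaged cyclic periodogram over STFT frames); the paper's SCD features are more elaborate and add temporal structuring on top of this idea, so the frame length, hop, window, and bin-shift range below are arbitrary illustrative choices.

```python
import numpy as np

def scd_estimate(x, frame_len=256, hop=128, max_shift=32):
    # Frame the signal, take the STFT, and average products of frequency-shifted bins,
    # yielding a (cyclic-frequency shift x spectral frequency) correlation map.
    frames = np.lib.stride_tricks.sliding_window_view(x, frame_len)[::hop]
    window = np.hanning(frame_len)
    spec = np.fft.rfft(frames * window, axis=1)            # (n_frames, n_bins)
    n_bins = spec.shape[1]
    scd = np.zeros((max_shift, n_bins), dtype=complex)
    for a in range(max_shift):                             # cyclic-frequency (bin-shift) axis
        shifted = np.roll(spec, -a, axis=1)
        scd[a] = (shifted * np.conj(spec)).mean(axis=0)    # correlation between shifted bins
    return np.abs(scd)                                     # magnitude feature map

x = np.random.randn(16000)
print(scd_estimate(x).shape)
```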
The experiments are robust, utilizing three challenging datasets (ASVspoof 2019 LA, ASVspoof 2021 DF, and ASVspoof 5) to validate the proposed features. The results indicate significant improvements in equal error rates when combining SCD features with SSL embeddings, showcasing the complementarity of the approaches. The experimental setup is thorough, with clear metrics for performance evaluation (EER and minDCF), and the results are presented in a manner that highlights the advantages of the proposed methods over existing baselines.
The paper provides sufficient detail regarding the experimental setup, including datasets, feature extraction methods, and model architectures. However, the absence of a publicly available code repository limits the reproducibility of the results. The authors do provide a demo URL for synthesized speech, which is beneficial but does not fully compensate for the lack of code.
One limitation is the reliance on specific datasets, which may not capture the full diversity of speech deepfake scenarios. Additionally, while the results are promising, the paper does not address potential overfitting issues or the generalizability of the models to unseen spoofing techniques. The computational complexity of the SCD feature extraction process may also pose challenges for real-time applications.
The proposed methodology has significant implications for enhancing the security and trustworthiness of voice-driven technologies, particularly in applications like audio forensics and telecommunication security. By improving the detection of speech deepfakes, the research contributes to the broader field of audio signal processing and machine learning, addressing a critical need in the era of advanced synthetic media.
Generative audio requires fine-grained controllable outputs, yet most existing methods either require model retraining for specific controls or rely on inference-time controls (e.g., guidance) that can be computationally demanding. By examining the bottlenecks of existing guidance-based controls, in particular their high cost per step due to decoder backpropagation, we introduce a guidance-based approach combining selective training-free guidance (TFG) with Latent-Control Heads (LatCHs), which enables controlling latent audio diffusion models with low computational overhead. LatCHs operate directly in latent space, avoiding the expensive decoder step and requiring minimal training resources (7M parameters and about 4 hours of training). Experiments with Stable Audio Open demonstrate effective control over intensity, pitch, and beats (and combinations of these) while maintaining generation quality. Our method balances precision and audio fidelity with far lower computational costs than standard end-to-end guidance. Demo examples can be found at https://zacharynovack.github.io/latch/latch.html.
Primary: UC San Diego
All Institutions: UC San Diego
The main contribution of this paper is the introduction of a low-resource, inference-time control framework for latent audio diffusion models, which effectively balances control precision, audio fidelity, and runtime performance. The methodology and results presented are significant advancements in the field of controllable audio generation, showcasing the potential for efficient and high-quality audio synthesis.
The paper introduces a novel approach to controllable audio generation through the use of Latent-Control Heads (LatCHs) and selective Training-Free Guidance (TFG). By operating directly in latent space, the proposed method significantly reduces computational overhead associated with traditional end-to-end guidance methods. The methodology is well-structured, with clear explanations of how LatCHs function and the rationale behind selective TFG. The authors provide a solid theoretical foundation, linking their work to existing literature while clearly delineating their contributions.
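A rough Python sketch of the core idea, guidance computed directly in latent space so the decoder never enters the backpropagation loop, is given below; the control head is a stand-in and the update rule is a generic guided-diffusion step, not the paper's exact LatCH/TFG procedure.

```python
import torch

def latent_guidance_step(z, control_head, target, scale=1.0):
    """One inference-time guidance update computed entirely in latent space:
    a lightweight control head predicts the control signal from the latent z,
    and z is nudged down the gradient of the control loss, so the audio
    decoder is never run inside the guidance loop."""
    z = z.detach().requires_grad_(True)
    loss = torch.nn.functional.mse_loss(control_head(z), target)
    grad, = torch.autograd.grad(loss, z)
    return (z - scale * grad).detach()

# toy usage with a linear layer standing in for a trained control head
head = torch.nn.Linear(64, 1)
z_guided = latent_guidance_step(torch.randn(1, 64), head, torch.zeros(1, 1))
```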
The experiments are comprehensive, utilizing the Stable Audio Open (SAO) dataset and comparing the proposed methods against established baselines, including end-to-end guidance and readouts. The evaluation metrics are well-defined, including both qualitative assessments (mean opinion scores) and quantitative metrics (FDopenl3, KLpasst, and CLAP). The results demonstrate that LatCHs outperform traditional methods in terms of both audio quality and computational efficiency, which is a significant achievement in the field of audio generation.
The paper provides sufficient details regarding the experimental setup, including hyperparameters and training procedures for LatCHs. However, the lack of a publicly available code repository may hinder full reproducibility. The authors do mention the datasets used, which aids in replicating the experiments, but the absence of a project URL limits access to the implementation.
One limitation is the potential challenge in generalizing the method to more complex audio generation tasks beyond the evaluated controls (intensity, pitch, and beats). Additionally, the reliance on specific feature extractors may limit the applicability of the approach to other audio domains. The authors also note that controls with greater variability, such as pitch, pose challenges, indicating room for improvement in handling such cases.
The proposed framework has significant implications for the field of generative audio, particularly in applications requiring real-time audio manipulation and control. The ability to generate high-quality audio with low computational costs can benefit various industries, including music production, gaming, and virtual reality. Furthermore, the approach could pave the way for more accessible audio generation tools for creators without extensive computational resources.
Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition performance and shown effectiveness for AVSR. However, prior approaches project audio and visual features independently or apply shallow fusion, limiting cross-modal alignment and complementary exchange while increasing the LLM's computational load. To address this, we propose AVUR-LLM, an LLM-based framework for Audio-Visual Speech Recognition via Sparse Modality Alignment and Visual Unit-Guided Refinement. Experiments on LRS3 demonstrate state-of-the-art results for AVSR. Under additive-noise conditions at 0 dB SNR, it achieves a 37% relative improvement over the baseline system.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, The Chinese University of Hong Kong, Wuhan University
The main contribution of this paper is the introduction of AVUR-LLM, a novel framework for audio-visual speech recognition that leverages sparse modality alignment and visual unit-guided refinement to achieve state-of-the-art performance in challenging acoustic conditions. This work significantly advances the field of AVSR by addressing key limitations of existing methods and demonstrating the potential for improved robustness and accuracy in speech recognition tasks.
The proposed methodology introduces several innovative components such as Sparse Modality Alignment (SMA), Adaptive Modulated Fusion (AMF), and Visual Unit-Guided Refinement (VUR). SMA allows for a more controlled interaction between audio and visual modalities by inserting alignment blocks into the audio encoder, which is a significant improvement over existing methods that typically rely on shallow fusion. The AMF component intelligently modulates visual feature injection based on acoustic reliability, enhancing the model's adaptability to varying input conditions. The VUR approach effectively transforms visual representations into discrete tokens for LLM rescoring, which is a novel strategy that leverages the strengths of both visual and language models. Overall, the methodology is well-structured and addresses key limitations in prior AVSR systems.
The experiments conducted on the LRS3 dataset demonstrate the effectiveness of the proposed model, achieving state-of-the-art results in various noise conditions. The reported 37% relative improvement in Word Error Rate (WER) under 0 dB SNR conditions is particularly noteworthy, showcasing the robustness of the model in challenging scenarios. The ablation studies provide additional insights into the contributions of each component, reinforcing the validity of the proposed framework. However, the paper could benefit from a more detailed discussion on the statistical significance of the results and comparisons with a broader range of existing methods.
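For readers checking the headline numbers, the sketch below shows how WER and a relative WER reduction are typically computed; the numeric example is illustrative only and is not taken from the paper.

```python
import numpy as np

def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words, divided by reference length."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i, j] = min(d[i - 1, j - 1] + (r[i - 1] != h[j - 1]),  # substitution / match
                          d[i - 1, j] + 1,                            # deletion
                          d[i, j - 1] + 1)                            # insertion
    return d[len(r), len(h)] / max(len(r), 1)

def relative_reduction(wer_baseline, wer_system):
    """Relative WER reduction as usually reported: (baseline - system) / baseline."""
    return (wer_baseline - wer_system) / wer_baseline

# illustrative numbers only: dropping from 30.0% to 18.9% WER is a ~37% relative reduction
print(round(100 * relative_reduction(0.300, 0.189), 1))   # -> 37.0
```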
The paper provides a comprehensive overview of the experimental setup, including details on the dataset, model architecture, training procedures, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider making the implementation accessible to facilitate validation by the research community.
One limitation of the study is the reliance on a single dataset (LRS3) for evaluation, which may not fully capture the generalizability of the model across different domains or languages. Additionally, while the method shows improvements in noise robustness, the paper does not explore the performance in extremely adverse conditions or with diverse accents and speech patterns. The computational efficiency of the proposed model, particularly in real-time applications, is also not thoroughly addressed.
The advancements in AVSR presented in this paper have significant implications for various applications, including assistive technologies for the hearing impaired, video conferencing systems, and automated transcription services. By enhancing the robustness of speech recognition in noisy environments, this research contributes to making communication technologies more accessible and effective.
Training-free anomalous sound detection (ASD) based on pre-trained audio embedding models has recently garnered significant attention, as it enables the detection of anomalous sounds using only normal reference data while offering improved robustness under domain shifts. However, existing embedding-based approaches almost exclusively rely on temporal mean pooling, while alternative pooling strategies have so far only been explored for spectrogram-based representations. Consequently, the role of temporal pooling in training-free ASD with pre-trained embeddings remains insufficiently understood. In this paper, we present a systematic evaluation of temporal pooling strategies across multiple state-of-the-art audio embedding models. We propose relative deviation pooling (RDP), an adaptive pooling method that emphasizes informative temporal deviations, and introduce a hybrid pooling strategy that combines RDP with generalized mean pooling. Experiments on five benchmark datasets demonstrate that the proposed methods consistently outperform mean pooling and achieve state-of-the-art performance for training-free ASD, including results that surpass all previously reported trained systems and ensembles on the DCASE2025 ASD dataset.
Primary: Aalborg University
All Institutions: Aalborg University, Pioneer Centre for Artificial Intelligence
The paper presents a novel exploration of temporal pooling strategies in training-free anomalous sound detection, significantly advancing the understanding of this critical component in audio processing pipelines. The systematic evaluation and introduction of innovative pooling methods contribute valuable insights and methodologies that can influence future research and applications in the field.
The paper introduces relative deviation pooling (RDP) and a hybrid pooling strategy that combines RDP with generalized mean pooling (GEM). This approach emphasizes informative temporal deviations, addressing a significant gap in the current understanding of temporal pooling in training-free anomalous sound detection (ASD). The systematic evaluation of various pooling strategies across multiple state-of-the-art audio embedding models is a strong methodological contribution, as it not only highlights the importance of pooling mechanisms but also provides a framework for future research in this area.
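A minimal Python sketch of the pooling operations under discussion follows; GEM is standard, while the RDP formula shown is only a plausible reading of "relative deviation pooling" and should be treated as an assumption, not the authors' exact definition.

```python
import numpy as np

def gem_pool(E, p=3.0, eps=1e-8):
    """Generalized mean (GEM) pooling over time: (mean_t |e_t|^p)^(1/p)."""
    return (np.mean(np.abs(E) ** p, axis=0) + eps) ** (1.0 / p)

def rdp_pool(E, eps=1e-8):
    """Assumed form of relative deviation pooling: average absolute deviation
    of each dimension from its temporal mean, normalized by that mean's
    magnitude, so informative temporal fluctuations are emphasized."""
    mu = E.mean(axis=0)
    return np.abs(E - mu).mean(axis=0) / (np.abs(mu) + eps)

def hybrid_pool(E, p=3.0):
    """Hybrid pooling: concatenate GEM and RDP statistics into one clip embedding."""
    return np.concatenate([gem_pool(E, p), rdp_pool(E)])

# E: (n_frames, dim) frame-level embeddings from a pre-trained audio model
clip_embedding = hybrid_pool(np.random.randn(200, 768))   # shape (1536,)
```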
The experiments are conducted on five benchmark datasets, demonstrating that the proposed methods consistently outperform traditional mean pooling and achieve state-of-the-art performance for training-free ASD. The results are rigorously analyzed, showing significant improvements over existing methods, including previously reported trained systems. The paper includes comprehensive comparisons and ablation studies, validating the effectiveness of the proposed pooling strategies.
The paper provides detailed descriptions of the datasets, experimental setup, and evaluation metrics, which enhances reproducibility. However, the absence of publicly available code or demo URLs limits the ability for others to directly replicate the findings. The authors mention the use of specific hyperparameters but do not provide a repository for the implementation, which could be a barrier for reproducibility.
One limitation is the reliance on pre-trained audio embedding models, which may not generalize well to all types of anomalous sounds. Additionally, while the proposed pooling strategies show significant improvements, the paper does not explore the potential of integrating these methods into supervised or semi-supervised frameworks, which could further enhance performance. The focus on training-free methods may also limit applicability in scenarios where labeled data is available.
The findings have significant implications for real-world applications in anomaly detection, particularly in industrial settings where rapid deployment and robustness to domain shifts are critical. The proposed methods could lead to more effective monitoring systems for machinery and environmental sounds, potentially reducing downtime and improving safety. The emphasis on training-free approaches also opens avenues for applications in resource-constrained environments.
Whispered-to-normal (W2N) speech conversion aims to reconstruct missing phonation from whispered input while preserving content and speaker identity. This task is challenging due to temporal misalignment between whisper and voiced recordings and the lack of paired data. We propose FlowW2N, a conditional flow matching approach that trains exclusively on synthetic, time-aligned whisper-normal pairs and conditions on domain-invariant features. We exploit high-level ASR embeddings that exhibit strong invariance between synthetic and real whispered speech, enabling generalization to real whispers despite never observing them during training. We verify this invariance across ASR layers and propose a selection criterion optimizing content informativeness and cross-domain invariance. Our method achieves state-of-the-art intelligibility on the CHAINS and wTIMIT datasets, reducing Word Error Rate by 26-46% relative to prior work while using only 10 steps at inference and requiring no real paired data.
Primary: Samsung R&D Institute UK (SRUK)
All Institutions: Samsung R&D Institute UK (SRUK), Mobile eXperience Business, Republic of Korea
The main contribution of this paper is the introduction of FlowW2N, a novel approach for whispered-to-normal speech conversion that achieves state-of-the-art performance by leveraging synthetic data and domain-invariant features. This work represents a meaningful advancement in the field of audio processing and speech synthesis, addressing critical challenges in speech intelligibility and quality.
The proposed FlowW2N method introduces a novel conditional flow matching approach that effectively addresses the challenges of whispered-to-normal speech conversion, particularly the temporal misalignment and lack of paired data. By leveraging synthetic data and domain-invariant ASR embeddings, the authors successfully sidestep traditional alignment issues, which is a significant advancement in the field. The architecture employs a Diffusion Transformer and utilizes a Gaussian prior for generation, which is innovative and well-justified. The methodology is clearly articulated, with a systematic exploration of different conditioning mechanisms and layer selection criteria that enhance the model's performance.
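For context, conditional flow matching in its common rectified-flow form reduces to the simple training construction sketched below; the tensor shapes and conditioning are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_example(x1, cond):
    """One conditional flow-matching training example in its common
    rectified-flow form: draw a Gaussian prior sample x0 and a time t,
    form the straight-line interpolant x_t, and use the constant velocity
    (x1 - x0) as the regression target for v_theta(x_t, t, cond)."""
    x0 = rng.standard_normal(x1.shape)     # sample from the Gaussian prior
    t = rng.uniform()                      # t ~ U(0, 1)
    xt = (1.0 - t) * x0 + t * x1           # interpolated state
    v_target = x1 - x0                     # velocity target
    return xt, t, cond, v_target

# x1: time-aligned normal-speech acoustic frames; cond: ASR-layer embeddings
xt, t, cond, v_target = cfm_training_example(
    rng.standard_normal((100, 80)), rng.standard_normal((100, 1024)))
# the training loss would be || v_theta(xt, t, cond) - v_target ||^2 averaged over the batch
```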
The experiments are comprehensive, utilizing two well-established datasets (CHAINS and wTIMIT) to evaluate the model's performance. The results demonstrate a significant reduction in Word Error Rate (WER) compared to prior methods, achieving state-of-the-art intelligibility. The paper includes ablation studies that provide insights into the contributions of various components of the model, reinforcing the robustness of the findings. The evaluation metrics are appropriate and well-defined, ensuring that the results are credible and reproducible.
While the paper provides a detailed description of the methodology and experimental setup, it lacks a publicly available code repository or demo URL, which hinders reproducibility. The authors mention using internal generative AI tools for language refinement, but there is no indication of whether the model or data will be made available for further research.
One limitation is the reliance on synthetic data for training, which may not fully capture the complexities of real-world whispered speech. Additionally, while the model shows impressive performance on the evaluated datasets, its generalizability to other languages or dialects is not addressed. The absence of a demo or code repository also limits the accessibility of the research for further validation by the community.
The implications of this research are significant, particularly in applications involving speech recognition and synthesis for individuals with speech impairments or in noisy environments. The ability to convert whispered speech to normal speech could enhance communication for those who rely on whispering due to various reasons, thus broadening accessibility in technology.
Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion-perception tradeoff, and data curation remain unresolved. In this work, we systematically address these three overlooked problems. First, we revisit the conventional practice of using early-reflected speech as the dereverberation target and show that it can degrade perceptual quality and downstream ASR performance. We instead demonstrate that time-shifted anechoic clean speech provides a superior learning target. Second, guided by the distortion-perception tradeoff theory, we propose a simple two-stage framework that achieves minimal distortion under a given level of perceptual quality. Third, we analyze the trade-off between training data scale and quality for USE, revealing that training on large uncurated corpora imposes a performance ceiling, as models struggle to remove subtle artifacts. Our method achieves state-of-the-art performance on the URGENT 2025 non-blind test set and exhibits strong language-agnostic generalization, making it effective for improving TTS training data. Code and models will be released upon acceptance.
Primary: NVIDIA
All Institutions: NVIDIA, Academia Sinica, Taipei, Taiwan
The main contribution of this paper is the introduction of a novel approach to Universal Speech Enhancement that significantly improves speech quality and ASR performance by rethinking training targets and leveraging a two-stage model framework. This work addresses critical gaps in the field and provides a solid foundation for future research and applications in speech processing.
The paper presents a systematic approach to Universal Speech Enhancement (USE) by addressing three critical challenges: training target selection, the distortion-perception tradeoff, and data quality. The authors propose using time-shifted anechoic clean speech as a learning target, which is shown to outperform conventional early-reflected speech. They also introduce a two-stage framework that combines regression and generative models to balance fidelity and perceptual quality effectively. This methodology is well-grounded in theoretical principles and is supported by empirical evidence.
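A minimal sketch of how a time-shifted anechoic target can be constructed is shown below, assuming the room impulse response is available and its largest tap marks the direct path; the paper's own alignment procedure may differ.

```python
import numpy as np

def time_shifted_target(anechoic, rir):
    """Delay the anechoic clean signal by the direct-path arrival of the room
    impulse response, so the enhancement target is time-aligned with the
    reverberant input without including any early reflections."""
    delay = int(np.argmax(np.abs(rir)))                       # direct-path tap
    return np.concatenate([np.zeros(delay), anechoic])[:len(anechoic)]

# usage with a toy impulse response whose direct path arrives after 40 samples
rir = np.zeros(2000)
rir[40] = 1.0
rir[300:600] = 0.05
target = time_shifted_target(np.random.randn(16000), rir)
```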
The experiments are comprehensive, utilizing the URGENT 2025 Challenge dataset, which includes diverse speech distortions and languages. The authors provide detailed results that demonstrate significant improvements in both perceptual quality and automatic speech recognition (ASR) performance. The evaluation metrics are robust, covering both intrusive and non-intrusive measures, which strengthens the validity of their findings.
The authors commit to releasing their code and models upon acceptance, which is a positive step towards reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, such as hyperparameters and specific training procedures, to facilitate replication.
One notable limitation is the reliance on the URGENT 2025 Challenge dataset, which may not fully represent real-world conditions. Additionally, while the proposed method shows improvements, the paper does not extensively discuss scenarios where the model may fail or the potential for overfitting to the training data.
The advancements in speech enhancement have significant implications for various applications, including telecommunications, assistive technologies, and improving the quality of training data for text-to-speech systems. The language-agnostic nature of the proposed method could also benefit low-resource languages, enhancing accessibility and communication.
This paper presents a simulation-based approach to own voice detection (OVD) in hearing aids using a single microphone. While OVD can significantly improve user comfort and speech intelligibility, existing solutions often rely on multiple microphones or additional sensors, increasing device complexity and cost. To enable ML-based OVD without requiring costly transfer-function measurements, we propose a data augmentation strategy based on simulated acoustic transfer functions (ATFs) that expose the model to a wide range of spatial propagation conditions. A transformer-based classifier is first trained on analytically generated ATFs and then progressively fine-tuned using numerically simulated ATFs, transitioning from a rigid-sphere model to a detailed head-and-torso representation. This hierarchical adaptation enabled the model to refine its spatial understanding while maintaining generalization. Experimental results show 95.52% accuracy on simulated head-and-torso test data. Under short-duration conditions, the model maintained 90.02% accuracy with one-second utterances. On real hearing aid recordings, the model achieved 80% accuracy without fine-tuning, aided by lightweight test-time feature compensation. This highlights the model's ability to generalize from simulated to real-world conditions, demonstrating practical viability and pointing toward a promising direction for future hearing aid design.
Primary: Victoria University of Wellington
All Institutions: Victoria University of Wellington, GN ReSound
The main contribution of this work is the introduction of a simulation-based framework for single microphone own voice detection in hearing aids, which effectively utilizes simulated acoustic transfer functions to enhance model training and generalization. This innovative approach not only addresses existing challenges in OVD but also sets a promising direction for future advancements in hearing aid technology.
The paper proposes a novel approach to own voice detection (OVD) in hearing aids using a single microphone by leveraging simulated acoustic transfer functions (ATFs) for data augmentation. The methodology is well-structured, involving a two-stage simulation-based ATF generation pipeline that transitions from a rigid-sphere model to a detailed head-and-torso representation. The use of a transformer-based classifier enhances the model's ability to learn from spatial propagation cues, which is a significant advancement over traditional methods that rely on multiple microphones or complex signal processing techniques. The hierarchical adaptation strategy employed to progressively fine-tune the model is a commendable aspect, allowing for improved generalization from simulated to real-world conditions.
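The augmentation idea can be illustrated with a short Python sketch, assuming each simulated ATF is available as an impulse response; the convolution-plus-noise recipe and SNR handling here are generic and not necessarily the authors' exact pipeline.

```python
import numpy as np

def augment_with_atf(speech, atf_ir, snr_db=20.0):
    """Spatial augmentation sketch: filter clean speech with a simulated
    acoustic transfer function (as an impulse response), then add noise at
    the requested SNR to form one training example."""
    wet = np.convolve(speech, atf_ir)[:len(speech)]
    noise = np.random.randn(len(wet))
    # scale the noise so that wet.var() / noise.var() matches the target SNR
    noise *= np.sqrt(wet.var() / (10 ** (snr_db / 10.0) * noise.var() + 1e-12))
    return wet + noise

# toy ATF: a direct path plus one weak delayed reflection
example = augment_with_atf(np.random.randn(16000), np.r_[1.0, np.zeros(63), 0.3])
```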
The experimental results demonstrate high accuracy rates, achieving 95.52% on simulated head-and-torso test data and 80% on real hearing aid recordings without fine-tuning. The use of diverse datasets, including VoxCeleb1 and LibriSpeech, alongside real-world recordings, adds robustness to the evaluation. The model's performance under varying noise conditions was also assessed, showcasing its resilience, which is crucial for practical applications in hearing aids.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific URLs or repositories for code and data, which would enhance reproducibility. The absence of a demo or project URL limits the ability for other researchers to replicate the findings directly. However, the comprehensive description of the data augmentation process and model training strategies offers a solid foundation for future implementations.
One limitation is the reliance on simulated data for training, which may not fully capture the complexities of real-world acoustic environments. The model's performance on real recordings, while promising, may still be affected by factors not accounted for in the simulations. Additionally, the study focuses on offline segment-level detection, leaving out considerations for real-time applications, which are critical in hearing aid technology.
The proposed method has significant implications for the design of hearing aids, particularly in enhancing user comfort and speech intelligibility without increasing device complexity or cost. By enabling effective OVD with a single microphone, this research could lead to more accessible hearing aid solutions for individuals with hearing impairments, potentially improving their quality of life.
Early and accessible detection of Alzheimer's disease (AD) remains a major challenge, as current diagnostic methods often rely on costly and invasive biomarkers. Speech and language analysis has emerged as a promising non-invasive and scalable approach to detecting cognitive impairment, but research in this area is hindered by the lack of publicly available datasets, especially for languages other than English. This paper introduces the PARLO Dementia Corpus (PDC), a new multi-center, clinically validated German resource for AD collected across nine academic memory clinics in Germany. The dataset comprises speech recordings from individuals with AD-related mild cognitive impairment and mild to moderate dementia, as well as cognitively healthy controls. Speech was elicited using a standardized test battery of eight neuropsychological tasks, including confrontation naming, verbal fluency, word repetition, picture description, story reading, and recall tasks. In addition to audio recordings, the dataset includes manually verified transcriptions and detailed demographic, clinical, and biomarker metadata. Baseline experiments on ASR benchmarking, automated test evaluation, and LLM-based classification illustrate the feasibility of automatic, speech-based cognitive assessment and highlight the diagnostic value of recall-driven speech production. The PDC thus establishes the first publicly available German benchmark for multi-modal and cross-lingual research on neurodegenerative diseases.
Primary: PARLO Institute for Research and Teaching in Speech Therapy
All Institutions: PARLO Institute for Research and Teaching in Speech Therapy
The main contribution of this paper is the introduction of the PARLO Dementia Corpus, a clinically validated German resource for Alzheimer's disease research, which addresses a critical gap in available datasets for non-English languages. This work provides a comprehensive framework for future studies in speech-based cognitive assessment and establishes a benchmark for multi-modal research in neurodegenerative diseases.
The methodology presented in the paper is robust, involving the collection of a diverse dataset from multiple centers, which enhances the generalizability of the findings. The use of a standardized test battery for eliciting speech data is a significant strength, as it allows for systematic comparisons across different cognitive tasks. The detailed transcription process and the inclusion of demographic and clinical metadata further enrich the dataset, making it a valuable resource for future research. The integration of automatic speech recognition (ASR) systems and large language models (LLMs) for cognitive assessment demonstrates a forward-thinking approach, leveraging current advancements in AI.
The experiments conducted provide a solid foundation for evaluating the utility of the PARLO Dementia Corpus. The ASR benchmarking results are particularly noteworthy, showing a clear correlation between cognitive status and transcription accuracy. The automatic test evaluation results validate the effectiveness of the proposed scoring methods, achieving high correlation coefficients with human evaluations. The LLM-based classification experiments illustrate the potential for automated cognitive assessment, although the zero-shot classification approach may benefit from further refinement and validation.
The paper outlines a clear methodology for data collection, transcription, and experimental setup, which supports reproducibility. However, the lack of publicly available code or a project URL limits the ease with which other researchers can replicate the experiments. Providing access to the dataset and a detailed description of the experimental setup would enhance reproducibility.
One limitation of the study is the relatively small sample size of 208 participants, which may affect the statistical power of the findings. Additionally, while the dataset is a significant step forward for German-language research, it may not fully capture the diversity of speech patterns across different demographics or regions within Germany. The reliance on ASR systems also introduces potential biases, as these systems may struggle with disordered speech typical of dementia patients.
The PARLO Dementia Corpus has the potential to significantly impact the field of cognitive impairment research, particularly in non-English speaking populations. It opens avenues for the development of automated screening tools that could facilitate early detection of Alzheimer's disease, ultimately improving patient outcomes. The dataset's compatibility with existing English-language resources enhances its utility for cross-lingual research, promoting a more inclusive approach to cognitive health studies.
Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns around fairness and generalizability. This study presents a systematic bias analysis of acoustic-based CI and depression classification using the DementiaBank Pitt Corpus. We compare traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0 (W2V2), and evaluate classification performance across gender, age, and depression-status subgroups. For CI detection, higher-layer W2V2 embeddings outperform baseline features (UAR up to 80.6%), but exhibit performance disparities; specifically, females and younger participants demonstrate lower discriminative power (AUC: 0.769 and 0.746, respectively) and substantial specificity disparities (Δ_spec up to 18% and 15%, respectively), leading to a higher risk of misclassifications than their counterparts. These disparities reflect representational biases, defined as systematic differences in model performance across demographic or clinical subgroups. Depression detection within CI subjects yields lower overall performance, with mild improvements from low and mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited, indicating that each task depends on distinct representations. These findings emphasize the need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, particularly in light of demographic and clinical heterogeneity in real-world applications.
Primary: University of Antioquia
All Institutions: University of Antioquia, Technische Hochschule Nürnberg, Friedrich-Alexander Universität Erlangen-Nürnberg
This study systematically investigates bias in self-supervised acoustic representations for cognitive impairment detection, revealing significant performance disparities across demographic and clinical subgroups. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights into the fairness and reliability of machine learning models in clinical applications.
The methodology is robust, employing a systematic bias analysis of acoustic representations for cognitive impairment detection. The comparison of traditional acoustic features with self-supervised embeddings from Wav2Vec 2.0 is well-structured, and the evaluation across demographic and clinical subgroups adds significant depth to the analysis. The use of multiple classifiers and detailed bias metrics enhances the rigor of the methodology.
The experiments are comprehensive, utilizing a well-defined dataset (DementiaBank Pitt Corpus) and addressing class imbalance through various balancing strategies. The results are clearly presented, demonstrating the performance of different acoustic features and classifiers, with a focus on subgroup-specific metrics. However, the performance for depression detection is notably lower, which raises questions about the model's effectiveness in this area.
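For clarity on the bias metrics, the sketch below computes per-subgroup specificity and a specificity gap (Δ_spec); the variable names and the gap definition (max minus min across subgroups) are assumptions for illustration.

```python
import numpy as np

def specificity(y_true, y_pred):
    """Specificity: fraction of true negatives among all actual negatives."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    neg = y_true == 0
    return ((y_pred == 0) & neg).sum() / max(neg.sum(), 1)

def specificity_gap(y_true, y_pred, group):
    """Delta_spec: spread of specificity across demographic subgroups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    per_group = [specificity(y_true[group == g], y_pred[group == g])
                 for g in np.unique(group)]
    return max(per_group) - min(per_group)

# toy usage: labels, predictions, and a gender attribute per speaker
print(specificity_gap([0, 0, 1, 0, 1, 0], [0, 1, 1, 0, 1, 0],
                      ["f", "f", "f", "m", "m", "m"]))   # -> 0.5
```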
The paper provides a GitHub repository with source code and audio filenames, which supports reproducibility. However, the raw audio recordings cannot be shared due to restrictions, which may limit the ability of other researchers to fully replicate the study.
The study is limited by its reliance on a single dataset, which may not capture the full demographic and linguistic diversity of clinical populations. Additionally, the small number of depression-labeled samples in the dataset may affect the robustness of the conclusions regarding depression classification.
The findings highlight critical issues related to bias and fairness in machine learning applications in healthcare, particularly in speech-based diagnostics for cognitive impairment. The work underscores the importance of fairness-aware evaluation protocols, which could influence future research and clinical practices in AI-driven healthcare solutions.
Early and accessible detection of Alzheimer's disease (AD) remains a major challenge, as current diagnostic methods often rely on costly and invasive biomarkers. Speech and language analysis has emerged as a promising non-invasive and scalable approach to detecting cognitive impairment, but research in this area is hindered by the lack of publicly available datasets, especially for languages other than English. This paper introduces the PARLO Dementia Corpus (PDC), a new multi-center, clinically validated German resource for AD collected across nine academic memory clinics in Germany. The dataset comprises speech recordings from individuals with AD-related mild cognitive impairment and mild to moderate dementia, as well as cognitively healthy controls. Speech was elicited using a standardized test battery of eight neuropsychological tasks, including confrontation naming, verbal fluency, word repetition, picture description, story reading, and recall tasks. In addition to audio recordings, the dataset includes manually verified transcriptions and detailed demographic, clinical, and biomarker metadata. Baseline experiments on ASR benchmarking, automated test evaluation, and LLM-based classification illustrate the feasibility of automatic, speech-based cognitive assessment and highlight the diagnostic value of recall-driven speech production. The PDC thus establishes the first publicly available German benchmark for multi-modal and cross-lingual research on neurodegenerative diseases.
Primary: PARLO Institute for Research and Teaching in Speech Therapy
All Institutions: PARLO Institute for Research and Teaching in Speech Therapy
The paper introduces the PARLO Dementia Corpus, a pioneering resource for Alzheimer's disease research in German, enabling innovative approaches to cognitive assessment through speech analysis. The comprehensive dataset and its validation through rigorous experiments position it as a valuable contribution to the fields of speech technology and clinical neuroscience.
The methodology is robust, involving a multi-center design that ensures diversity in the dataset. The use of standardized neuropsychological tasks for speech elicitation is a significant strength, as it allows for a comprehensive assessment of cognitive function. The detailed transcription process enhances the dataset's utility for various analyses. However, the reliance on a specific demographic (German-speaking individuals) may limit generalizability to other languages and cultures.
The experiments conducted, including ASR benchmarking and LLM-based classification, are well-structured and demonstrate the dataset's applicability. The results indicate a clear correlation between automatic evaluations and human assessments, validating the dataset's potential for clinical applications. The choice of models and the evaluation metrics used are appropriate, though further exploration of different ASR systems could enhance understanding of performance across varied conditions.
The paper provides sufficient detail regarding the data collection, transcription, and experimental setup, which supports reproducibility. However, the absence of publicly accessible code or a demo for the models used limits the ease with which other researchers can replicate the findings.
The study's limitations include the potential biases inherent in a multi-center study, such as variability in participant recruitment and testing conditions. Additionally, while the dataset is comprehensive, it may not capture the full spectrum of cognitive impairment across different languages or cultural contexts. The focus on German may restrict broader applicability.
The PARLO Dementia Corpus has significant implications for both clinical and research applications in the field of cognitive impairment. By providing a publicly available dataset, it facilitates advancements in automatic speech recognition and cognitive assessment tools, potentially leading to earlier detection and better management of Alzheimer's disease. The corpus also sets a precedent for future multilingual studies in speech analysis related to cognitive health.
Building speech deepfake detection models that are generalizable to unseen attacks remains a challenging problem. Although the field has shifted toward a pre-training and fine-tuning paradigm using speech foundation models, most approaches rely solely on supervised fine-tuning (SFT). Inspired by the field of large language models, wherein reinforcement learning (RL) is used for model fine-tuning, we investigate the impact of RL, specifically Group Relative Policy Optimization (GRPO). The results from experiments using multiple detectors and test sets indicate that pure GRPO-based fine-tuning improves performance on out-of-domain test sets while maintaining performance on target-domain test data. This approach outperforms both SFT-only and hybrid setups. Our ablation studies further suggest that the negative reward in GRPO may be a key factor in this improvement.
Primary: National Institute of Informatics
All Institutions: National Institute of Informatics
The main contribution of this paper is the introduction of a reinforcement learning-based fine-tuning approach for speech deepfake detection, demonstrating improved generalization capabilities compared to traditional supervised methods. This work significantly advances the understanding of model training in the context of speech deepfake detection and opens avenues for future research in applying reinforcement learning techniques to other domains within machine learning.
The methodology employed in this paper is robust, leveraging a novel approach of applying Group Relative Policy Optimization (GRPO) for fine-tuning speech deepfake detection models. The authors effectively draw parallels between the fine-tuning processes in speech and large language models, providing a clear rationale for their choice of GRPO over traditional supervised fine-tuning (SFT). The detailed description of the training paradigm, including the formulation of the loss functions and the experimental setup, demonstrates a comprehensive understanding of the problem space. However, the paper could benefit from clearer visual aids and more explicit definitions of the terms used in the equations, which may enhance reader comprehension.
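The group-relative part of GRPO is easy to illustrate: rewards from a group of rollouts for the same input are standardized within the group, so below-average rollouts (for instance, misclassifications receiving a negative reward) obtain negative advantages. The reward values in this Python sketch are toy numbers, not the paper's reward design.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group,
    so rollouts below the group mean get negative advantages."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# toy group of four rollouts for one utterance: three correct, one incorrect
print(grpo_advantages([1.0, 1.0, -1.0, 1.0]))
```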
The experimental evaluation is thorough, utilizing multiple detectors and diverse test sets to assess the performance of GRPO against SFT and hybrid setups. The results are well-presented, showing clear improvements in out-of-domain generalization without sacrificing in-domain performance. The use of ablation studies to isolate the effects of the negative reward and regularization term adds depth to the analysis. However, the paper lacks a comparative analysis with other state-of-the-art methods in speech deepfake detection, which could contextualize the results more effectively.
The paper outlines the experimental setup and model configurations in detail, which is essential for reproducibility. However, the lack of publicly available code or a clear project URL limits the ability for other researchers to replicate the findings. Providing a GitHub repository with the implementation would significantly enhance reproducibility and foster further research in this area.
One limitation is the absence of a comparative analysis with other advanced techniques in the field of speech deepfake detection, which could provide a more comprehensive understanding of the GRPO approach's relative performance. Additionally, the paper does not address the potential computational overhead introduced by the GRPO fine-tuning process compared to SFT, which could be a consideration for practical applications.
The findings of this research have significant implications for the field of speech deepfake detection, particularly in enhancing the robustness of models against unseen attacks. As deepfake technology continues to evolve, improving detection methods is crucial for maintaining the integrity of audio content across various applications, including media, security, and communication. The insights gained from this study could inspire further innovations in model training paradigms and contribute to the development of more resilient AI systems.
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
Primary: University of Michigan
All Institutions: University of Michigan
The paper presents Spoof-SUPERB, a benchmark for evaluating SSL models in audio deepfake detection, filling a critical gap in the literature. The technical contributions are significant, providing a systematic framework for assessing model performance and robustness, which is essential for advancing the field of speech processing in the context of security.
The paper introduces a novel benchmarking framework, Spoof-SUPERB, specifically designed for evaluating self-supervised learning (SSL) models in the context of audio deepfake detection. The methodology is well-structured, utilizing a unified protocol for training and evaluation across multiple datasets, which enhances comparability. The choice of models and the systematic evaluation of their performance under various conditions, including acoustic degradations, is a significant strength. However, the paper could benefit from a more detailed discussion of the specific training and evaluation protocols used, as well as the rationale behind the selection of datasets.
The experiments are comprehensive, involving 20 different SSL models evaluated on multiple datasets, which provides a robust analysis of model performance. The results clearly demonstrate the superiority of large-scale discriminative models over generative ones, particularly in terms of resilience to noise and other acoustic degradations. The use of Equal Error Rate (EER) as a performance metric is appropriate for the task, although additional metrics could provide a more nuanced view of model performance.
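For reference, a minimal Python implementation of EER on a set of detection scores is sketched below; the score convention (higher means bona fide) is an assumption for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: operating point where false-acceptance and false-rejection rates meet.
    scores: higher = more 'bona fide'; labels: 1 = bona fide, 0 = spoof."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

print(equal_error_rate([0.9, 0.8, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0]))
```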
The paper emphasizes reproducibility by establishing a fixed training setup and evaluation protocol, which is crucial for benchmarking in machine learning. However, the absence of a publicly accessible code repository or detailed implementation guidelines limits the ability of other researchers to reproduce the results. Providing such resources would significantly enhance the paper's impact.
One limitation is the potential overlap between the pretraining data of some models and the evaluation datasets, which could bias the results. Additionally, while the paper addresses robustness under various acoustic conditions, it does not explore the implications of different synthesis methods for audio deepfakes, which could be a critical area for future research.
The introduction of a standardized benchmark for audio deepfake detection is a timely contribution, given the increasing prevalence of deepfake technologies and their implications for security and trust in audio communications. This work could pave the way for further advancements in antispoofing techniques and the development of more secure speech processing systems.
A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level and time differences (ITD/ILD). While Head-Related Transfer Function (HRTF) models based on rigid scattering suffice for terrestrial humans, they fail in underwater environments due to the near-impedance match between water and soft tissue. Motivated by the acoustic anatomy of underwater animals, we introduce a novel, analytically derived, closed-form forward model for scattering from a semi-transparent sphere containing two rigid spherical scatterers. This model accurately maps source direction, frequency, and material properties to the pressure field, capturing the complex physics of layered, penetrable structures. Critically, our model is implemented in a fully differentiable setting, enabling its integration with a machine learning algorithm to optimize a cost function for active localization. We demonstrate enhanced convergence for localization under noise using a physics-informed frequency weighting scheme, and present accurate moving-source tracking via an Extended Kalman Filter (EKF) with analytically computed Jacobians. Our work suggests that differentiable models of scattering from layered rigid and transparent geometries offer a promising new foundation for microphone arrays that leverage scattering-based spatial cues over conventional beamforming, applicable to both terrestrial and underwater applications. Our model will be made open source.
Primary: University of Maryland
All Institutions: University of Maryland, SDU | All: & Reality Lab
The paper introduces a novel differentiable multi-sphere scattering model for underwater spatial audio cues, bridging the gap between biological principles and machine learning applications. The comprehensive methodology and robust experimental validation underscore its potential impact on acoustic sensing technologies.
The paper presents a novel analytical framework for modeling sound scattering in underwater environments, utilizing a differentiable multi-sphere scattering model. The approach is grounded in biological principles of spatial hearing and employs multipole expansions to derive a closed-form solution. The implementation in a differentiable programming framework (JAX) allows for efficient gradient-based optimization, which is a significant advancement over traditional methods that do not provide gradients. This differentiability enables the integration of the model with machine learning algorithms for active localization, showcasing a well-thought-out methodology that bridges physics-based modeling and machine learning.
The experiments conducted validate the proposed model through simulations that demonstrate its ability to accurately capture interaural level differences (ILD) and interaural time differences (ITD) under various conditions. The results show that the model effectively generates realistic binaural cues and performs robustly in source localization tasks, even under noise. The use of an Extended Kalman Filter (EKF) for tracking moving sources further emphasizes the practical applicability of the model. The experiments are comprehensive, covering various source directions and noise levels, which strengthens the findings.
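For readers unfamiliar with the tracking component, a generic EKF predict/update step is sketched below; in the paper's setting the measurement function would be the differentiable scattering model and its Jacobian the analytically computed one, but the code here is only a textbook sketch.

```python
import numpy as np

def ekf_step(x, P, z, f, F, h, H, Q, R):
    """One generic EKF predict/update step. f, h are the nonlinear motion and
    measurement functions; F, H return their Jacobians at the given state."""
    x_pred = f(x)                                   # predict state
    P_pred = F(x) @ P @ F(x).T + Q                  # predict covariance
    y = z - h(x_pred)                               # innovation
    S = H(x_pred) @ P_pred @ H(x_pred).T + R        # innovation covariance
    K = P_pred @ H(x_pred).T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H(x_pred)) @ P_pred
    return x_new, P_new

# toy 1-D usage with identity motion and measurement models
ident = lambda v: v
jac = lambda v: np.eye(1)
x, P = ekf_step(np.array([0.0]), np.eye(1), np.array([1.0]),
                ident, jac, ident, jac, 0.01 * np.eye(1), 0.1 * np.eye(1))
```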
While the paper mentions that the model will be made open source, specific details regarding the implementation and access to the code are not provided. This lack of direct access to the code and data limits the reproducibility of the results. However, the detailed methodology and equations presented allow for potential replication by researchers with sufficient expertise in the field.
The primary limitation of the study is the reliance on a simplified geometric model that may not capture all complexities of real-world underwater environments. Additionally, the experiments are conducted in a controlled simulation setting, which may not fully represent the challenges faced in practical applications, such as reverberation and multi-source scenarios. The model's performance in more complex acoustic environments remains to be tested.
This research has significant implications for the development of advanced acoustic sensing systems, particularly in underwater environments where traditional methods struggle. The ability to accurately model sound scattering and utilize spatial cues for localization can enhance various applications, including marine biology research, underwater navigation, and surveillance. The open-source nature of the model could foster further research and development in this area, promoting collaboration and innovation.
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mean Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework's effectiveness in enhancing recognition performance.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFlytek Co., Ltd., Huawei Technology
The main contribution of this paper is the development of the DARS framework, which effectively synthesizes dysarthric speech to enhance automatic speech recognition performance, addressing a critical gap in assistive technology for individuals with speech impairments. The combination of innovative methodologies and rigorous experimental validation positions this work as a significant advancement in the field of speech synthesis and recognition.
The DARS framework introduces innovative mechanisms for synthesizing dysarthric speech, specifically a multi-stage rhythm predictor and a dysarthria-aware conditional flow matching mechanism. The use of contrastive preference optimization to guide the rhythm predictor is particularly novel, as it directly addresses the variability in dysarthric speech patterns. The integration of pause modeling and acoustic style vectors enhances the synthesis quality, making the approach well-suited for the complexities of dysarthric speech.
The paper presents a thorough experimental evaluation using the TORGO dataset, demonstrating the effectiveness of the DARS framework in enhancing ASR performance. The reported results, including a 54.22% relative reduction in WER, indicate significant improvements over existing methods. The experiments are well-structured, comparing multiple training strategies and adaptation techniques, which adds robustness to the findings.
While the paper provides a detailed description of the methodology and experimental setup, the absence of URLs for code or demo pages limits reproducibility. Clearer documentation or supplementary materials would make it easier for others to replicate the results.
The study relies on a limited dataset (TORGO), which may affect the generalizability of the results. Additionally, while the framework shows promise, the performance on more diverse dysarthric speech samples and real-world scenarios remains to be validated.
The DARS framework has the potential to significantly improve communication aids for individuals with dysarthria, enhancing their quality of life. By improving ASR systems' ability to recognize dysarthric speech, this research could facilitate better interaction and accessibility for affected individuals in various settings.
Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it enhances the error tolerance of TTS to ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on Tesla A100, with an average real-time factor (RTF) of 0.71. On the UASpeech dataset, it attains a mean opinion score (MOS) of 4.67 and demonstrates a 54.25% relative reduction in word error rate (WER) compared to the state-of-the-art. Our demo is available at: https://wflrz123.github.io/
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFlytek Co., Ltd.
The paper presents a novel end-to-end simultaneous dysarthric speech reconstruction system that effectively addresses the challenges of intelligibility and latency through innovative methodologies. The technical contributions are significant, with promising experimental results that indicate a meaningful advancement in the field of speech processing for individuals with speech impairments.
The proposed end-to-end simultaneous dysarthric speech reconstruction (E2E-SDSR) system introduces innovative components such as a frame-level adaptor module and a multiple wait-k autoregressive TTS module. The frame-level adaptor effectively bridges the gap between ASR and TTS, enhancing the robustness of the system against ASR errors through explicit-implicit semantic information fusion. The multiple wait-k strategy in the TTS module allows for flexibility in processing, balancing latency and prosody quality. The methodology is well-structured, with a clear focus on addressing the unique challenges posed by dysarthric speech, particularly in terms of intelligibility and naturalness.
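The wait-k policy itself is a standard simultaneous-decoding schedule (read k inputs first, then alternate one write per additional read). The sketch below illustrates only that base schedule, not the paper's multiple wait-k training or its knowledge-distillation setup, and the frame/token counts are made up.

```python
def wait_k_schedule(k, num_source, num_target):
    """Standard wait-k policy: read k source frames first, then alternate
    one write per additional read until the target is finished."""
    actions = []
    read, written = 0, 0
    while written < num_target:
        if read < min(k + written, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions

# e.g. k=3 over 6 source frames and 4 target tokens
print(wait_k_schedule(k=3, num_source=6, num_target=4))
```

The "multiple wait-k" module described above presumably trains across several values of k; sweeping k in this sketch shows how the latency/quality trade-off would shift at inference time.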
The experiments are comprehensive, utilizing both a commercial dysarthric speech dataset and the UASpeech dataset. The reported results, including a mean opinion score (MOS) of 4.67 and a 54.25% reduction in word error rate (WER), demonstrate significant improvements over existing methods. The ablation studies provide valuable insights into the contributions of each component of the proposed system, reinforcing the effectiveness of the adaptor and wait-k strategies.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as hyperparameter settings, training duration, and the exact architecture configurations used. The absence of a publicly available code repository limits the reproducibility of the results.
The study primarily focuses on dysarthric speech and may not generalize well to other speech disorders or languages. Additionally, the reliance on a limited dataset for training and testing could affect the robustness of the model in real-world applications. The paper does not address potential biases in the dataset or the implications of using commercial data.
The proposed system has the potential to significantly improve communication for individuals with dysarthria, enhancing their quality of life and social interactions. By providing a more efficient and intelligible speech reconstruction method, it could be applied in various assistive technologies and communication devices.
This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialogues--up to eight speakers across up to four simultaneous conversations--with a speech overlap rate exceeding 90%. To tackle this, we propose a multimodal cascaded system that leverages per-speaker visual streams extracted from synchronized 360 degree video together with single-channel audio. Our system improves three components of the pipeline by leveraging enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). The AVSR module further incorporates Whisper and LLM techniques to boost transcription accuracy. Our best single cascaded system achieves a Speaker Word Error Rate (WER) of 32.44% on the development set. By further applying ROVER to fuse outputs from diverse front-end and back-end variants, we reduce Speaker WER to 31.40%. Notably, our LLM-based zero-shot conversational clustering achieves a speaker clustering F1 score of 1.0, yielding a final Joint ASR-Clustering Error Rate (JACER) of 15.70%.
Primary: University of Science and Technology of China
All Institutions: Anhui University, Lomonosov Moscow State University, iFLYTEK Research, Shaanxi Normal University, University of Science and Technology of China, iFLYTEK Co
This paper makes a notable contribution to the field of audio-visual speech recognition and clustering by proposing an integrated framework that effectively addresses the challenges posed by overlapping conversations in complex acoustic environments. The technical contributions, particularly the innovative use of LLMs for conversation clustering, position this work as a significant advancement in the domain.
The paper presents a sophisticated multimodal cascaded system that integrates audio-visual data to tackle the complex problem of recognizing and clustering multiple concurrent conversations. The methodology is robust, employing a two-stage transfer learning strategy for Active Speaker Detection (ASD) and a comprehensive approach for Audio-Visual Target Speech Extraction (AVTSE) and Audio-Visual Speech Recognition (AVSR). The incorporation of large language models (LLMs) for conversational clustering is particularly innovative, leveraging semantic understanding to enhance accuracy. The use of diverse datasets for training and the detailed architecture of each system component demonstrate a thorough and well-thought-out methodology.
The experiments are extensive, with a clear focus on evaluating the performance of each component in the pipeline. The results indicate significant improvements over baseline models, particularly in Speaker Word Error Rate (WER) and Joint ASR-Clustering Error Rate (JACER). The paper provides detailed comparisons across different system configurations, showcasing the effectiveness of the proposed methods. However, the absence of a comprehensive ablation study to isolate the contributions of each component limits the depth of the evaluation.
While the paper outlines the methodologies and datasets used, it lacks specific implementation details that would aid in reproducing the results. There are no links to code repositories or supplementary materials, which is a significant drawback for reproducibility in machine learning research.
The paper does not address potential limitations of the proposed systems, such as the computational cost associated with using large language models and the challenges of real-time application in practical scenarios. Additionally, the reliance on extensive datasets may not be feasible for all research groups, limiting the accessibility of the proposed methods.
The work has significant implications for applications in real-world scenarios involving multi-party conversations, such as meetings, conferences, and social interactions. The ability to accurately recognize and cluster overlapping speech can enhance communication technologies, assistive devices, and automated transcription services, contributing to advancements in human-computer interaction.
We introduce VietSuperSpeech, a large-scale Vietnamese automatic speech recognition (ASR) dataset of 52,023 audio-text pairs totaling 267.39 hours, with a distinctive focus on casual conversational speech. Unlike existing Vietnamese ASR corpora that predominantly feature read speech, news narration, or audiobook content, VietSuperSpeech is sourced from four publicly accessible YouTube channels spanning everyday conversation, personal vlogging, overseas Vietnamese community dialogue, and informal commentary - the very speech styles encountered in real-world chatbot, customer support, call center, and hotline deployments. All audio is standardized to 16 kHz mono PCM WAV and segmented into 3-30 second utterances. Transcriptions are generated via pseudo-labeling using the Zipformer-30M-RNNT-6000h model (Nguyen, 2025) deployed through Sherpa-ONNX, pre-trained on 6,000 hours of Vietnamese speech. After quality filtering, the dataset is split into 46,822 training samples (240.67 hours) and 5,201 development/test samples (26.72 hours) with a fixed random seed. The text averages 266 characters per utterance, totaling 13.8 million fully diacritically marked Vietnamese characters. We demonstrate that VietSuperSpeech fills a critical gap in the Vietnamese ASR ecosystem: while corpora such as VLSP2020, VIET_BUD500, VietSpeech, FLEURS, VietMed, Sub-GigaSpeech2-Vi, viVoice, and Sub-PhoAudioBook provide broad coverage of formal and read speech, none specifically targets the casual, spontaneous register indispensable for conversational AI applications. VietSuperSpeech is publicly released at https://huggingface.co/datasets/thanhnew2001/VietSuperSpeech.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of VietSuperSpeech, a large-scale dataset specifically designed for casual conversational speech in Vietnamese, which fills a critical gap in the existing ASR corpus landscape. This dataset's unique focus on informal speech patterns and its potential applications in various conversational AI domains make it a significant resource for advancing ASR technology in low-resource languages.
The methodology of VietSuperSpeech is robust, focusing on the collection of conversational speech from diverse YouTube channels, which is a significant departure from existing datasets that primarily feature formal speech. The use of pseudo-labeling through the Zipformer-30M-RNNT-6000h model is well-justified, and the quality control measures implemented during transcription generation strengthen the dataset's reliability. However, the paper could benefit from a more detailed description of the pseudo-labeling quality assessment process and the specific metrics used to evaluate the performance of the ASR model on this dataset.
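The abstract states the 3-30 second segmentation, but the exact quality-filtering criteria are not described; the sketch below is a hypothetical filtering pass in which the ASR confidence threshold and the field names are assumptions.

```python
def filter_pseudo_labeled_utterances(utterances, min_dur=3.0, max_dur=30.0,
                                     min_confidence=0.85):
    """Keep only utterances within the stated 3-30 s range and above a
    (hypothetical) ASR confidence threshold; drop empty transcripts."""
    kept = []
    for utt in utterances:
        if not (min_dur <= utt["duration_s"] <= max_dur):
            continue
        if utt.get("asr_confidence", 0.0) < min_confidence:
            continue
        if not utt["transcript"].strip():
            continue
        kept.append(utt)
    return kept

sample = [
    {"duration_s": 12.4, "asr_confidence": 0.93, "transcript": "xin chào các bạn"},
    {"duration_s": 1.8,  "asr_confidence": 0.99, "transcript": "ok"},    # too short
    {"duration_s": 8.0,  "asr_confidence": 0.40, "transcript": "..."},   # low confidence
]
print(len(filter_pseudo_labeled_utterances(sample)))  # -> 1
```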
The paper does not provide extensive experimental results demonstrating the effectiveness of the VietSuperSpeech dataset in improving ASR performance in conversational contexts. While the authors discuss the dataset's intended applications and the acoustic properties of the speech it contains, empirical validation through experiments that compare ASR performance on this dataset versus existing corpora would significantly enhance the paper's impact.
The authors have made the dataset publicly available, which is a positive step towards reproducibility. The details regarding the audio preprocessing and pseudo-labeling pipeline are adequately described, allowing other researchers to replicate the dataset creation process. However, the lack of shared experimental results or code for training ASR models on this dataset limits the overall reproducibility of the findings.
The paper acknowledges several limitations, including the potential for pseudo-label noise and the demographic balance of the speaker population. The dataset's reliance on YouTube content may also restrict its representativeness of all conversational registers, particularly in specialized domains. Additionally, the authors note that the dataset may not fully capture the nuances of highly noisy environments typical in call centers.
VietSuperSpeech has significant implications for the development of ASR systems in Vietnamese, particularly for applications in customer support, chatbots, and IVR systems. By addressing the gap in conversational speech datasets, it provides a valuable resource for researchers and practitioners aiming to improve ASR performance in real-world scenarios. The dataset's public availability encourages further research and development in this area, potentially leading to advancements in Vietnamese language technology.
We introduce LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection, comprising 2,732 hours of audio generated with 24 open-source TTS systems across 66 languages, including 45 low-resource languages under our operational definition. To evaluate robustness without requiring target-domain bonafide speech, we benchmark 11 publicly available countermeasures using threshold transfer: for each model we calibrate an EER operating point on pooled external benchmarks and apply the resulting threshold, reporting spoof rejection rate (SRR). Results show model-dependent cross-lingual disparity, with spoof rejection varying markedly across languages even under controlled conditions, highlighting language as an independent source of domain shift in spoof detection. The dataset is publicly available at HuggingFace (https://huggingface.co/datasets/MTUCI/LRLspoof) and ModelScope (https://modelscope.cn/datasets/lab260/LRLspoof).
Primary: Moscow Technical University of Communications and Informatics
All Institutions: Moscow Technical University of Communications and Informatics
The main contribution of this paper is the introduction of LRLspoof, a large-scale multilingual synthetic-speech corpus for cross-lingual spoof detection. This work significantly advances the field by providing a valuable resource for evaluating spoof detection systems across a wide range of languages, particularly those that are often underrepresented in existing research.
The methodology presented in this paper is robust, focusing on the creation of a large-scale multilingual synthetic-speech corpus specifically designed for cross-lingual spoof detection. The authors employed 24 open-source TTS systems to generate 2,732 hours of audio across 66 languages, which is a significant contribution to the field of anti-spoofing research, especially for low-resource languages. The use of threshold transfer for evaluating countermeasures is a clever approach that allows for the assessment of model performance without the need for bonafide speech from the target domain. However, the paper could benefit from a more detailed explanation of the calibration process and the specific metrics used for evaluating the countermeasures.
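As a rough illustration of the threshold-transfer protocol described above (calibrate an EER operating point on pooled external data, then apply that fixed threshold to target-language spoofed audio and report the spoof rejection rate), here is a minimal numpy sketch; the score convention and the synthetic score distributions are assumptions, not the paper's data.

```python
import numpy as np

def eer_threshold(bonafide_scores, spoof_scores):
    """Find the decision threshold where false-acceptance and false-rejection
    rates are (approximately) equal on a pooled calibration benchmark.
    Convention: higher score = more likely bonafide."""
    candidates = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_thr, best_gap = candidates[0], np.inf
    for thr in candidates:
        far = np.mean(spoof_scores >= thr)      # spoof accepted as bonafide
        frr = np.mean(bonafide_scores < thr)    # bonafide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_thr = abs(far - frr), thr
    return best_thr

def spoof_rejection_rate(target_spoof_scores, threshold):
    """SRR: fraction of target-language spoofed utterances scored below the
    transferred threshold (i.e. correctly rejected)."""
    return float(np.mean(target_spoof_scores < threshold))

rng = np.random.default_rng(0)
thr = eer_threshold(rng.normal(2.0, 1.0, 1000), rng.normal(-2.0, 1.0, 1000))
print(spoof_rejection_rate(rng.normal(-1.0, 1.0, 500), thr))
```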
The experiments conducted are comprehensive, benchmarking 11 publicly available countermeasures and reporting on the spoof rejection rate (SRR). The results indicate a model-dependent cross-lingual disparity, which is an important finding that highlights the challenges in spoof detection across different languages. The controlled conditions of the experiments lend credibility to the findings, but the paper could improve by providing more detailed statistical analyses and comparisons between the different models tested.
The dataset is publicly available, which is a significant step towards ensuring reproducibility. However, the paper does not provide sufficient details regarding the implementation of the countermeasures or the specific configurations used during the experiments. Including code or detailed experimental setups would enhance reproducibility and allow other researchers to validate the findings more easily.
One limitation of the study is the reliance on synthetic speech, which may not fully capture the complexities and variabilities present in real-world scenarios. Additionally, the focus on 66 languages, while commendable, may overlook certain dialects or variations within those languages that could affect spoof detection performance. The paper also does not address the potential biases introduced by the TTS systems used for generating the audio samples.
The implications of this research are significant, particularly in the context of increasing reliance on voice-based authentication systems. By addressing spoof detection in low-resource languages, the study opens avenues for improving security measures in diverse linguistic contexts. The findings could influence future research directions and the development of more inclusive anti-spoofing technologies.
We propose TQCodec, a neural audio codec designed for high-bitrate, high-fidelity music streaming. Unlike existing neural codecs that primarily target ultra-low bitrates (<= 16kbps), TQCodec operates at 44.1 kHz and supports bitrates from 32 kbps to 128 kbps, aligning with the standard quality of modern music streaming platforms. The model adopts an encoder-decoder architecture based on SEANet for efficient on-device computation and introduces several enhancements: an imbalanced network design for improved quality with low overhead, SimVQ for mid-frequency detail preservation, and a phase-aware waveform loss. Additionally, we introduce a perception-driven band-wise bit allocation strategy to prioritize perceptually critical lower frequencies. Evaluations on diverse music datasets demonstrate that TQCodec achieves superior audio quality at target bitrates, making it well-suited for high-quality audio applications.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of TQCodec, a neural audio codec tailored for high-bitrate music streaming that incorporates several innovative techniques to enhance audio quality while maintaining computational efficiency. The paper presents a solid foundation for future research in high-fidelity audio codecs, though it requires improvements in methodological clarity and experimental rigor to fully realize its potential impact in the field.
The proposed TQCodec utilizes an encoder-decoder architecture based on SEANet, which is a significant adaptation for high-fidelity music streaming. The enhancements introduced, such as the imbalanced network design, SimVQ for mid-frequency detail preservation, and phase-aware waveform loss, are well thought out and address specific challenges in audio codec design. The perception-driven band-wise bit allocation strategy is particularly innovative, as it prioritizes perceptually critical frequencies, which is crucial in audio processing. However, the methodology could benefit from more detailed explanations of the implementation and the specific advantages of each enhancement over existing methods.
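The paper's actual band-wise allocation rule is not described in this summary, so the following is purely a hypothetical illustration of the general idea of perception-driven allocation that weights lower-frequency bands more heavily; the weighting function, band edges, and bit budget are invented.

```python
import numpy as np

def bandwise_bit_allocation(total_bits, band_edges_hz, low_freq_emphasis=1.0):
    """Toy perceptual allocation: give each band a weight that decays with its
    centre frequency, then split the bit budget proportionally (rounded)."""
    centres = np.array([(lo + hi) / 2 for lo, hi in band_edges_hz])
    weights = 1.0 / (centres ** low_freq_emphasis)
    weights /= weights.sum()
    bits = np.floor(weights * total_bits).astype(int)
    bits[0] += total_bits - bits.sum()   # hand the rounding remainder to the lowest band
    return bits

bands = [(0, 2000), (2000, 6000), (6000, 12000), (12000, 22050)]
print(bandwise_bit_allocation(total_bits=128, band_edges_hz=bands))
```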
The evaluation of TQCodec on diverse music datasets is commendable, as it demonstrates the codec's performance across various conditions. The use of objective metrics like LSD and SNR to validate audio quality is appropriate, though the paper would benefit from comparative subjective listening tests to further substantiate claims of superior audio quality. The results indicate that TQCodec outperforms existing baselines, which is a strong point, but the paper lacks a thorough discussion on the datasets used, including their diversity and representativeness.
The paper does not provide sufficient details on the implementation of TQCodec, which raises concerns about reproducibility. Key aspects such as hyperparameter settings, training procedures, and the specific datasets used for training and evaluation are not adequately described. Including this information would enhance the paper's reproducibility and allow other researchers to build upon the work.
One limitation is the lack of subjective evaluation metrics, which are essential in audio quality assessment. Additionally, while the focus on high-bitrate codecs is valuable, the paper does not address how TQCodec performs at lower bitrates, which could limit its applicability in scenarios with constrained bandwidth. The computational efficiency claims should also be backed by more detailed performance benchmarks.
TQCodec has the potential to significantly impact the field of audio streaming by providing a high-fidelity codec that meets the demands of modern music platforms. Its design is particularly relevant as the industry shifts towards higher quality audio streaming. The advancements in neural audio codecs could also influence related fields, such as music generation and audio synthesis, by providing more efficient and effective tools for audio processing.
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on the depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. Firstly, we find that layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, and we call it Store-Contribute Dissociation (SCD). To turn this insight into an actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou)
The paper presents AG-REPA, a framework that enhances audio generation quality by focusing on causal contributions of layers rather than mere representational richness. This innovative approach not only improves training efficiency but also contributes to the interpretability of generative models, marking a meaningful advancement in the field of machine learning.
The paper introduces a novel methodology called Attribution-Guided REPresentation Alignment (AG-REPA), which emphasizes causal layer selection for representation alignment in audio flow matching. This approach is grounded in the theoretical concept of Store-Contribute Dissociation (SCD), which reveals that layers rich in semantic information do not necessarily contribute most to the generative process. The methodology includes a forward-only gate ablation (FoG-A) to quantify each layer's causal contribution, allowing for adaptive layer selection and weighting. This is a significant advancement over traditional heuristic methods, providing a more principled basis for layer selection in generative models.
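To make the forward-only gate ablation idea concrete: a layer's contribution can be measured by gating that layer off in a single forward pass and recording how much the predicted velocity changes. The toy residual stack below is only a stand-in for the real flow-matching backbone; the dimensions, activations, and the L2 change metric are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NUM_LAYERS = 16, 6
weights = [rng.normal(scale=0.1, size=(DIM, DIM)) for _ in range(NUM_LAYERS)]

def predict_velocity(x, gates):
    """Toy residual network standing in for the backbone:
    gates[i] = 0 ablates layer i's residual branch in the forward pass."""
    h = x
    for g, w in zip(gates, weights):
        h = h + g * np.tanh(h @ w)
    return h  # interpreted as the predicted velocity field

def layer_contributions(x):
    """Forward-only gate ablation: the contribution of layer i is the change in
    the predicted velocity when that layer alone is gated off."""
    baseline = predict_velocity(x, gates=np.ones(NUM_LAYERS))
    scores = []
    for i in range(NUM_LAYERS):
        gates = np.ones(NUM_LAYERS)
        gates[i] = 0.0
        scores.append(np.linalg.norm(baseline - predict_velocity(x, gates)))
    return np.array(scores)

print(layer_contributions(rng.normal(size=DIM)).round(4))
```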
The experiments are robust, utilizing well-established datasets such as LibriSpeech and AudioSet for unified speech and general audio training. The results demonstrate that AG-REPA consistently outperforms baseline methods, achieving significant reductions in Fréchet Audio Distance (FAD) and improvements in perceptual quality metrics like Word Error Rate (WER) and Mean Opinion Score (MOS). The comparative analysis against static REPA baselines and other alignment strategies provides strong empirical support for the proposed method.
The paper outlines a clear methodology and experimental setup, but lacks specific implementation details or code availability, which could hinder reproducibility. The authors mention a "probe-then-intervene" training protocol that separates diagnostic probing from optimization, which is a good practice for ensuring clean experimental conditions.
One limitation is the lack of external validation on diverse datasets beyond LibriSpeech and AudioSet, which may limit the generalizability of the findings. Additionally, while the paper discusses potential risks associated with high-fidelity audio generation, it does not provide a detailed framework for mitigating these risks in practical applications.
The work has significant implications for the field of audio generation, particularly in enhancing the intelligibility and quality of synthesized speech and audio. The interpretability toolkit developed in this study could also pave the way for more transparent and controllable generative models in AI, addressing some of the ethical concerns surrounding deepfake technologies.
Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other unique features. Additionally, we enhance the evaluation of multi-track music quality by introducing rhythmic consistency through three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD). Experiments demonstrate that SyncTrack significantly improves the multi-track music quality by enhancing rhythmic consistency.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology
The paper presents SyncTrack, a novel model for synchronous multi-track music generation that significantly enhances rhythmic stability and synchronization. The technical contributions, including innovative architecture and evaluation metrics, position this work as a meaningful advancement in the field of machine learning for audio applications.
The paper introduces SyncTrack, a novel architecture for multi-track music generation that effectively addresses rhythmic stability and synchronization through the integration of track-shared and track-specific modules. The use of cross-track attention mechanisms is innovative, allowing for both global and time-specific synchronization of rhythms across tracks. The proposed metrics for evaluating rhythmic consistency (IRS, CBS, CBD) are well-conceived and fill a significant gap in the assessment of multi-track music generation quality. The methodology is clearly articulated, with a logical flow from problem identification to solution proposal.
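The precise definitions of IRS, CBS, and CBD are not reproduced in this summary; the sketch below shows one plausible (hypothetical) way to derive cross-track synchronization and dispersion statistics from per-track beat times, purely to illustrate what such metrics measure.

```python
import numpy as np

def cross_track_beat_stats(beat_times_per_track):
    """Hypothetical illustration: for every pair of tracks, match each beat in
    one track to the nearest beat in the other and collect the offsets.
    Mean offset ~ synchronization error, std of offsets ~ dispersion."""
    offsets = []
    tracks = [np.asarray(b) for b in beat_times_per_track]
    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            for t in tracks[i]:
                offsets.append(np.min(np.abs(tracks[j] - t)))
    offsets = np.array(offsets)
    return offsets.mean(), offsets.std()

drums = [0.00, 0.50, 1.00, 1.50]
bass  = [0.02, 0.51, 0.98, 1.52]
sync_err, dispersion = cross_track_beat_stats([drums, bass])
print(f"mean offset {sync_err:.3f}s, dispersion {dispersion:.3f}s")
```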
The experiments are comprehensive, utilizing both objective metrics (FAD, IRS, CBS, CBD) and subjective evaluations to validate the performance of SyncTrack against state-of-the-art baselines. The results demonstrate significant improvements in both rhythmic stability and synchronization, with clear statistical backing. The use of the Slakh2100 dataset is appropriate for the task, and the ablation studies provide insights into the contributions of different components of the model.
The paper includes detailed implementation information, including training configurations, datasets, and evaluation metrics, which enhances reproducibility. However, the absence of a demo or project URL limits accessibility for other researchers wishing to replicate the work.
While the proposed metrics are robust, the paper does not address potential limitations in the generalizability of SyncTrack across different musical genres or styles. Additionally, the reliance on specific datasets may introduce biases that affect the model's performance in broader applications.
The advancements in multi-track music generation have significant implications for the music industry, particularly in areas such as music production, remixing, and creative applications. By improving rhythmic stability and synchronization, SyncTrack could enhance the quality of generated music, making it more suitable for professional use.
Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.
Primary: Weizmann Institute of Science
All Institutions: Weizmann Institute of Science
The main contribution of this paper is the introduction of VoxKnesset, a large-scale longitudinal Hebrew speech dataset that enables the study of aging effects on speech, along with a comprehensive evaluation of modern speech embeddings in this context. This work represents a significant advancement in the field of speech processing, particularly for underrepresented languages and demographic studies.
The methodology employed in VoxKnesset is robust and well-structured, focusing on the creation of a large-scale longitudinal speech dataset specifically for Hebrew parliamentary speech. The authors detail a multi-stage alignment pipeline that addresses common issues in audio processing, such as timestamp inconsistencies and transcript normalization artifacts. The use of verified demographic metadata enhances the dataset's reliability. The longitudinal aspect of the dataset is particularly noteworthy, as it allows for the examination of vocal changes over time, a significant advancement over traditional cross-sectional datasets. The benchmarking of modern speech embeddings on age prediction and speaker verification is methodologically sound, providing a clear framework for evaluating the impact of aging on speech characteristics.
The experiments conducted are thorough and well-articulated, demonstrating the dataset's utility in real-world applications. The authors benchmark several state-of-the-art speech embeddings, providing a comprehensive analysis of their performance in both age prediction and speaker verification tasks. The results clearly show the degradation of speaker verification performance over time, highlighting the importance of longitudinal data in understanding vocal aging. The cross-dataset evaluations further validate the dataset's applicability and the robustness of the findings across different languages and contexts. The use of various metrics, including Mean Absolute Error (MAE) and Equal Error Rate (EER), adds depth to the evaluation.
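A minimal sketch of the kind of longitudinal evaluation described here: group verification trials by the year gap between the two recordings and report EER per bucket. The bucket width, scoring model, and synthetic trial data are all assumptions; only the overall protocol follows the description above.

```python
import numpy as np
from sklearn.metrics import roc_curve

def eer(labels, scores):
    """EER: the operating point where the false-positive rate equals the miss rate."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = np.nanargmin(np.abs(fpr - (1 - tpr)))
    return (fpr[idx] + (1 - tpr[idx])) / 2

def eer_by_year_gap(trials, bucket_years=5):
    """Group verification trials by year gap and report EER per bucket."""
    buckets = {}
    for t in trials:
        key = int(t["year_gap"] // bucket_years)
        b = buckets.setdefault(key, {"labels": [], "scores": []})
        b["labels"].append(t["same_speaker"])
        b["scores"].append(t["score"])
    return {f"{k * bucket_years}-{(k + 1) * bucket_years}y": eer(v["labels"], v["scores"])
            for k, v in sorted(buckets.items())}

# Synthetic trials: same-speaker scores degrade slightly as the year gap grows.
rng = np.random.default_rng(0)
trials = []
for _ in range(3000):
    gap = rng.uniform(0, 15)
    same = int(rng.random() < 0.5)
    mean = (2.0 - 0.08 * gap) if same else -1.0
    trials.append({"year_gap": gap, "same_speaker": same, "score": rng.normal(mean, 1.0)})
print(eer_by_year_gap(trials))
```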
While the paper mentions that the dataset and processing pipeline will be publicly released, specific implementation details are somewhat lacking. The authors provide a general overview of their methods, but more granular details about the experimental setup, hyperparameters, and the exact processing steps would enhance reproducibility. Clear documentation and access to code would be beneficial for other researchers looking to replicate or build upon this work.
The paper acknowledges several limitations, including the dataset's focus on a single speech register (parliamentary debate) and a demographic skew towards older adults. Additionally, the authors note that recording conditions may have evolved over the 16 years, which could introduce confounding variables. The challenge of disentangling channel drift from biological aging is also recognized, indicating that further research is needed to fully understand these dynamics.
The VoxKnesset dataset has significant implications for various applications, including biometric security, automated transcription, and health diagnostics. By addressing the aging of vocal characteristics, this work could lead to more robust and reliable speech processing systems that can adapt to individual changes over time. The dataset's release will likely stimulate further research in Hebrew speech processing and aging-related studies, contributing to the broader field of machine learning and speech technology.
Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.
Primary: Weizmann Institute of Science
All Institutions: Weizmann Institute of Science
The paper introduces VoxKnesset, a comprehensive longitudinal Hebrew speech dataset that addresses the critical challenge of vocal aging in speech processing systems. This work significantly contributes to the field by providing a large-scale resource that enables rigorous longitudinal evaluation and benchmarking of speech models, ultimately advancing the understanding of age-related changes in voice and their implications for technology.
The methodology employed in this paper is robust, leveraging a multi-stage alignment pipeline for audio processing and employing modern speech embeddings for age prediction and speaker verification. The use of verified demographic metadata enhances the dataset's reliability, and the longitudinal design addresses a significant gap in existing speech datasets. The experiments are well-structured, comparing various models and providing a comprehensive analysis of their performance over time.
The experiments are thorough, utilizing a large-scale dataset that spans 16 years and includes a diverse set of speakers. The benchmarking against established models and datasets adds credibility to the findings. The results demonstrate the degradation of speaker verification performance over time and highlight the importance of longitudinal training for capturing aging signals, which is a crucial contribution to the field.
The paper mentions that the dataset and processing pipeline will be publicly available, which is essential for reproducibility. However, specific implementation details, such as the exact configurations of the models used, are not fully disclosed, which could hinder complete reproducibility.
The dataset is limited to a single speech register (parliamentary debate) and may not capture the full diversity of speech variations across different contexts. Additionally, the demographic skew towards older adults and potential changes in recording conditions over the years may affect the generalizability of the findings.
The implications of this research are significant for various applications, including biometric security, health diagnostics, and aging-aware voice technologies. The dataset can serve as a foundational resource for future research in Hebrew speech processing and aging speaker modeling.
Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We introduce Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables. Our system has two key components: (1) a dynamic interface that surfaces only active sound classes and (2) a real-time, on-device multi-output extraction network that generates separate streams for each selected class, achieving robust performance for up to 5 overlapping target sounds, and letting users mix their environment by customizing per-class volumes, much like an audio engineer mixes tracks. We optimize the model architecture for multiple compute-limited platforms and demonstrate real-time performance on 6 ms streaming audio chunks. Across real-world environments in previously unseen indoor and outdoor scenarios, our system enables expressive per-class sound control and achieves substantial improvements in target-class enhancement and interference suppression. Our results show that the world need not be heard as a single, undifferentiated stream: with Aurchestra, the soundscape becomes truly programmable.
Primary: Paul G. Allen School of Computer Science and Engineering, University of Washington
All Institutions: Paul G. Allen School of Computer Science and Engineering, University of Washington, Hearvana AI
Aurchestra introduces a groundbreaking approach to soundscape control on hearables, enabling users to manipulate multiple sound sources independently in real-time. The combination of innovative methodology and practical applications positions this work as a significant contribution to the field of audio machine learning, although further enhancements in reproducibility and comparative analysis are necessary for broader acceptance.
The methodology presented in Aurchestra is innovative as it combines a dynamic interface with a real-time multi-output extraction network tailored for resource-constrained hearables. The authors detail a robust architecture that allows for the detection and manipulation of multiple overlapping sound classes, which is a significant advancement over traditional binary noise cancellation systems. The use of on-device processing is particularly noteworthy, as it addresses the limitations of latency and resource usage in mobile devices. However, the paper could benefit from a more detailed description of the model architecture and the specific algorithms used for sound class detection and extraction.
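Downstream of the extraction network, the per-class mixing stage is conceptually simple; the sketch below assumes the network already yields one waveform per detected class and applies user-chosen gains before summing. The class names, gains, and 16 kHz sample rate are assumptions; only the 6 ms chunk size comes from the abstract.

```python
import numpy as np

def mix_soundscape(class_streams, gains_db, default_db=0.0):
    """Apply per-class user gains (in dB) to separated streams and sum them
    back into a single output chunk, as an audio engineer would mix stems."""
    out = None
    for name, stream in class_streams.items():
        gain = 10 ** (gains_db.get(name, default_db) / 20.0)
        out = gain * stream if out is None else out + gain * stream
    return np.clip(out, -1.0, 1.0)

# A 6 ms chunk at an assumed 16 kHz sample rate is 96 samples.
rng = np.random.default_rng(0)
streams = {c: 0.1 * rng.standard_normal(96) for c in ("speech", "traffic", "birds")}
mixed = mix_soundscape(streams, gains_db={"traffic": -20.0, "birds": +6.0})
print(mixed.shape)
```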
The experimental evaluation is thorough, demonstrating the system's performance in real-world environments with diverse acoustic scenarios. The authors report substantial improvements in target-class enhancement and interference suppression, which are critical metrics for the effectiveness of soundscape control. However, the paper lacks a comprehensive comparison with existing methods, which would help contextualize the results and validate the claimed improvements. Additionally, further details on the datasets used for training and testing would enhance the credibility of the experimental findings.
The paper does not provide sufficient details regarding the implementation of the system, such as the specific datasets, training procedures, or hyperparameters used in the model. This lack of transparency could hinder reproducibility, which is a crucial aspect of machine learning research. Including a supplementary material section or a dedicated repository with code and data would significantly improve this aspect.
While the system shows promising results, there are limitations that need to be addressed. The performance in highly dynamic or noisy environments is not fully explored, and the scalability of the approach to more than five overlapping sounds is unclear. Additionally, the user interface design and user experience aspects are briefly mentioned but not thoroughly evaluated, which could impact the system's practical usability.
Aurchestra has the potential to revolutionize personal audio experiences, making it applicable in various fields such as augmented reality, hearing aids, and smart environments. By allowing users to customize their auditory experiences in real-time, it could enhance accessibility for individuals with hearing impairments and improve the overall quality of life in noisy urban settings. The implications for privacy and user control over their auditory environment are also significant, as this technology could empower users to manage their soundscapes actively.
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgments scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
Primary: Queen Mary University of London
All Institutions: Queen Mary University of London, Peking University, Technical University of Munich, Beijing University of Post and Telecommunications, SooChow University, University of Manchester, Hong Kong University of Science and Technology
This paper presents a comprehensive framework for evaluating music generation models through a novel benchmark and datasets, significantly advancing the state of the art in music reward modeling. The methodology is rigorous, and the results demonstrate a clear improvement in aligning model outputs with human preferences, making it a valuable contribution to the field of machine learning and music technology.
The paper introduces a novel framework for evaluating music generation models using Compositional Multimodal Instruction (CMI). It constructs two datasets—CMI-Pref-Pseudo and CMI-Pref—alongside a unified benchmark, CMI-RewardBench, which assesses models on multiple dimensions of musicality and alignment. The methodology is robust, utilizing both pseudo-labeling and expert annotations, and employs a parameter-efficient architecture for the reward models, allowing for effective processing of heterogeneous inputs. The two-stage training strategy enhances the model's performance by leveraging both large-scale pseudo-labeled data and high-quality human annotations.
The experiments are comprehensive, demonstrating the effectiveness of the proposed CMI-RM against existing baselines across various tasks. The results show strong correlations with human judgments, indicating that the proposed models can effectively evaluate music generation quality. The paper provides detailed metrics and comparisons, showcasing the advantages of the CMI-RewardBench in capturing the nuances of human preferences in music generation.
The authors have made their datasets, benchmark, and model weights publicly available, which enhances reproducibility. The detailed methodology, including the training protocols and evaluation metrics, is well-documented, allowing other researchers to replicate the experiments. However, the reliance on specific models for pseudo-labeling may introduce variability that is not fully accounted for.
One limitation is the potential bias in the pseudo-labeling process, which may affect the quality of the training data. Additionally, while the framework addresses the complexity of multimodal inputs, the evaluation may still be subjective, as musicality and alignment can vary significantly based on individual listener preferences. The paper also does not extensively discuss the scalability of the approach to larger datasets or different musical genres.
This work has significant implications for the field of music generation and evaluation, providing a structured approach to assess models that can handle complex multimodal inputs. The availability of the datasets and benchmark can spur further research and development in the field, promoting innovation in aligned music generation and improving the quality of AI-generated music in commercial applications. The methodology could also be adapted for use in other creative domains where multimodal inputs are prevalent.
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgments scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
Primary: Queen Mary University of London
All Institutions: Queen Mary University of London, Peking University, Technical University of Munich, Beijing University of Post and Telecommunications, SooChow University, University of Manchester, Hong Kong University of Science and Technology
The paper presents a novel framework for evaluating music reward models under compositional multimodal instructions, significantly enhancing the landscape of music generation evaluation. The combination of innovative methodologies, comprehensive datasets, and robust experimental validation positions this work as a meaningful contribution to the field of machine learning and music technology.
The paper introduces a comprehensive methodology for evaluating music generation models through the CMI-RewardBench framework. It effectively combines large-scale pseudo-labeled datasets with high-quality human annotations to create a robust evaluation system that addresses the complexities of multimodal inputs. The architecture of the CMI reward models (CMI-RMs) is well-structured, utilizing a two-tower multimodal approach that processes diverse input types (text, lyrics, audio) to predict musicality and alignment scores. The methodology is innovative in its integration of various datasets and the introduction of a unified benchmark, which is a significant advancement in the field of music generation evaluation.
The experimental evaluation is thorough, utilizing a variety of datasets, including the newly created CMI-Pref and CMI-Pref-Pseudo, alongside established benchmarks like PAM and MusicEval. The results demonstrate that CMI-RMs achieve competitive performance, particularly in correlating with human judgments on musicality and instruction adherence. The experiments also highlight the effectiveness of the proposed test-time scaling strategy, providing empirical evidence of the model's capabilities across different evaluation tasks. However, the paper could benefit from additional comparative analysis with more diverse models to further validate the claims.
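The inference-time scaling strategy referenced above amounts to best-of-k selection with the reward model; a minimal sketch, with hypothetical candidate ids and a dictionary standing in for the reward model, looks like this:

```python
def top_k_filter(candidates, reward_fn, k=1):
    """Inference-time scaling via reward filtering: score every generated
    candidate with the reward model and keep the k highest-scoring ones."""
    scored = sorted(candidates, key=reward_fn, reverse=True)
    return scored[:k]

# Hypothetical stand-ins: candidates are clip ids, the reward model is a dict lookup.
fake_rewards = {"clip_a": 0.62, "clip_b": 0.81, "clip_c": 0.44, "clip_d": 0.77}
best = top_k_filter(list(fake_rewards), reward_fn=fake_rewards.get, k=2)
print(best)  # -> ['clip_b', 'clip_d']
```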
The authors provide clear links to the datasets and code repositories, which is crucial for reproducibility. The detailed description of the model architecture, training strategies, and evaluation protocols enhances the likelihood that other researchers can replicate the results. However, the paper could improve by including more specific implementation details, such as hyperparameter settings and computational resources used during training.
One limitation is the reliance on pseudo-labels for a significant portion of the data, which may introduce noise or bias in the training process. Additionally, while the CMI-RewardBench framework is comprehensive, it may not cover all possible multimodal scenarios in music generation, potentially limiting its applicability. The paper also does not address the potential copyright issues related to the generated music, which could pose ethical concerns in practical applications.
This work has the potential to significantly advance the evaluation of music generation systems, providing a more nuanced understanding of how these models align with human preferences. The public availability of datasets and models encourages further research and development in the field, promoting innovation in AI-generated music. The implications extend to various applications in the creative industries, including music production, film scoring, and interactive entertainment.
Test-Time Scaling has shown notable efficacy in addressing complex problems through scaling inference compute. However, within Large Audio-Language Models (LALMs), a counterintuitive phenomenon exists: post-training models for structured reasoning trajectories yields marginal or even negative gains compared to post-training for direct answering. To investigate this phenomenon, we introduce CAFE, an evaluation framework designed to precisely quantify audio reasoning errors. Evaluation results reveal that LALMs struggle with perception during reasoning and encounter a critical bottleneck: reasoning performance suffers from audio perception decay as reasoning length extends. To address this bottleneck, we propose MPAR$^2$, a paradigm that encourages dynamic perceptual reasoning and decomposes complex questions into perception-rich sub-problems. Leveraging reinforcement learning, MPAR$^2$ improves perception performance on CAFE from 31.74% to 63.51% and effectively mitigates perception decay, concurrently enhancing reasoning capabilities to achieve 74.59% accuracy on the MMAU benchmark. Further analysis demonstrates that MPAR$^2$ trains LALMs to attend to the audio input and to adapt their reasoning budget dynamically to task complexity.
Primary: Northeastern University
All Institutions: Northeastern University, NiuTrans Research
The main contribution of this paper is the introduction of a comprehensive evaluation framework and a novel training paradigm to address the challenges of audio perception decay in LALMs. This work significantly advances the understanding of reasoning in audio contexts and proposes actionable solutions to improve model performance, marking a valuable addition to the field of machine learning.
The paper introduces a novel framework, CAFE, for evaluating audio reasoning errors in Large Audio-Language Models (LALMs) and proposes MPAR², a two-stage training strategy that combines supervised learning and reinforcement learning to enhance audio perception and reasoning. The methodology is well-structured, addressing the identified issue of audio perception decay during extended reasoning processes. The detailed description of the training phases and the innovative reward mechanisms (perception reward, stepwise perception-reasoning reward, and review-enhanced accuracy reward) provide a comprehensive approach to tackling the problem.
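The review names the three reward signals but not their formulas, so the sketch below is only a hypothetical composition illustrating how perception, stepwise, and review-enhanced accuracy terms might be combined into one scalar for reinforcement learning; the weights and definitions are invented for illustration and are not the authors' reward design.

```python
def composite_reward(perception_score, step_scores, answer_correct,
                     reviewed_correct, w_p=0.3, w_s=0.3, w_a=0.4):
    """Hypothetical combination of the three reward signals named in the review.

    perception_score: fraction of audio facts correctly restated (0..1)
    step_scores:      per-step scores coupling perception and reasoning (0..1 each)
    answer_correct:   1.0 if the final answer is correct, else 0.0
    reviewed_correct: 1.0 if the answer also survives a self-review pass, else 0.0
    """
    stepwise = sum(step_scores) / max(len(step_scores), 1)
    # "Review-enhanced" accuracy: correct answers earn more when the model's
    # own review also confirms them.
    accuracy = answer_correct * (0.5 + 0.5 * reviewed_correct)
    return w_p * perception_score + w_s * stepwise + w_a * accuracy

# Example: strong perception, decent steps, correct and confirmed answer.
r = composite_reward(0.9, [0.8, 0.7, 1.0], 1.0, 1.0)
print(round(r, 3))  # 0.3*0.9 + 0.3*0.833 + 0.4*1.0 = 0.92
```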
The experiments are robust, utilizing multiple benchmarks (MMAU and MMAR) to validate the effectiveness of MPAR². The results demonstrate significant improvements in perception accuracy and reasoning performance, with clear metrics defined for evaluation. The paper includes a thorough analysis of the results, showing how the proposed methods outperform existing models and effectively mitigate perception decay.
The paper provides sufficient details regarding the experimental setup, including model configurations, training procedures, and evaluation metrics. However, the absence of a demo URL and the reliance on external datasets may hinder full reproducibility. The provided GitHub link for the framework is a positive aspect, as it allows for further exploration of the methodology.
While the paper presents a strong framework, it does not address potential scalability issues or the computational costs associated with the proposed methods. Additionally, the focus on audio reasoning may limit the generalizability of the findings to other modalities or tasks. The paper could benefit from a more extensive discussion on the implications of the findings for future research.
The research has the potential to significantly impact the field of audio processing and multimodal learning by enhancing the reasoning capabilities of LALMs. Improved audio perception could lead to advancements in applications such as automated transcription, audio-based question answering, and interactive AI systems that require a nuanced understanding of audio content.
The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache, which is problematic for many applications, especially those involving long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across the encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our approach allows a pretrained Whisper model to be converted to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
Primary: Tianjin University
All Institutions: Tianjin University
The paper presents Whisper-MLA, a novel architecture that reduces GPU memory consumption in ASR models while preserving performance. This work is significant as it addresses a critical bottleneck in deploying state-of-the-art ASR systems, particularly for long-form audio applications, thereby enhancing accessibility and usability in various practical contexts.
The proposed methodology introduces a novel architecture, Whisper-MLA, which effectively integrates Multi-Head Latent Attention (MLA) into the Whisper model. The authors adapt MLA specifically for absolute positional embeddings, which is a significant innovation given the existing limitations of applying MLA to encoder-decoder architectures. The systematic investigation of MLA's application across different attention modules is commendable, and the decision to focus on decoder self-attention for optimization reflects a well-thought-out approach to balancing memory efficiency and performance.
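For readers unfamiliar with MLA, the sketch below illustrates the core idea of caching a single low-rank latent per token instead of full per-head keys and values. The dimensions, the treatment of positional embeddings, and the cache handling are assumptions for illustration and do not reproduce the authors' Whisper-MLA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Simplified MLA-style self-attention: the KV cache stores only a
    low-rank latent per token; keys and values are reconstructed from it
    at attention time. Dimensions are illustrative, not the paper's."""

    def __init__(self, d_model=1280, n_heads=20, d_latent=320):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)  # compressed latent -> cache this
        self.k_up = nn.Linear(d_latent, d_model)     # reconstruct keys
        self.v_up = nn.Linear(d_latent, d_model)     # reconstruct values
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                     # (b, t, d_latent)
        if kv_cache is not None:                     # append during decoding
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        out = attn.transpose(1, 2).reshape(b, t, -1)
        return self.out(out), latent                 # latent is the new cache
```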
The experiments conducted on the LibriSpeech benchmark are extensive and demonstrate the effectiveness of the Whisper-MLA model in reducing GPU memory consumption significantly while maintaining competitive accuracy. The results clearly show that the proposed model achieves up to 87.5% reduction in KV cache size, which is a critical metric for real-world applications, especially in resource-constrained environments. The comparative analysis with the original Whisper model provides a solid basis for the claims made.
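As a rough sanity check on the headline figure, the snippet below shows one latent size that would be consistent with an 87.5% per-token reduction in the decoder self-attention cache; the actual dimensions used in the paper are not stated in this review, so this is only illustrative arithmetic.

```python
# Back-of-the-envelope cache size per decoder layer, per token (fp16 = 2 bytes).
d_model = 1280                    # Whisper-large decoder width (illustrative)
mha_cache = 2 * d_model * 2       # full keys + values  -> 5120 bytes
d_latent = d_model // 4           # one latent size consistent with 87.5% savings
mla_cache = d_latent * 2          # single shared latent -> 640 bytes
print(1 - mla_cache / mha_cache)  # 0.875, i.e. an 87.5% reduction
```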
The paper provides sufficient details regarding the experimental setup, including the model architecture, training parameters, and the dataset used. However, no link to a code repository is given, which limits the reproducibility of the results: the authors state that their source code is publicly available, but the text provides no specific URL, a missed opportunity for enhancing reproducibility.
One limitation is that the paper primarily focuses on the decoder self-attention mechanism, potentially overlooking the benefits that could be derived from optimizing other components of the model. Additionally, while the results are promising, the experiments are limited to the LibriSpeech dataset, which may not fully represent the model's performance across diverse ASR tasks and environments.
The Whisper-MLA architecture has significant implications for deploying large-scale ASR models in real-world applications, particularly in scenarios where GPU memory is a limiting factor. By reducing memory consumption while maintaining performance, this work could facilitate the use of advanced ASR technologies in mobile devices, embedded systems, and other resource-constrained environments.
Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. Although recent diffusion-based models produce high-quality generations, they suffer from high training and inference costs on long symbolic sequences due to iterative denoising and sequence-length-dependent costs. To address this problem, we propose a diffusion strategy named SMDIM that combines efficient global structure construction with lightweight local refinement. SMDIM uses structured state space models to capture long-range musical context at near-linear cost and selectively refines local musical details via a hybrid refinement scheme. Experiments on a wide range of symbolic music datasets, encompassing Western classical, popular, and traditional folk music, show that SMDIM outperforms state-of-the-art approaches in both generation quality and computational efficiency and generalizes robustly to underexplored musical styles. These results indicate that SMDIM offers a principled solution for long-sequence symbolic music generation, including the attributes that accompany such sequences. We provide a project webpage with audio examples and supplementary materials at https://3328702107.github.io/smdim-music/.
Primary: Hubei University of Technology
All Institutions: Hubei University of Technology, Hubei Key Laboratory of Digital Finance Innovation, Hubei University of Economics, Wuhan University of Technology
The main contribution of this paper is the introduction of SMDIM, a novel diffusion-based architecture that effectively addresses the challenges of long-sequence symbolic music generation by integrating structured state space models and hybrid refinement techniques. This work represents a meaningful advancement in the field, offering both theoretical insights and practical applications that could impact the future of music generation technologies.
The proposed SMDIM framework innovatively integrates structured state space models with diffusion modeling to address the challenges of long-sequence symbolic music generation. The methodology is well-articulated, with a clear explanation of the hybrid architecture that balances global structure modeling and local detail refinement. The introduction of the MFA block, which combines Mamba layers, feed-forward networks, and self-attention, is a significant contribution that enhances both efficiency and expressiveness. The theoretical underpinnings are solid, and the approach is tailored to the unique requirements of symbolic music, making it a meaningful advancement in the field.
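The sketch below gives one structural reading of such a hybrid block: a long-range mixing layer (stood in for here by a gated causal convolution rather than a true Mamba/selective-scan layer), windowed self-attention for local detail, and a feed-forward network, each pre-normed with residual connections. The specific components, window size, and dimensions are assumptions, not the paper's MFA architecture.

```python
import torch
import torch.nn as nn

class MFABlock(nn.Module):
    """Structural sketch of a Mamba/FFN/attention-style hybrid block."""

    def __init__(self, d_model=512, n_heads=8, window=64, kernel=7):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        # Stand-in for the state-space layer: gated causal depthwise conv.
        self.dw_conv = nn.Conv1d(d_model, d_model, kernel, groups=d_model,
                                 padding=kernel - 1)
        self.gate = nn.Linear(d_model, d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.window = window  # local attention window for detail refinement
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        b, t, _ = x.shape
        # 1) long-range mixing (stand-in for the SSM component)
        h = self.norm1(x)
        conv = self.dw_conv(h.transpose(1, 2))[:, :, :t].transpose(1, 2)
        x = x + torch.sigmoid(self.gate(h)) * conv
        # 2) local self-attention over a sliding window (banded boolean mask)
        h = self.norm2(x)
        idx = torch.arange(t, device=x.device)
        mask = (idx[None, :] - idx[:, None]).abs() > self.window  # True = blocked
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        # 3) position-wise feed-forward
        return x + self.ffn(self.norm3(x))
```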
The experimental evaluation is robust, utilizing a diverse set of datasets (MAESTRO, POP909, and FolkDB) that cover various musical styles. The paper presents comprehensive results, demonstrating that SMDIM outperforms state-of-the-art models in both generation quality and computational efficiency. The use of objective metrics such as average overlap area (OA) provides a quantitative basis for the claims, and the subjective evaluations through listening tests add depth to the assessment of musical quality. The ablation studies further strengthen the findings by elucidating the contributions of different components within the model.
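For context, an overlap-area style metric can be computed as the shared area under two normalized feature histograms, as in the hedged sketch below; the features, binning, and aggregation actually used in the paper are not specified in this review, so these choices are assumptions.

```python
import numpy as np

def overlap_area(feat_generated, feat_reference, bins=32, value_range=None):
    """Shared area under two normalized histograms of a musical feature
    (e.g. pitch or inter-onset interval); 1.0 means identical distributions."""
    if value_range is None:
        lo = min(np.min(feat_generated), np.min(feat_reference))
        hi = max(np.max(feat_generated), np.max(feat_reference))
        value_range = (lo, hi)
    p, edges = np.histogram(feat_generated, bins=bins, range=value_range, density=True)
    q, _ = np.histogram(feat_reference, bins=bins, range=value_range, density=True)
    width = edges[1] - edges[0]
    return float(np.sum(np.minimum(p, q)) * width)

# Example: pitch distributions (MIDI note numbers) of generated vs. reference pieces.
gen = np.random.normal(64, 8, 500)
ref = np.random.normal(62, 7, 500)
print(overlap_area(gen, ref))  # approaches 1.0 as the distributions match
```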
The paper includes sufficient details about the training process, hyperparameters, and model architecture, which are essential for reproducibility. However, the absence of a public code repository limits the ease of reproduction. While the methodology is described in detail, having an accessible implementation would enhance the ability of other researchers to validate and build upon this work.
The paper acknowledges certain limitations, such as the model's tendency to produce musically implausible pitch ranges and overly dense vertical note stacking. Additionally, the structural coherence of generated music may degrade in longer compositions, indicating challenges in maintaining global musical form. These limitations suggest areas for future research, including the incorporation of constraints to improve musical plausibility and coherence.
The implications of this research are significant for the fields of music generation and multimedia content creation. By improving the efficiency and quality of symbolic music generation, SMDIM could facilitate advancements in automated music composition, interactive music applications, and educational tools for music theory. The model's ability to generalize across diverse musical styles also opens avenues for cross-cultural music generation, potentially enriching the landscape of automated music creation.