We propose \textbf{U-Codec}, an \textbf{U}ltra-low frame-rate neural speech \textbf{Codec} that achieves high-fidelity reconstruction and fast speech generation at an extremely low frame rate of 5Hz (5 frames per second). Because extreme compression at 5Hz typically causes severe loss of intelligibility and spectral detail, we introduce a Transformer-based inter-frame long-term dependency module and systematically explore residual vector quantization (RVQ) depth and codebook size to identify optimal configurations. Moreover, we apply U-Codec to a large language model (LLM)-based auto-regressive TTS model, which leverages a hierarchical global-local architecture to effectively capture dependencies across multi-layer tokens. We extend LLM-based TTS from 3-layer RVQ at 50Hz to 32-layer RVQ at 5Hz. Experimental results demonstrate that U-Codec improves LLM-based TTS inference speed by around 3$\times$ over high-frame-rate codecs while maintaining speaker similarity and naturalness. These results validate the feasibility of using highly compressed 5Hz discrete tokens for fast and high-fidelity speech synthesis.
Primary: Peking University
All Institutions: Peking University, Tencent AILAB Group
The U-Codec presents a significant advancement in neural speech codecs, achieving high-fidelity speech synthesis at an unprecedented low frame rate of 5Hz, thereby enhancing computational efficiency while maintaining quality. The innovative methodology and comprehensive experimental validation position this work as a notable contribution to the field of machine learning and audio processing.
The proposed U-Codec introduces a novel architecture that combines a Transformer-based inter-frame long-term dependency module with a hierarchical global-local Transformer architecture, effectively addressing the challenges of speech synthesis at ultra-low frame rates. The systematic exploration of residual vector quantization (RVQ) depth and codebook size is well-structured, providing a comprehensive approach to optimizing speech quality under extreme compression. The introduction of the Codecformer network is particularly innovative, as it allows for efficient modeling of long sequences while maintaining high fidelity.
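To make the RVQ depth and codebook-size trade-off discussed above concrete, here is a minimal residual vector quantization sketch; the layer count, codebook size, and tensor shapes are illustrative assumptions, not U-Codec's actual configuration.

```python
import numpy as np

def rvq_encode(frames, codebooks):
    """Residual VQ: each layer quantizes the residual left by the previous one.

    frames:    (T, D) array of encoder outputs (e.g. T = 5 per second at 5Hz).
    codebooks: list of (K, D) arrays, one per RVQ layer (e.g. 32 layers, K = 1024).
    Returns per-layer code indices of shape (num_layers, T) and the reconstruction.
    """
    residual = frames.copy()
    recon = np.zeros_like(frames)
    indices = []
    for cb in codebooks:
        # nearest codeword for the current residual
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = d.argmin(axis=1)
        quantized = cb[idx]
        recon += quantized
        residual -= quantized
        indices.append(idx)
    return np.stack(indices), recon

# Illustrative shapes only: 32 layers, 1024-entry codebooks, 128-dim latents.
rng = np.random.default_rng(0)
codebooks = [rng.standard_normal((1024, 128)) for _ in range(32)]
codes, recon = rvq_encode(rng.standard_normal((5, 128)), codebooks)
```

Deeper stacks lower the per-layer burden but lengthen the token sequence the downstream LLM must model, which is exactly the tension the global-local architecture is meant to resolve.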
The experiments are robust, utilizing a large-scale multilingual dataset and a variety of evaluation metrics such as WER, PESQ, and STOI to assess performance. The results demonstrate significant improvements in inference speed and speech quality compared to existing high-frame-rate codecs, establishing a new benchmark in the field. However, the paper could benefit from additional comparative analyses with more recent state-of-the-art methods.
The paper provides a clear description of the training setup, datasets, and evaluation metrics, which enhances reproducibility. The release of a demo and code is a positive aspect, allowing others to validate the findings. However, specific implementation details, such as hyperparameters and training configurations, could be more thoroughly documented.
While the U-Codec shows promising results, it does not yet match the performance of certain high-frame-rate systems in terms of PESQ and SPK-SIM. Additionally, the complexity of the model increases with deeper RVQ stacks, which may limit practical applications in resource-constrained environments.
The U-Codec has the potential to significantly impact real-time speech synthesis applications, especially in scenarios where computational efficiency is critical, such as mobile devices and low-latency communication systems. Its ability to maintain high fidelity at ultra-low frame rates could lead to advancements in various fields, including virtual assistants, gaming, and accessibility tools.
Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of DELULU, a speaker-aware self-supervised foundational model that significantly enhances speaker-discriminative feature extraction in speech processing tasks. The innovative integration of external supervision and a dual training objective positions this work as a substantial advancement in the field of self-supervised learning for audio applications.
The methodology presented in DELULU is innovative, particularly in its integration of external supervision into the pseudo-label generation process. By utilizing frame-level embeddings from ReDimNet, the authors effectively introduce a speaker-discriminative inductive bias that enhances representation learning. The dual objective of masked prediction and denoising is a thoughtful approach that likely contributes to the model's robustness and generalization capabilities. However, the paper could benefit from a more detailed explanation of the k-means clustering step and how it interacts with the overall training process.
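To clarify how the clustering step can interact with training, the sketch below generates speaker-aware pseudo-labels by running k-means over frame-level embeddings from an external speaker model; the embedding-extraction step is a placeholder and the cluster count is an illustrative assumption, not DELULU's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def speaker_aware_pseudo_labels(frame_embeddings, n_clusters=500, seed=0):
    """Cluster frame-level speaker embeddings into pseudo-labels.

    frame_embeddings: list of (T_i, D) arrays, one per utterance, produced by an
    external speaker model (ReDimNet in DELULU's case; how those frames are
    extracted is not shown here). The returned per-frame integer labels can then
    serve as targets for masked prediction during self-supervised pre-training.
    """
    stacked = np.concatenate(frame_embeddings, axis=0)        # (sum_i T_i, D)
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(stacked)
    labels = km.predict(stacked)
    lengths = [e.shape[0] for e in frame_embeddings]
    return np.split(labels, np.cumsum(lengths)[:-1])          # per-utterance label sequences
```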
The experimental setup is comprehensive, with a clear focus on speaker-centric tasks. The reported results, including a 62% relative improvement in EER for speaker verification, are impressive and demonstrate the effectiveness of DELULU. The inclusion of zero-shot profiling tasks adds depth to the evaluation, showcasing the model's versatility. However, the paper lacks a comparison with a broader range of existing models, which would strengthen the claims of superiority.
The paper does not provide sufficient details on implementation, such as hyperparameters, training duration, or dataset specifics, which are crucial for reproducibility. While the results are promising, the absence of a code repository or supplementary materials limits the ability for other researchers to replicate the findings.
The paper acknowledges limitations, particularly in the reliance on external supervision, which may not be available in all scenarios. Additionally, the model's performance on diverse datasets beyond the ones tested could be a concern, as generalizability is critical in real-world applications.
DELULU has significant potential applications in speaker verification, diarization, and profiling, which are increasingly relevant in various sectors, including security and customer service. The model's ability to operate effectively without task-specific fine-tuning is particularly noteworthy, as it could facilitate broader adoption in practical applications.
The growing demand for home healthcare calls for tools that can support care delivery. In this study, we explore automatic health assessment from voice using real-world home care visit data, leveraging the diverse patient information it contains. First, we utilize Large Language Models (LLMs) to integrate Subjective, Objective, Assessment, and Plan (SOAP) notes derived from unstructured audio transcripts and structured vital signs into a holistic illness score that reflects a patient's overall health. This compact representation facilitates cross-visit health status comparisons and downstream analysis. Next, we design a multi-stage preprocessing pipeline to extract short speech segments from target speakers in home care recordings for acoustic analysis. We then employ an Audio Language Model (ALM) to produce plain-language descriptions of vocal biomarkers and examine their association with individuals' health status. Our experimental results benchmark both commercial and open-source LLMs in estimating illness scores, demonstrating their alignment with actual clinical outcomes, and revealing that SOAP notes are substantially more informative than vital signs. Building on the illness scores, we provide the first evidence that ALMs can identify health-related acoustic patterns from home care recordings and present them in a human-readable form. Together, these findings highlight the potential of LLMs and ALMs to harness heterogeneous in-home visit data for better patient monitoring and care.
Primary: Columbia University
All Institutions: Columbia University, Department of Computer Science, Department of Electrical Engineering, School of Nursing, The Fu Foundation School of Engineering and Applied Science
This study presents a pioneering approach to health assessment by leveraging LLMs and ALMs to analyze vocal biomarkers from home healthcare data. The innovative methodology and promising results indicate a significant step forward in the application of AI in healthcare, although further work is needed to address reproducibility and implementation challenges.
The methodology presented in this paper is robust, combining LLMs and ALMs to create a novel framework for health assessment based on vocal biomarkers. The integration of SOAP notes and vital signs into a unified illness score is innovative, allowing for a more holistic view of patient health. The multi-stage preprocessing pipeline for acoustic analysis is well-designed, addressing challenges inherent in real-world data collection. However, the reliance on LLMs for generating SOAP notes and illness scores raises questions about potential biases in the model outputs and the interpretability of the generated scores.
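As an illustration of the score-fusion step, the sketch below prompts an LLM to combine a SOAP note and vital signs into a single numeric score; `call_llm` is a hypothetical client stub, and the prompt wording and 0-10 scale are assumptions rather than the paper's exact protocol.

```python
import json

def illness_score(soap_note: str, vitals: dict, call_llm) -> float:
    """Ask an LLM to fuse a SOAP note and vital signs into one illness score.

    `call_llm` stands in for whatever chat-completion client is available; it
    takes a prompt string and returns the model's text reply.
    """
    prompt = (
        "You are assisting with home-care patient monitoring.\n"
        f"SOAP note:\n{soap_note}\n\n"
        f"Vital signs: {json.dumps(vitals)}\n\n"
        "Rate the patient's overall illness severity from 0 (healthy) to 10 "
        "(critical). Respond with JSON: {\"score\": <number>, \"rationale\": <string>}."
    )
    reply = call_llm(prompt)
    return float(json.loads(reply)["score"])
```

A compact score of this kind is what makes cross-visit comparisons and the later correlation with acoustic descriptions tractable.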
The experimental evaluation is thorough, with a clear focus on benchmarking various LLMs and ALMs. The use of real-world home care visit data adds significant value, as it reflects authentic patient-clinician interactions. The results demonstrate that LLM-generated illness scores align well with clinical outcomes, providing evidence of the method's effectiveness. However, the paper could benefit from more detailed statistical analysis and comparisons with traditional health assessment methods to strengthen the claims made.
The paper provides a comprehensive overview of the models and methods used, including specific LLMs and ALMs. However, it lacks detailed information on the implementation and access to the datasets, which may hinder reproducibility. The absence of a publicly accessible code repository or demo further limits the ability for others to replicate the study.
The study acknowledges several limitations, including the challenges of background noise and speaker overlap in real-world recordings. Additionally, the focus on the first 30 seconds of speech may overlook important acoustic cues that could emerge later in the conversation. The potential for LLMs to rely on contextual information rather than purely acoustic signals is another concern that warrants further investigation.
This research has significant implications for the future of home healthcare, particularly in enhancing patient monitoring through voice analysis. The findings suggest that vocal biomarkers can serve as valuable supplementary indicators of health status, which could lead to more timely interventions and improved patient outcomes. The approach also highlights the potential for integrating AI technologies into clinical practice, paving the way for more personalized and efficient healthcare solutions.
Multimodal respiratory sound classification offers promise for early pulmonary disease detection by integrating bioacoustic signals with patient metadata. Nevertheless, current approaches remain vulnerable to spurious correlations from attributes such as age, sex, or acquisition device, which hinder their generalization, especially under distribution shifts across clinical sites. To address this, we propose a counterfactual adversarial debiasing framework. First, we employ a causal graph-based counterfactual debiasing strategy to suppress non-causal dependencies on patient metadata. Second, we introduce adversarial debiasing to learn metadata-insensitive representations and reduce metadata-specific biases. Third, we design counterfactual metadata augmentation to further mitigate spurious correlations and strengthen metadata-invariant representations. With these components, our method consistently outperforms strong baselines under both in-distribution and distribution-shift evaluations. The code is available at https://github.com/RSC-Toolkit/BTS-CARD.
Primary: University College London
All Institutions: University College London, MODULABS, RSC LAB, Republic of Korea
The main contribution of this paper is the development of a novel counterfactual adversarial debiasing framework that effectively mitigates spurious correlations in multimodal respiratory sound classification, enhancing model robustness and generalization across clinical environments. The integration of causal reasoning with adversarial training represents a significant advancement in the field, addressing critical challenges in deploying AI for healthcare applications.
The proposed methodology, BTS-CARD, integrates counterfactual reasoning with adversarial debiasing to address the challenges of spurious correlations in multimodal respiratory sound classification. The use of causal graphs to identify and mitigate non-causal dependencies is innovative and adds a robust theoretical foundation to the approach. The combination of counterfactual debiasing and adversarial training is well-justified, and the introduction of counterfactual metadata augmentation further enhances the model's ability to generalize across different clinical environments. However, the methodology could benefit from clearer explanations of the causal graph construction and the specific mechanisms of the adversarial training process.
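Adversarial debiasing of this kind is commonly implemented with a gradient reversal layer feeding a metadata classifier; the sketch below shows that generic pattern, not the paper's exact BTS-CARD architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated (scaled) gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class MetadataAdversary(nn.Module):
    """Predicts a metadata attribute (e.g. acquisition device) from shared features.
    The reversed gradient pushes the encoder toward metadata-invariant representations."""
    def __init__(self, feat_dim, n_meta_classes, lam=1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, n_meta_classes))

    def forward(self, features):
        return self.head(GradReverse.apply(features, self.lam))

# Training-step sketch (encoder/classifier are assumed modules):
# feats = encoder(audio, metadata)
# loss = ce(classifier(feats), disease_labels) + ce(adversary(feats), device_labels)
```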
The experiments are comprehensive, utilizing two distinct datasets (ICBHI and SPRSound) to evaluate both in-distribution and out-of-distribution performance. The results demonstrate significant improvements over strong baselines, particularly in OOD settings, which is a critical aspect of clinical AI applications. The ablation studies effectively highlight the contributions of each component of the proposed framework, reinforcing the importance of counterfactual debiasing and adversarial training. However, the paper could improve by providing more detailed statistical analyses and comparisons with additional state-of-the-art methods.
The paper provides a GitHub repository link for code access, which is a positive aspect for reproducibility. However, the implementation details could be expanded, particularly regarding hyperparameter tuning and the specific configurations used during training. Including a more detailed description of the datasets and preprocessing steps would also enhance reproducibility.
One limitation is the potential overfitting to the specific datasets used, which may not fully represent the diversity of real-world clinical settings. Additionally, while the method shows promise in improving OOD robustness, the trade-offs in sensitivity and specificity could pose challenges in clinical applications where false negatives are critical. The reliance on adversarial training may also introduce additional complexity in model training and deployment.
The research has significant implications for the deployment of AI in healthcare, particularly in enhancing the robustness of diagnostic tools against demographic and environmental variations. By addressing biases in respiratory sound classification, the proposed framework could lead to more equitable healthcare outcomes and improve the reliability of AI systems in diverse clinical settings. This work could pave the way for further research into causal inference and debiasing techniques in other medical domains.
This paper investigates the performance of Binaural Signal Matching (BSM) methods for near-field sound reproduction using a wearable glasses-mounted microphone array. BSM is a flexible, signal-independent approach for binaural rendering with arbitrary arrays, but its conventional formulation assumes far-field sources. In our previous work, we proposed a near-field extension of BSM (NF-BSM) that incorporates distance-dependent modeling and showed improved performance over far-field BSM using analytic data, though degradation persisted for sources very close to the array. In this study, we extend that analysis by using realistic simulated data of near-field Head-Related Transfer Functions (HRTFs) and Acoustic Transfer Functions (ATFs) of the array, accounting for listener head rotation and evaluating binaural cues such as interaural level and time differences (ILD and ITD). A key contribution is the introduction of a Field of View (FoV) weighting, designed to emphasize perceptually relevant directions and improve robustness under challenging conditions. Results from both simulation and a listening test confirm that NF-BSM outperforms traditional far-field BSM in near-field scenarios, and that the proposed NF-FoV-BSM method achieves the best perceptual and objective quality among all tested methods, particularly at close source distances and under head rotation. These findings highlight the limitations of far-field models for near-field sources and demonstrate that incorporating source distance and directional weighting can significantly improve binaural reproduction performance for wearable spatial audio systems.
Primary: Sebastian Prepelita
All Institutions: Boaz Rafaely, Sapir Goldring, Chad McKell, David Lou Alon, Sebastian Prepelita, Zamir Ben Hur
The paper makes a significant contribution by advancing the understanding of binaural sound reproduction in near-field scenarios, demonstrating that incorporating distance and directional weighting can enhance audio quality for wearable systems. The innovative approach and thorough evaluation position this work as a valuable resource for researchers and practitioners in the audio and machine learning fields.
The paper presents a comprehensive methodology for enhancing binaural sound reproduction in near-field scenarios using a wearable microphone array. The authors extend the Binaural Signal Matching (BSM) framework by incorporating realistic simulated data and a novel Field of View (FoV) weighting approach. The mathematical formulations are well-defined, and the introduction of mixed error criteria for different frequency ranges demonstrates a sophisticated understanding of the problem. However, the complexity of the models and the reliance on simulations may limit practical applicability without real-world validation.
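For readers unfamiliar with BSM, the core computation is a regularized least-squares fit of array filters to binaural targets; a generic FoV-weighted form, in notation chosen here rather than the paper's, is
\[
\mathbf{c}_{l/r}(f) \;=\; \arg\min_{\mathbf{c}} \sum_{q=1}^{Q} w_q \left| \mathbf{v}_q^{H}(f)\,\mathbf{c} - h_{l/r}(f,\Omega_q) \right|^2 + \lambda \|\mathbf{c}\|^2
\;\;\Longrightarrow\;\;
\mathbf{c}_{l/r}(f) = \left( \mathbf{V}\mathbf{W}\mathbf{V}^{H} + \lambda \mathbf{I} \right)^{-1} \mathbf{V}\mathbf{W}\,\mathbf{h}_{l/r},
\]
where $\mathbf{v}_q(f)$ is the array transfer-function vector for direction $\Omega_q$ (distance-dependent in the near-field case), $h_{l/r}(f,\Omega_q)$ the corresponding HRTF, $\mathbf{W}=\mathrm{diag}(w_1,\dots,w_Q)$ the FoV weights emphasizing perceptually relevant directions, and $\lambda$ a regularization constant.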
The experimental evaluation includes both simulation and listening tests, providing a robust framework for assessing the proposed methods. The results clearly show that the near-field BSM outperforms traditional far-field BSM, particularly in challenging conditions. The use of perceptual metrics such as interaural level and time differences adds depth to the evaluation. However, the paper could benefit from a larger variety of test conditions and more extensive user studies to validate the findings across diverse scenarios.
While the paper provides a detailed description of the methods and experimental setup, the lack of publicly available code or datasets limits reproducibility. The authors should consider sharing their simulation parameters and results to facilitate further research and validation by the community.
The primary limitation is the performance degradation observed at very close source distances, indicating that the current models may not fully capture the complexities of binaural reproduction in extreme near-field conditions. Additionally, the reliance on simulated data raises questions about the generalizability of the results to real-world applications.
The findings have significant implications for applications in virtual reality, teleconferencing, and wearable audio systems, where accurate binaural sound reproduction is crucial for immersive experiences. By addressing the limitations of existing far-field models, this research opens avenues for improved audio technologies that can adapt to dynamic environments and user interactions.
Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PromptReverb, a novel two-stage generative framework for room impulse response generation that leverages multimodal inputs and advanced machine learning techniques to overcome existing limitations in the field. This work significantly advances the state of the art in audio synthesis and has the potential to impact various practical applications in immersive audio environments.
The methodology presented in PromptReverb is innovative, employing a two-stage generative framework that integrates a variational autoencoder (VAE) for upsampling band-limited RIRs and a conditional diffusion transformer model based on rectified flow matching for generating RIRs from natural language descriptions. The use of a caption-then-rewrite pipeline to generate diverse textual prompts from visual inputs is a notable strength, allowing for a more intuitive and creative interaction with the model. The architectural decoupling of the VAE and the diffusion transformer is a clever design choice that addresses the limitations of existing models effectively.
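As a reference point for the rectified-flow component, a generic conditional flow-matching training step with straight-line (rectified) paths looks like the sketch below; `model` and the conditioning interface are assumptions, not PromptReverb's actual code.

```python
import torch

def flow_matching_step(model, x1, cond, optimizer):
    """One training step of conditional flow matching with linear (rectified) paths.

    x1:   (B, ...) clean RIR latents;  cond: text-conditioning embeddings.
    The model is trained to predict the constant velocity x1 - x0 along the
    straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1.
    """
    x0 = torch.randn_like(x1)                               # noise endpoint
    t = torch.rand(x1.shape[0], *[1] * (x1.dim() - 1), device=x1.device)
    xt = (1 - t) * x0 + t * x1                              # point on the straight path
    v_target = x1 - x0                                      # rectified-flow velocity target
    v_pred = model(xt, t.flatten(), cond)
    loss = torch.mean((v_pred - v_target) ** 2)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```

At inference, the learned velocity field is integrated from noise to a latent RIR, which the VAE decoder then renders at full 48 kHz bandwidth.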
The empirical evaluation is robust, with a comprehensive dataset of 145,976 samples that enhances the model's training and validation. The paper presents both objective metrics (mean RT60 error) and subjective evaluations (human listener ratings), providing a well-rounded assessment of the model's performance. The results indicate significant improvements over existing methods, particularly in perceptual quality and acoustic accuracy, which are critical for practical applications in audio synthesis.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results.
One limitation is the reliance on the quality of the initial band-limited datasets, which could affect the overall performance of the model. Additionally, while the subjective evaluations show improvements over baselines, the small sample size and high variance in listener ratings suggest that further validation with larger participant groups is necessary to substantiate claims of superiority.
The potential applications of PromptReverb are significant, particularly in fields such as virtual reality, architectural acoustics, and audio production. By enabling intuitive natural language control over acoustic properties, the framework could democratize access to high-quality audio synthesis, making it more accessible to non-experts and enhancing creative workflows in various domains.
Room impulse response (RIR) generation remains a critical challenge for creating immersive virtual acoustic environments. Current methods suffer from two fundamental limitations: the scarcity of full-band RIR datasets and the inability of existing models to generate acoustically accurate responses from diverse input modalities. We present PromptReverb, a two-stage generative framework that addresses these challenges. Our approach combines a variational autoencoder that upsamples band-limited RIRs to full-band quality (48 kHz), and a conditional diffusion transformer model based on rectified flow matching that generates RIRs from descriptions in natural language. Empirical evaluation demonstrates that PromptReverb produces RIRs with superior perceptual quality and acoustic accuracy compared to existing methods, achieving 8.8% mean RT60 error compared to -37% for widely used baselines and yielding more realistic room-acoustic parameters. Our method enables practical applications in virtual reality, architectural acoustics, and audio production where flexible, high-quality RIR synthesis is essential.
Primary: unknown
All Institutions: unknown
The paper presents PromptReverb, a two-stage framework for generating full-band room impulse responses from natural language descriptions, significantly advancing the field of audio synthesis. The combination of a VAE for upsampling and a conditional diffusion transformer for generation represents a novel approach that addresses critical limitations in existing methods, with promising implications for practical applications in immersive audio environments.
The proposed methodology in PromptReverb is innovative, combining a variational autoencoder (VAE) for upsampling band-limited room impulse responses (RIRs) and a conditional diffusion transformer model based on rectified flow matching for generating RIRs from natural language descriptions. The two-stage framework effectively addresses the limitations of existing methods, particularly the scarcity of high-quality datasets and the need for intuitive natural language conditioning. The use of a caption-then-rewrite pipeline to generate diverse textual prompts is a notable strength, allowing for a more flexible and user-friendly interface. However, the complexity of the architecture and the reliance on multiple models may pose challenges in terms of implementation and optimization.
The empirical evaluation of PromptReverb is thorough, showcasing significant improvements in perceptual quality and acoustic accuracy over existing methods. The authors provide quantitative metrics, such as the mean RT60 error, and qualitative assessments through subjective listener evaluations. The dataset used for training is extensive and diverse, which enhances the generalizability of the model. However, the paper could benefit from more detailed comparisons with a wider range of baseline methods and a clearer presentation of the experimental results.
While the paper outlines the architecture and training process in detail, it lacks specific implementation details that would facilitate reproducibility. There is no mention of code availability or supplementary materials that could aid other researchers in replicating the results. Providing a GitHub repository or similar would significantly enhance the reproducibility of the research.
One limitation of the study is the potential overfitting to the specific datasets used, particularly given the reliance on existing band-limited RIRs. Additionally, the subjective evaluation with a small number of participants may not fully capture the variability in listener preferences across diverse acoustic environments. The complexity of the model may also limit its accessibility for practical applications in real-time scenarios.
The ability to generate high-quality RIRs from natural language descriptions has significant implications for various fields, including virtual reality, architectural acoustics, and audio production. By enabling more intuitive control over acoustic properties, this research could enhance user experiences in immersive environments and facilitate creative audio design. The approach has the potential to democratize access to advanced audio synthesis techniques, making them more accessible to non-experts.
Automated dysarthria detection and severity assessment from speech have attracted significant research attention due to their potential clinical impact. Despite rapid progress in acoustic modeling and deep learning, models still fall short of human expert performance. This manuscript provides a comprehensive analysis of the reasons behind this gap, emphasizing a conceptual divergence we term the ``perceptual-statistical gap''. We detail human expert perceptual processes, survey machine learning representations and methods, review existing literature on feature sets and modeling strategies, and present a theoretical analysis of limits imposed by label noise and inter-rater variability. We further outline practical strategies to narrow the gap, perceptually motivated features, self-supervised pretraining, ASR-informed objectives, multimodal fusion, human-in-the-loop training, and explainability methods. Finally, we propose experimental protocols and evaluation metrics aligned with clinical goals to guide future research toward clinically reliable and interpretable dysarthria assessment tools.
Primary: & Development Institute Bengaluru
All Institutions: & Development Institute Bengaluru
The main contribution of this paper is a comprehensive analysis of the limitations of current machine learning approaches in dysarthria assessment, highlighting the perceptual-statistical gap and proposing innovative strategies to bridge this divide. The analysis is significant as it addresses a critical need for more interpretable and clinically applicable automated assessment tools in the field of speech pathology.
The paper presents a thorough examination of the perceptual-statistical gap in dysarthria assessment, emphasizing the need for models to incorporate human-like perceptual processes. It proposes several innovative strategies, including perceptually motivated features, self-supervised learning, and human-in-the-loop training, which are well-grounded in existing literature. However, the methodology could benefit from clearer experimental validation of these strategies.
The paper lacks empirical results demonstrating the effectiveness of the proposed methods. While it outlines potential experimental protocols and evaluation metrics, it does not provide concrete experimental data or results to support its claims, which limits the assessment of its technical contributions.
The paper does not include sufficient implementation details or code repositories, which raises concerns about reproducibility. The absence of a clear methodology for replicating the proposed approaches is a significant drawback.
Key limitations include the lack of experimental validation, insufficient detail on how to implement the proposed methods, and the challenge of generalizing findings across diverse datasets. The paper also acknowledges the inherent variability in expert labeling, which complicates model training.
The proposed methodologies have the potential to significantly improve dysarthria assessment tools, making them more clinically relevant and interpretable. By addressing the perceptual-statistical gap, the work could enhance the reliability of automated assessments, ultimately benefiting patients and clinicians alike.
Conventional Convolutional Neural Networks (CNNs) in the real domain have been widely used for audio classification. However, their convolution operations process multi-channel inputs independently, limiting the ability to capture correlations among channels. This can lead to suboptimal feature learning, particularly for complex audio patterns such as multi-channel spectrogram representations. Quaternion Convolutional Neural Networks (QCNNs) address this limitation by employing quaternion algebra to jointly capture inter-channel dependencies, enabling more compact models with fewer learnable parameters while better exploiting the multi-dimensional nature of audio signals. However, QCNNs exhibit higher computational complexity due to the overhead of quaternion operations, resulting in increased inference latency and reduced efficiency compared to conventional CNNs, posing challenges for deployment on resource-constrained platforms. To address this challenge, this study explores knowledge distillation (KD) and pruning to reduce the computational complexity of QCNNs while maintaining performance. Our experiments on audio classification reveal that pruning QCNNs achieves similar or superior performance compared to KD while requiring less computational effort. Compared to conventional CNNs and Transformer-based architectures, pruned QCNNs achieve competitive performance with a reduced learnable parameter count and computational complexity. On the AudioSet dataset, pruned QCNNs reduce computational cost by 50\% and parameter count by 80\%, while maintaining performance comparable to conventional CNNs. Furthermore, pruned QCNNs generalize well across multiple audio classification benchmarks, including GTZAN for music genre recognition, ESC-50 for environmental sound classification, and RAVDESS for speech emotion recognition.
Primary: Indraprastha Institute of Information Technology (IIIT) Delhi
All Institutions: Indraprastha Institute of Information Technology (IIIT) Delhi, Centre for Vision, Speech and Signal Processing, University of Surrey, King's College London
The paper effectively addresses the computational challenges of QCNNs in audio classification by proposing a model compression pipeline that significantly enhances efficiency while maintaining performance. The innovative use of quaternion algebra combined with practical model compression techniques positions this work as a valuable contribution to the field of audio processing and machine learning.
The paper presents a well-structured methodology for compressing Quaternion Convolutional Neural Networks (QCNNs) through a combination of knowledge distillation and filter pruning. The use of quaternion algebra to capture inter-channel dependencies in audio signals is innovative, and the proposed model compression pipeline is clearly articulated. The experiments demonstrate the effectiveness of pruning over knowledge distillation, providing a solid basis for the claims made.
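The parameter savings of QCNNs come from sharing one quaternion weight across four channel groups via the Hamilton product; a minimal quaternion 2-D convolution along those lines is sketched below, with shapes and initialization chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuaternionConv2d(nn.Module):
    """Quaternion convolution: input/output channels are split into four components
    (r, i, j, k) and mixed by the Hamilton product, using roughly a quarter of the
    real-valued parameters of a conventional conv with the same channel counts."""
    def __init__(self, in_ch, out_ch, kernel_size, padding=0):
        super().__init__()
        assert in_ch % 4 == 0 and out_ch % 4 == 0
        shape = (out_ch // 4, in_ch // 4, kernel_size, kernel_size)
        self.r = nn.Parameter(torch.randn(shape) * 0.02)
        self.i = nn.Parameter(torch.randn(shape) * 0.02)
        self.j = nn.Parameter(torch.randn(shape) * 0.02)
        self.k = nn.Parameter(torch.randn(shape) * 0.02)
        self.padding = padding

    def forward(self, x):
        r, i, j, k = self.r, self.i, self.j, self.k
        # Hamilton-product weight matrix: block rows produce the four output components.
        weight = torch.cat([
            torch.cat([r, -i, -j, -k], dim=1),
            torch.cat([i,  r, -k,  j], dim=1),
            torch.cat([j,  k,  r, -i], dim=1),
            torch.cat([k, -j,  i,  r], dim=1),
        ], dim=0)
        return F.conv2d(x, weight, padding=self.padding)
```

Filter pruning then removes whole quaternion filters (all four components together), which is why the reported parameter and FLOP reductions can be so large without architectural changes.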
The experiments are comprehensive, utilizing multiple datasets (AudioSet, GTZAN, ESC-50, RAVDESS) to evaluate the performance of pruned QCNNs. The results indicate a significant reduction in computational cost and parameter count while maintaining competitive performance, which is a strong point of the paper. However, the paper could benefit from more detailed statistical analysis of the results.
The authors provide a GitHub repository for their code, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training configurations, which could hinder full reproducibility.
One limitation is the computational complexity associated with the initial QCNNs before pruning, which may not be suitable for all resource-constrained environments. Additionally, the performance metrics could be expanded to include more nuanced evaluations beyond mean Average Precision (mAP) and accuracy.
The work has significant implications for deploying efficient audio classification models in resource-constrained environments, contributing to the development of sustainable AI technologies. The reduction in energy consumption and carbon emissions is particularly relevant in today's context of environmental sustainability.
We introduce HiFi-HARP, a large-scale dataset of 7th-order Higher-Order Ambisonic Room Impulse Responses (HOA-RIRs) consisting of more than 100,000 RIRs generated via a hybrid acoustic simulation in realistic indoor scenes. HiFi-HARP combines geometrically complex, furnished room models from the 3D-FRONT repository with a hybrid simulation pipeline: a wave-based finite-difference time-domain simulation is used up to 900 Hz, while frequencies above 900 Hz are simulated with a ray-tracing approach. The combined raw RIRs are encoded into the spherical-harmonic domain (AmbiX ACN) for direct auralization. Our dataset extends prior work by providing 7th-order Ambisonic RIRs that combine wave-theoretic accuracy with realistic room content. We detail the generation pipeline (scene and material selection, array design, hybrid simulation, ambisonic encoding) and provide dataset statistics (room volumes, RT60 distributions, absorption properties). A comparison table highlights the novelty of HiFi-HARP relative to existing RIR collections. Finally, we outline potential benchmarks such as FOA-to-HOA upsampling, source localization, and dereverberation. We discuss machine learning use cases (spatial audio rendering, acoustic parameter estimation) and limitations (e.g., simulation approximations, static scenes). Overall, HiFi-HARP offers a rich resource for developing spatial audio and acoustics algorithms in complex environments.
Primary: Leibniz University Hannover
All Institutions: Leibniz University Hannover
The main contribution of this paper is the introduction of HiFi-HARP, a comprehensive dataset of high-fidelity 7th-order Ambisonic Room Impulse Responses, which significantly enhances the resources available for research in spatial audio and acoustics. The combination of advanced simulation techniques and realistic room modeling positions this dataset as a valuable tool for advancing machine learning applications in the field.
The methodology presented in HiFi-HARP is robust and well-structured, utilizing a hybrid simulation approach that combines low-frequency wave-based methods with high-frequency ray tracing. The use of the 3D-FRONT dataset for realistic room geometries and the detailed acoustic modeling through semantic material mapping demonstrates a thoughtful integration of existing resources to create a high-fidelity dataset. The authors provide a clear description of the simulation pipeline, including microphone array design and the hybrid simulation process, which enhances the credibility of their approach. However, while the methodology is innovative, it could benefit from further validation against real-world measurements to address the limitations of the simulation.
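One simple way to merge the wave-based and ray-traced bands is a complementary linear-phase FIR crossover at 900 Hz; the sketch below is a generic illustration of that idea, not the authors' exact merging procedure.

```python
import numpy as np
from scipy.signal import firwin, fftconvolve

def merge_hybrid_rir(rir_fdtd, rir_ray, fs=48000, fc=900.0, taps=1025):
    """Combine a low-band (FDTD) and a high-band (ray-tracing) RIR with a
    complementary linear-phase crossover at fc. Both RIRs are assumed to be
    time-aligned and sampled at the same rate."""
    lp = firwin(taps, fc, fs=fs)                  # linear-phase low-pass FIR
    hp = -lp
    hp[taps // 2] += 1.0                          # spectral complement: delta - low-pass
    n = max(len(rir_fdtd), len(rir_ray))
    low = fftconvolve(np.pad(rir_fdtd, (0, n - len(rir_fdtd))), lp)[:n]
    high = fftconvolve(np.pad(rir_ray, (0, n - len(rir_ray))), hp)[:n]
    return low + high
```

Because both branches pass through filters of identical length and group delay, the summed response preserves relative timing across the crossover, which matters for subsequent spherical-harmonic encoding.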
The experimental evaluation is thorough, showcasing the dataset's utility through two downstream tasks: room acoustic parameter estimation and direction-of-arrival estimation. The results indicate significant improvements in model performance when using the HiFi-HARP dataset, validating its effectiveness for data augmentation and training spatial audio algorithms. The inclusion of empirical results strengthens the paper, demonstrating the practical applications of the dataset in real-world scenarios. However, more extensive comparisons with existing datasets could provide deeper insights into its advantages.
The paper provides sufficient detail regarding the dataset creation process and the simulation methods used, which aids in reproducibility. The authors describe their custom pipeline for optimizing simulation runs, which is a valuable contribution for researchers looking to replicate or build upon their work. However, the lack of specific implementation details regarding the machine learning models used in the downstream tasks may hinder full reproducibility.
The authors acknowledge several limitations, including the deterministic nature of the simulations, which may not capture all real-world acoustic phenomena, and the static nature of the scenes. Additionally, the reliance on semantic labels for material properties may introduce inaccuracies. Future work could address these limitations by incorporating dynamic elements and refining material estimation methods.
HiFi-HARP has significant potential implications for the fields of spatial audio and acoustics, particularly in applications such as virtual reality, augmented reality, and immersive audio rendering. By providing a high-quality dataset that supports the development of advanced algorithms, it can facilitate research in areas like sound field capture, acoustic parameter estimation, and machine learning applications in audio processing. The dataset's availability can also encourage further exploration and innovation in spatial audio technologies.
Vocal recordings on consumer devices commonly suffer from multiple concurrent degradations: noise, reverberation, band-limiting, and clipping. We present Smule Renaissance Small (SRS), a compact single-stage model that performs end-to-end vocal restoration directly in the complex STFT domain. By incorporating phase-aware losses, SRS enables large analysis windows for improved frequency resolution while achieving 10.5x real-time inference on iPhone 12 CPU at 48 kHz. On the DNS 5 Challenge blind set, despite no speech training, SRS outperforms a strong GAN baseline and closely matches a computationally expensive flow-matching system. To enable evaluation under realistic multi-degradation scenarios, we introduce the Extreme Degradation Bench (EDB): 87 singing and speech recordings captured under severe acoustic conditions. On EDB, SRS surpasses all open-source baselines on singing and matches commercial systems, while remaining competitive on speech despite no speech-specific training. We release both SRS and EDB under the MIT License.
Primary: Smule Labs
All Institutions: Smule Labs
The main contribution of this paper is the introduction of Smule Renaissance Small (SRS), a novel single-stage model for efficient vocal restoration that operates directly in the complex STFT domain, achieving competitive performance against more complex systems while being optimized for real-time inference on consumer devices. This work represents a significant advancement in the field of audio processing, particularly for applications requiring robust restoration under challenging acoustic conditions.
The methodology presented in the paper is innovative, leveraging a compact single-stage model that operates directly in the complex STFT domain, which is a departure from traditional two-stage approaches. The incorporation of phase-aware losses and a band-split generator allows for improved frequency resolution and efficiency, making the model suitable for real-time applications on consumer devices. The design choices, such as the general-purpose corruption module and the use of a temporal-convolutional backbone, demonstrate a thoughtful approach to addressing the challenges of multi-degradation vocal restoration.
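A common way to realize a phase-aware objective in the complex STFT domain is to penalize both the complex spectrum and its magnitude; the sketch below shows that generic form, with the window length and loss weights as illustrative assumptions rather than SRS's actual configuration.

```python
import torch

def complex_stft_loss(pred, target, n_fft=2048, hop=480, alpha=0.5):
    """Phase-aware loss: L1 on the complex STFT (which penalizes phase errors
    through the real/imaginary parts) plus L1 on the magnitude. Large analysis
    windows improve frequency resolution at 48 kHz."""
    window = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(pred, n_fft, hop_length=hop, window=window, return_complex=True)
    T = torch.stft(target, n_fft, hop_length=hop, window=window, return_complex=True)
    complex_l1 = (P - T).abs().mean()
    mag_l1 = (P.abs() - T.abs()).abs().mean()
    return alpha * complex_l1 + (1 - alpha) * mag_l1
```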
The experimental evaluation is robust, utilizing both objective metrics and subjective assessments to compare the proposed SRS model against existing systems. The introduction of the Extreme Degradation Bench (EDB) provides a valuable resource for evaluating vocal restoration under realistic conditions. The results indicate that SRS outperforms several strong baselines, which is a significant achievement given its single-stage architecture and lack of speech-specific training.
The authors have committed to releasing both the SRS model and the EDB dataset under the MIT License, which enhances reproducibility. However, the paper could benefit from more detailed implementation specifics, such as hyperparameter settings and training procedures, to facilitate independent replication of results.
One limitation is that while SRS performs well on singing restoration, its performance on speech restoration is comparatively weaker, particularly given that it was not trained specifically on speech data. Additionally, the reliance on consumer devices for inference may limit the model's applicability in more demanding professional settings.
The potential applications of SRS are significant, particularly in the context of mobile audio processing, where real-time vocal restoration can enhance user experiences in various consumer applications, such as music production and podcasting. The release of the EDB dataset also encourages further research in the domain of audio restoration, paving the way for advancements in this area.
Virtual instrument generation requires maintaining consistent timbre across different pitches and velocities, a challenge that existing note-level models struggle to address. We present FlowSynth, which combines distributional flow matching (DFM) with test-time optimization for high-quality instrument synthesis. Unlike standard flow matching that learns deterministic mappings, DFM parameterizes the velocity field as a Gaussian distribution and optimizes via negative log-likelihood, enabling the model to express uncertainty in its predictions. This probabilistic formulation allows principled test-time search: we sample multiple trajectories weighted by model confidence and select outputs that maximize timbre consistency. FlowSynth outperforms the current state-of-the-art TokenSynth baseline in both single-note quality and cross-note consistency. Our approach demonstrates that modeling predictive uncertainty in flow matching, combined with music-specific consistency objectives, provides an effective path to professional-quality virtual instruments suitable for real-time performance.
Primary: unknown
All Institutions: unknown
FlowSynth introduces a novel framework for virtual instrument generation that leverages Distributional Flow Matching to achieve superior timbre consistency and audio quality. The combination of probabilistic modeling and test-time optimization represents a significant advancement in the field, with the potential to impact both academic research and practical applications in music production.
The methodology presented in FlowSynth is innovative, particularly in its use of Distributional Flow Matching (DFM) to model velocity fields as probabilistic distributions rather than deterministic mappings. This approach allows for the expression of uncertainty in predictions, which is crucial for generating consistent timbres across different pitches and velocities. The combination of DFM with a test-time optimization framework that employs confidence-weighted sampling is a notable advancement. However, while the methodology is well-justified and theoretically sound, the paper could benefit from a more detailed discussion of the computational complexity introduced by the probabilistic approach and how it compares to traditional methods.
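The distributional twist on flow matching described above can be written as a Gaussian negative log-likelihood over the velocity target; the sketch below assumes the model emits a per-dimension mean and log-variance, which is one plausible reading of DFM rather than the authors' code.

```python
import torch

def dfm_nll_loss(mu, log_var, x0, x1):
    """Gaussian NLL of the rectified-flow velocity target v = x1 - x0 under the
    predicted distribution N(mu, exp(log_var)). A confidently wrong mean is
    penalized, so exp(-log_var) behaves like a per-dimension confidence weight."""
    v_target = x1 - x0
    nll = 0.5 * (log_var + (v_target - mu) ** 2 * torch.exp(-log_var))
    return nll.mean()

# At test time, several trajectories can be sampled from N(mu, exp(log_var)) and
# re-ranked by model confidence (e.g. low total predicted variance) together with
# a timbre-consistency score across notes.
```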
The experimental evaluation is robust, utilizing the NSynth dataset to train and test the model. The authors provide comprehensive metrics to assess audio quality, pitch accuracy, prompt adherence, and timbre consistency, which are critical for evaluating the performance of generative audio models. The results demonstrate that FlowSynth outperforms the baseline TokenSynth across various metrics, particularly in timbre consistency and audio quality. However, the paper could enhance its credibility by including more extensive qualitative evaluations, such as user studies or expert reviews of the generated audio.
The paper outlines the training and inference configurations in detail, including the optimizer settings, batch sizes, and the architecture of the model. However, the lack of a publicly available code repository or demo URL limits reproducibility. Providing access to the model and training scripts would significantly enhance the ability for other researchers to replicate the findings and build upon this work.
One limitation of the study is the reliance on a single dataset (NSynth), which may not fully capture the diversity of timbres and instruments in real-world applications. Additionally, while the probabilistic approach allows for uncertainty modeling, it may introduce additional computational overhead, which could be a barrier for real-time applications. The paper does not address how the model performs with instruments outside the piano family, which could limit its applicability.
The implications of this research are significant for the field of music production and AI-generated audio. By improving timbre consistency in virtual instruments, FlowSynth could enhance the quality of AI-generated music, making it more viable for professional use. This advancement could democratize music production, allowing more creators access to high-quality instrument synthesis without the need for expensive hardware or extensive sample libraries. Furthermore, the techniques developed could inspire future research in generative modeling across various domains beyond audio.
This paper presents PhoenixCodec, a comprehensive neural speech coding and decoding framework designed for extremely low-resource conditions. The proposed system integrates an optimized asymmetric frequency-time architecture, a Cyclical Calibration and Refinement (CCR) training strategy, and a noise-invariant fine-tuning procedure. Under stringent constraints - computation below 700 MFLOPs, latency less than 30 ms, and dual-rate support at 1 kbps and 6 kbps - existing methods face a trade-off between efficiency and quality. PhoenixCodec addresses these challenges by alleviating the resource scattering of conventional decoders, employing CCR to escape local optima, and enhancing robustness through noisy-sample fine-tuning. In the LRAC 2025 Challenge Track 1, the proposed system ranked third overall and demonstrated the best performance at 1 kbps in both real-world noise and reverberation and intelligibility in clean tests, confirming its effectiveness.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PhoenixCodec, a neural speech coding framework that effectively balances efficiency and quality under extreme low-resource conditions. This work is significant as it addresses a critical gap in the field of speech processing, providing a robust solution for real-world applications where resources are limited.
The methodology presented in PhoenixCodec is innovative, particularly the integration of the asymmetric frequency-time architecture and the Cyclical Calibration and Refinement (CCR) training strategy. The approach of addressing resource constraints while maintaining speech quality is commendable. The noise-invariant fine-tuning procedure adds robustness to the model, which is crucial for real-world applications. However, the paper could benefit from a more detailed explanation of the CCR strategy and its implementation, as well as comparisons with baseline methods.
The experimental evaluation is thorough, showcasing the performance of PhoenixCodec in the LRAC 2025 Challenge. The results indicate that the system excels in low-bitrate scenarios, particularly at 1 kbps, where it outperforms existing methods in both real-world noise and reverberation conditions. The metrics used for evaluation are appropriate, but additional comparisons with a broader range of existing models could strengthen the findings.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. Key aspects of the training process, including hyperparameters and dataset specifics, are not clearly outlined. Providing a code repository or supplementary materials would significantly enhance reproducibility.
One limitation is the focus on extremely low-resource scenarios, which may not generalize well to other applications requiring higher fidelity. Additionally, while the performance at 1 kbps is impressive, the trade-offs at higher bitrates are not discussed in detail. The paper could also explore the model's performance across various languages and dialects to assess its robustness further.
The potential applications of PhoenixCodec are significant, particularly in telecommunications, assistive technologies, and any domain where low-bandwidth communication is essential. The ability to maintain intelligibility in challenging conditions could improve accessibility for users in resource-constrained environments.
Existing pitch curve generators face two main challenges. First, they often neglect singer-specific expressiveness, limiting their ability to capture individual singing styles. Second, they are typically developed as auxiliary modules for specific tasks such as pitch correction, singing voice synthesis, or voice conversion, which restricts their generalization capability. We propose StylePitcher, a general-purpose pitch curve generator that learns singer style from reference audio while preserving alignment with the intended melody. Built upon a rectified flow matching architecture, StylePitcher flexibly incorporates symbolic music scores and pitch context as conditions for generation, and can seamlessly adapt to diverse singing tasks without retraining. Objective and subjective evaluations across various singing tasks demonstrate that StylePitcher improves style similarity and audio quality while maintaining pitch accuracy comparable to task-specific baselines.
Primary: unknown
All Institutions: unknown
StylePitcher represents a significant advancement in pitch curve generation for singing tasks, combining innovative methodologies with robust experimental validation to address the challenges of style expressiveness and generalization across diverse applications.
The methodology presented in StylePitcher is innovative, employing a rectified flow matching architecture to generate pitch curves that are both style-following and expressive. The formulation of pitch curve generation as a masked infilling problem is particularly noteworthy, allowing for implicit style modeling without explicit singer labels. The use of symbolic music scores and pitch context as conditions for generation enhances the model's flexibility across various singing tasks. The introduction of a smoothing algorithm for data annotation is also a significant contribution, addressing the challenge of manual annotations in pitch generation tasks.
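To make the rectified flow matching formulation concrete, the sketch below shows one training step of the generic kind described above: a straight-line interpolation between noise and a target pitch-curve segment, with the network regressing the constant velocity along that path. The network, feature dimensions, and conditioning inputs are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy conditional velocity predictor; stands in for the real pitch-generation backbone."""
    def __init__(self, dim=128, cond_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def rectified_flow_step(model, x1, cond):
    """One training step: x1 is a target pitch-curve segment, cond holds score/style features."""
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.size(0), 1)             # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1               # straight-line interpolation
    v_target = x1 - x0                        # constant velocity along the straight path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

model = VelocityNet()
loss = rectified_flow_step(model, x1=torch.randn(8, 128), cond=torch.randn(8, 64))
loss.backward()
```

In the masked-infilling setup described above, the conditioning vector would presumably carry the symbolic score, the pitch context, and the unmasked portions of the curve.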
The experiments are well-structured, utilizing two multi-speaker singing datasets totaling 1916 hours of singing voice. The evaluation metrics include both objective measures (e.g., pitch alignment and similarity) and subjective assessments (mean opinion scores for pitch, style, and quality). The results indicate that StylePitcher outperforms existing baselines in terms of style similarity and audio quality while maintaining comparable pitch accuracy. The comprehensive evaluation across different singing tasks demonstrates the model's versatility and effectiveness.
The paper provides sufficient details regarding the model architecture, training process, and evaluation metrics, which would allow for reproducibility. However, the lack of a publicly available code repository may hinder broader reproducibility efforts. The authors do mention the use of specific datasets and preprocessing techniques, which are essential for replicating the results.
One notable limitation is the occasional production of unnatural results when applying expressive techniques without content awareness. This suggests that while the model excels in style capture, it may struggle with maintaining musical coherence in some scenarios. Additionally, the reliance on subjective evaluations may introduce variability in the assessment of performance.
The potential applications of StylePitcher are significant, as it enables expressive singing applications across various domains, including music production, entertainment, and education. By allowing for style preservation and transfer in singing tasks, it opens avenues for personalized music experiences and could enhance tools for musicians and vocalists. The plug-and-play nature of the model also suggests that it could be integrated into existing systems without extensive retraining, making it accessible for broader use.
Neural Audio Codecs (NACs) have gained growing attention in recent years as technologies for audio compression and audio representation in speech language models. While mainstream NACs typically require giga-level computation (GFLOPs) and million-level parameter counts, the performance of lightweight and streaming NACs remains underexplored. This paper proposes SpecTokenizer, a lightweight streaming codec that operates in the compressed spectral domain. Composed solely of alternating CNN and RNN layers, SpecTokenizer achieves greater efficiency and better representational capability through multi-scale modeling in the compressed spectrum domain. At 4 kbps, SpecTokenizer achieves comparable or superior performance to a codec with a state-of-the-art lightweight architecture while requiring only 20% of the computation and 10% of the parameters, and it significantly outperforms that codec when given similar computational and storage budgets.
Primary: Anker Innovations
All Institutions: Anker Innovations
The main contribution of this paper is the introduction of SpecTokenizer, a lightweight streaming audio codec that significantly reduces computational requirements while maintaining or improving performance compared to existing architectures. This work represents a meaningful advancement in the field of audio processing, particularly in the context of neural audio codecs, and addresses a critical gap in the literature regarding lightweight and efficient models.
The methodology employed in SpecTokenizer is innovative, utilizing a combination of CNN and RNN layers to create a lightweight streaming audio codec. The focus on multi-scale modeling in the compressed spectral domain is a noteworthy contribution, as it addresses the computational efficiency challenges faced by existing neural audio codecs. The paper provides a clear description of the architecture and the rationale behind the design choices, which enhances the understanding of how the proposed approach improves upon existing methods.
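As a rough illustration of an alternating CNN/RNN streaming encoder of the kind described, the sketch below stacks causal 1-D convolutions and GRUs over spectral frames so that each frame's latent depends only on current and past input. Channel widths, layer counts, and the input band dimension are assumptions; the actual SpecTokenizer architecture and its multi-scale design are not reproduced here.

```python
import torch
import torch.nn as nn

class TinySpecEncoder(nn.Module):
    """Illustrative alternating CNN/RNN stack over spectral frames (batch, time, bands)."""
    def __init__(self, n_bands=32, hidden=64):
        super().__init__()
        self.conv1 = nn.Conv1d(n_bands, hidden, kernel_size=3, padding=2)  # causal via left pad + trim
        self.rnn1 = nn.GRU(hidden, hidden, batch_first=True)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=2)
        self.rnn2 = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, x):
        # x: (batch, time, bands) -> Conv1d expects (batch, channels, time)
        h = self.conv1(x.transpose(1, 2))[..., :x.size(1)].transpose(1, 2)
        h, _ = self.rnn1(h)
        h = self.conv2(h.transpose(1, 2))[..., :x.size(1)].transpose(1, 2)
        h, _ = self.rnn2(h)
        return h  # (batch, time, hidden): one latent per frame, processable left-to-right

latents = TinySpecEncoder()(torch.randn(2, 100, 32))
print(latents.shape)
```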
The experimental evaluation is robust, with comprehensive comparisons against state-of-the-art lightweight architectures. The results demonstrate that SpecTokenizer achieves superior performance at a significantly reduced computational cost. The use of 4 kbps as a benchmark is relevant for practical applications, and the paper includes sufficient details about the datasets and metrics used for evaluation, allowing for a clear assessment of the codec's performance.
While the paper outlines the architecture and experimental setup, it lacks specific implementation details that would facilitate reproducibility. There are no links to code repositories or demo pages, which is a significant drawback for researchers looking to replicate the results or build upon this work. Providing access to the implementation would greatly enhance the reproducibility of the findings.
One limitation of the study is the lack of a detailed exploration of the trade-offs between compression efficiency and audio quality at different bit rates. Additionally, the paper does not address potential challenges in real-world applications, such as latency and adaptability to various audio types. The focus on a single bitrate (4 kbps) may limit the generalizability of the findings.
The implications of SpecTokenizer are significant for the field of audio processing, particularly in applications where bandwidth is limited, such as mobile communications and streaming services. The lightweight nature of the codec makes it suitable for deployment in resource-constrained environments, potentially leading to broader adoption of neural audio codecs in practical applications.
The rapid advancement of next-token-prediction models has led to widespread adoption across modalities, enabling the creation of realistic synthetic media. In the audio domain, while autoregressive speech models have propelled conversational interactions forward, the potential for misuse, such as impersonation in phishing schemes or crafting misleading speech recordings, has also increased. Security measures such as watermarking have thus become essential to ensuring the authenticity of digital media. Traditional statistical watermarking methods used for autoregressive language models face challenges when applied to autoregressive audio models, due to the inevitable ``retokenization mismatch'': the discrepancy between original and retokenized discrete audio token sequences. To address this, we introduce Aligned-IS, a novel, distortion-free watermark, specifically crafted for audio generation models. This technique utilizes a clustering approach that treats tokens within the same cluster equivalently, effectively countering the retokenization mismatch issue. Our comprehensive testing on prevalent audio generation platforms demonstrates that Aligned-IS not only preserves the quality of generated audio but also significantly improves the watermark detectability compared to the state-of-the-art distortion-free watermarking adaptations, establishing a new benchmark in secure audio technology applications.
Primary: unknown
All Institutions: unknown
The paper presents Aligned-IS, a novel distortion-free watermarking framework for autoregressive audio generation models, significantly improving watermark detectability while preserving audio quality. The methodology is innovative and well-supported by empirical evidence, making a meaningful contribution to the field of secure audio technology.
The proposed methodology introduces a novel watermarking technique called Aligned-IS, which effectively addresses the retokenization mismatch issue inherent in autoregressive audio generation models. By employing a clustering approach, the method treats tokens within the same cluster as equivalent, enhancing watermark detectability without compromising audio quality. The use of aligned inverse sampling as a distortion-free watermarking strategy is innovative and well-justified, providing a significant advancement over existing methods. The clustering strategy is empirically validated, demonstrating its effectiveness in reducing token mismatch rates.
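The cluster-equivalence idea can be illustrated in isolation, separately from the aligned inverse sampling procedure itself: codebook embeddings are grouped so that a token replaced by an acoustically similar one during re-encoding still maps to the same cluster, and detection statistics are computed at the cluster level. The sketch below is a toy version under that reading; the codebook, embedding dimension, cluster count, and match statistic are all invented for illustration and do not reproduce the paper's watermark.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy codebook: 1024 audio tokens with 128-dim embeddings (illustrative sizes).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 128))

# Group acoustically similar tokens; tokens in one cluster are treated as equivalent.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(codebook)
token_to_cluster = kmeans.labels_            # token id -> cluster id

def cluster_match_rate(original_tokens, retokenized_tokens):
    """Fraction of positions whose tokens agree at the cluster level.
    Exact token ids may differ after re-encoding, but equivalent tokens still match."""
    orig = token_to_cluster[np.asarray(original_tokens)]
    re = token_to_cluster[np.asarray(retokenized_tokens)]
    return float((orig == re).mean())

orig = rng.integers(0, 1024, size=50)
retok = orig.copy()
retok[::7] = rng.integers(0, 1024, size=retok[::7].size)   # simulate retokenization mismatches
print(cluster_match_rate(orig, retok))
```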
The experimental evaluation is comprehensive, utilizing various audio generation models and datasets to assess the performance of Aligned-IS against state-of-the-art watermarking techniques. The results show a marked improvement in detectability and robustness, with rigorous statistical analysis supporting the findings. The experiments also include a thorough evaluation of audio quality, ensuring that the watermarking does not degrade the generated audio. However, the paper lacks detailed information on the specific datasets used and their characteristics.
The paper provides a GitHub repository for the code, which is a positive aspect for reproducibility. However, the experimental details could be more explicit, particularly regarding hyperparameters and the specific configurations used for each model. The inclusion of a clear description of the experimental setup would enhance reproducibility further.
The paper acknowledges that the clustering method may not capture all forms of retokenization errors, particularly with new audio patterns or speech artifacts. Additionally, the necessity for clustering with each new model introduces computational overhead. These limitations could affect the generalizability of the approach across different audio generation models.
The proposed watermarking framework has significant implications for the security and authenticity of AI-generated audio, addressing concerns related to misinformation and digital rights management. By enhancing the traceability of synthetic audio, it contributes to the broader discourse on responsible AI use. However, the potential for misuse in creating deceptive audio content remains a concern, highlighting the need for ongoing research in watermarking technologies.
Recent speech-to-speech (S2S) models generate intelligible speech but still lack natural expressiveness, largely due to the absence of a reliable evaluation metric. Existing approaches, such as subjective MOS ratings, low-level acoustic features, and emotion recognition, are costly, limited, or incomplete. To address this, we present DeEAR (Decoding the Expressive Preference of eAR), a framework that converts human preference for speech expressiveness into an objective score. Grounded in phonetics and psychology, DeEAR evaluates speech across three dimensions: Emotion, Prosody, and Spontaneity, achieving strong alignment with human perception (Spearman's Rank Correlation Coefficient, SRCC = 0.86) using fewer than 500 annotated samples. Beyond reliable scoring, DeEAR enables fair benchmarking and targeted data curation. It not only distinguishes expressiveness gaps across S2S models but also selects 14K expressive utterances to form ExpressiveSpeech, which improves the expressive score of S2S models from 2.0 to 23.4 on a 100-point scale. Demos and code are available at https://github.com/FreedomIntelligence/ExpressiveSpeech
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of DeEAR, a novel framework for objectively measuring speech expressiveness, which significantly enhances the evaluation and development of speech synthesis models. This work represents a meaningful advancement in the field, addressing a critical gap in the evaluation of speech quality and expressiveness, and offers a scalable solution for improving conversational AI systems.
The methodology presented in this paper is robust and innovative, leveraging principles from phonetics and psychology to create a comprehensive framework for evaluating speech expressiveness. The decomposition of expressiveness into three dimensions (Emotion, Prosody, and Spontaneity) is well-justified and allows for a nuanced understanding of speech quality. The use of specialized models for each dimension, along with a fusion function to combine scores, demonstrates a thoughtful approach to addressing the complexity of human expressiveness. The framework's reliance on a limited dataset for training while achieving high correlation with human perception is particularly commendable, showcasing efficiency in data usage.
The experiments conducted validate the effectiveness of the DeEAR framework. The strong alignment with human perception, evidenced by high Spearman's and Pearson correlation coefficients, supports the reliability of the proposed metric. Additionally, the application of DeEAR for automated benchmarking of S2S models and data curation is well-executed, with clear improvements in expressiveness scores after fine-tuning with the curated dataset. The use of diverse audio samples for testing enhances the generalizability of the results, although more extensive testing across varied domains could strengthen the findings.
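For reference, the correlation measures cited above are straightforward to compute; the sketch below evaluates rank (SRCC) and linear (PCC) agreement between hypothetical model scores and listener ratings, with all numbers made up for illustration.

```python
from scipy.stats import spearmanr, pearsonr

# Illustrative alignment check: predicted expressiveness scores vs. human ratings (made-up values).
predicted = [23.4, 41.0, 12.7, 65.2, 48.9, 30.1]   # model scores on a 100-point scale
human_mos = [2.1, 3.0, 1.8, 4.4, 3.6, 2.7]          # corresponding listener ratings

srcc, _ = spearmanr(predicted, human_mos)   # rank agreement, as reported in the paper (SRCC)
pcc, _ = pearsonr(predicted, human_mos)     # linear agreement (PCC)
print(f"SRCC={srcc:.2f}  PCC={pcc:.2f}")
```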
The paper provides sufficient detail regarding the methodology and experiments, including the datasets used and the model architectures. The availability of code and demos on GitHub enhances reproducibility, allowing other researchers to validate and build upon the work. However, the paper could benefit from clearer descriptions of hyperparameters and training protocols to facilitate easier replication of results.
One limitation is the reliance on a relatively small annotated dataset for training, which may affect the generalizability of the model across different languages and dialects. Additionally, while the framework is designed to be efficient, the complexity of the model may pose challenges in real-time applications. The subjective nature of expressiveness also means that the metric may not capture all nuances of human perception, particularly in diverse cultural contexts.
The implications of this research are significant for the fields of speech synthesis and conversational AI. By providing a reliable metric for expressiveness, DeEAR can enhance the development of more engaging and human-like speech systems, with applications in voice assistants, gaming, and mental health support. The creation of the ExpressiveSpeech dataset also contributes to the availability of high-quality resources for training expressive models, potentially advancing research in this area.
Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework utilizes the mixture and separated tracks to jointly predict audio quality, through the scale-invariant signal-to-noise ratio (SI-SNR) metric, and speech intelligibility, through the word error rate (WER) metric. Experiments on the WHAMR! dataset show WER estimation with a mean absolute error (MAE) of 17% and a Pearson correlation coefficient (PCC) of 0.77, and SI-SNR estimation with an MAE of 1.38 and a PCC of 0.95. We further demonstrate the robustness of our estimator by using various SSL representations.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Israel Institute of Technology, University of Haifa
The main contribution of this paper is the introduction of a reference-free evaluation framework for speech separation that effectively predicts audio quality and intelligibility metrics using self-supervised learning. This work significantly advances the field by addressing the limitations of traditional evaluation methods and providing a robust solution for real-world applications.
The proposed methodology introduces a novel reference-free evaluation framework for speech separation that leverages self-supervised learning (SSL) representations. The architecture is designed to predict joint audio quality and intelligibility metrics (SI-SNR and WER) without relying on matched references, which is a significant advancement in the field. The use of multiple speech separation systems to create a diverse training dataset enhances the robustness of the approach. The model architecture, which includes a transformer encoder and multi-output regression, is well-structured and addresses the limitations of existing methods.
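Since SI-SNR is one of the two regression targets, it is worth recalling that the metric itself requires a clean reference, which is exactly what the proposed estimator avoids needing at inference time. A standard implementation of the metric is sketched below; the test signal is synthetic.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio in dB (the regression target discussed above)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to remove scale differences.
    s_target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    e_noise = estimate - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))

t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 220 * t)
noisy_estimate = clean + 0.05 * np.random.randn(t.size)
print(f"SI-SNR: {si_snr(noisy_estimate, clean):.1f} dB")
```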
The experiments conducted on the WHAMR! dataset provide a solid basis for evaluating the proposed framework. The reported results, including MAE and PCC for both WER and SI-SNR estimations, demonstrate that the model outperforms existing baselines. The use of various SSL representations and the evaluation on a real-world dataset (REAL-M) further validate the effectiveness of the approach. However, the paper could benefit from additional comparisons with more baseline methods to strengthen its claims.
The paper outlines the training setup, including the architecture, loss functions, and data processing pipeline, which is crucial for reproducibility. However, the lack of a publicly available code repository or demo limits the ability for others to replicate the results fully. Providing access to the trained models and datasets would enhance reproducibility.
The paper acknowledges several limitations, including the potential lack of generalization due to the use of the same dataset for training and testing. The uniform distribution of scores in the training data may not align well with out-of-domain data, which could affect performance. Additionally, the reliance on large GPUs for inference poses practical challenges for deployment.
The proposed framework has significant implications for real-world applications in speech processing, particularly in automatic speech recognition and other related fields. By enabling reference-free evaluation, it opens up new avenues for assessing speech separation systems in diverse environments, potentially improving the robustness and usability of these technologies in everyday applications.
Achieving immersive auditory experiences in virtual environments requires flexible sound modeling that supports dynamic source positions. In this paper, we introduce a task called resounding, which aims to estimate room impulse responses at arbitrary emitter locations from a sparse set of measured emitter positions, analogous to the relighting problem in vision. We leverage the reciprocity property and introduce Versa, a physics-inspired approach to facilitating acoustic field learning. Our method creates physically valid samples with dense virtual emitter positions by exchanging emitter and listener poses. We also identify challenges in deploying reciprocity due to emitter/listener gain patterns and propose a self-supervised learning approach to address them. Results show that Versa substantially improves the performance of acoustic field learning on both simulated and real-world datasets across different metrics. Perceptual user studies show that Versa can greatly improve the immersive spatial sound experience. Code, dataset and demo videos are available on the project website: https://waves.seas.upenn.edu/projects/versa.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania
The main contribution of this paper is the introduction of Versa, a physics-inspired method for acoustic field learning that utilizes reciprocity to enhance immersive auditory experiences in virtual environments. This work represents a significant advancement in the field of audio machine learning, combining theoretical rigor with practical applications and addressing key challenges in sound modeling.
The paper introduces a novel approach called Versa that leverages the reciprocity property in acoustics to estimate room impulse responses from sparse emitter positions. The methodology is well-founded in physics, providing a solid theoretical basis for the proposed techniques. The use of self-supervised learning to tackle emitter/listener gain challenges is particularly innovative, as it addresses practical issues in deploying the reciprocity principle in real-world scenarios. The methodology is clearly articulated, with a logical progression from problem statement to solution.
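The reciprocity-based augmentation can be pictured as a simple dataset transformation: each measured (emitter, listener) pair also yields a virtual sample with the two poses exchanged. The sketch below shows only this swap; the gain-pattern correction that the paper learns in a self-supervised way is omitted, and the data structure is purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class RIRSample:
    emitter_pos: tuple    # (x, y, z) of the sound source
    listener_pos: tuple   # (x, y, z) of the receiver
    rir: list             # measured room impulse response

def add_reciprocal_samples(measured):
    """Double the training set by swapping emitter and listener poses.
    Under ideal acoustic reciprocity the impulse response is unchanged; the paper
    additionally compensates for emitter/listener gain patterns, which is omitted here."""
    augmented = list(measured)
    for s in measured:
        augmented.append(RIRSample(emitter_pos=s.listener_pos,
                                   listener_pos=s.emitter_pos,
                                   rir=s.rir))
    return augmented

data = [RIRSample((0.0, 0.0, 1.5), (3.0, 2.0, 1.5), [0.0, 0.8, 0.3])]
print(len(add_reciprocal_samples(data)))  # 2
```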
The experiments are comprehensive, utilizing both simulated and real-world datasets to validate the effectiveness of the proposed method. The metrics used for evaluation are appropriate and demonstrate significant improvements over baseline methods. The perceptual user studies add a valuable dimension to the evaluation, providing insights into the subjective quality of the immersive sound experience. However, more details on the datasets and experimental setup could enhance the clarity of the results.
The authors provide a project website with code, datasets, and demo videos, which is a positive aspect for reproducibility. However, the paper could benefit from a more detailed description of the implementation specifics and hyperparameter settings used in the experiments to facilitate easier replication by other researchers.
One limitation noted is the potential complexity of the self-supervised learning approach, which may require careful tuning and may not generalize well across all acoustic environments. Additionally, while the results are promising, the paper does not extensively discuss the scalability of the method to larger or more complex environments.
The proposed method has significant implications for various applications in virtual reality, gaming, and immersive audio experiences. By improving the accuracy of sound modeling in dynamic environments, this research could enhance user experiences in entertainment, education, and training simulations. The focus on self-supervised learning also aligns with current trends in machine learning, potentially influencing future research directions in audio processing.
In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and expressive SVC framework. First, we introduce simulation-based robustness enhancement through random fundamental frequency ($F_0$) perturbations and music separation artifact simulations (e.g., reverberation, echo), substantially improving performance under noisy conditions. Second, we enrich speaker representation using domain-specific singing data: alongside clean vocals, we incorporate DNSMOS-filtered separated vocals and public singing corpora, enabling the model to preserve speaker timbre while capturing singing style nuances. Third, we integrate the Neural Source-Filter (NSF) model to explicitly represent harmonic and noise components, enhancing the naturalness and controllability of converted singing. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
Primary: AI Lab
All Institutions: AI Lab
The main contribution of this paper is the R2-SVC framework, which effectively addresses the challenges of real-world singing voice conversion by enhancing robustness and expressiveness through innovative methodologies. The comprehensive analysis of technical contributions, methodology, and significance to the field underscores the potential impact of this work on advancing audio processing technologies.
The proposed R2-SVC framework introduces a multi-faceted approach to singing voice conversion by integrating simulation-based robustness enhancements, a singing-informed timbre and style extractor, and a Neural Source-Filter model. The use of random $F_0$ perturbations and wet sound simulations effectively addresses the challenges of real-world noise and reverberation, which are often overlooked in traditional SVC methods. The methodology is well-structured and demonstrates a clear progression from problem identification to solution implementation, with a strong emphasis on robustness and expressiveness.
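As an illustration of the simplest of these augmentations, the sketch below applies a random semitone-scaled perturbation to an F0 contour while leaving unvoiced frames untouched. The perturbation range and sampling scheme are assumptions, not the paper's exact settings, and the reverberation/echo simulations are not shown.

```python
import numpy as np

def perturb_f0(f0_hz, max_semitones=1.0, rng=np.random.default_rng()):
    """Randomly shift an F0 contour by up to +/- max_semitones (range is an assumption).
    Voiced frames are scaled multiplicatively; unvoiced frames (F0 == 0) are left untouched."""
    shift = rng.uniform(-max_semitones, max_semitones)
    factor = 2.0 ** (shift / 12.0)          # semitone shift expressed as a frequency ratio
    f0_hz = np.asarray(f0_hz, dtype=float)
    return np.where(f0_hz > 0, f0_hz * factor, 0.0)

f0 = np.array([0.0, 220.0, 223.1, 225.0, 0.0])
print(perturb_f0(f0))
```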
The experiments are comprehensive, utilizing both objective and subjective evaluation metrics across multiple datasets. The inclusion of a hard test set that mimics real-world conditions adds significant value to the evaluation, showcasing the model's performance under challenging scenarios. The results indicate that R2-SVC outperforms existing methods, with detailed ablation studies highlighting the contributions of each component of the framework. However, the paper could benefit from more extensive comparisons with additional state-of-the-art methods.
The paper provides sufficient implementation details, including model architecture, training parameters, and data augmentation strategies, which facilitate reproducibility. The use of open-source components like Seed-VC is a positive aspect, although the lack of a public code repository limits the ease of replication for other researchers.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of singing styles and conditions encountered in real-world applications. Additionally, while the model shows improvements in robustness and naturalness, the paper does not address potential computational efficiency or real-time application challenges.
The advancements in robust and expressive singing voice conversion have significant implications for various applications, including music production, dubbing, and voice synthesis technologies. The ability to handle noisy environments enhances the practicality of SVC systems, potentially leading to broader adoption in commercial and artistic contexts.
Speech codecs serve as bridges between continuous speech signals and large language models, yet face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically-capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances the semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically-supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach. Code is available at https://github.com/ZhangXinWhut/SimWhisper-Codec.
Primary: The Hong Kong Polytechnic University
All Institutions: NEC Corporation, The Hong Kong Polytechnic University, Wuhan University of Technology
The paper presents SimWhisper-Codec, a novel low-bitrate speech codec that effectively balances semantic preservation and acoustic fidelity through architectural simplification of the Whisper model. The contributions are significant, addressing a critical challenge in speech coding and demonstrating strong empirical results that could influence future research in the field.
The paper introduces SimWhisper-Codec, which innovatively simplifies the Whisper architecture to balance semantic and acoustic fidelity without external supervision. The targeted architectural simplifications, such as the removal of GELU activations and absolute positional encodings, are well-justified and empirically validated through rigorous analysis. The methodology is sound, leveraging existing models in a novel way to address a significant challenge in speech coding.
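The semantic-first starting point, a frozen Whisper encoder used as a feature extractor, can be sketched with the Hugging Face transformers API as below, using a standard public checkpoint. This shows only the freezing step and none of the paper's architectural simplifications (e.g., the removal of GELU activations or absolute positional encodings), nor the quantization and decoding stages.

```python
import torch
from transformers import WhisperModel

# Load a standard Whisper checkpoint and keep only its encoder, frozen.
whisper = WhisperModel.from_pretrained("openai/whisper-base")
encoder = whisper.get_encoder()
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

# Whisper expects 80-bin log-Mel features padded to 30 s (3000 frames).
mel = torch.randn(1, 80, 3000)
with torch.no_grad():
    hidden = encoder(mel).last_hidden_state   # (1, 1500, 512) for whisper-base
print(hidden.shape)
```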
The experiments are comprehensive, utilizing a well-known dataset (LibriSpeech) and a variety of metrics to evaluate both semantic preservation and acoustic quality. The results are compelling, showing that SimWhisper-Codec outperforms existing methods in key areas, which strengthens the claims made in the paper. However, the lack of extensive comparisons with a wider range of state-of-the-art codecs could be seen as a limitation.
The paper provides sufficient implementation details, including architecture specifications and training procedures, which enhance reproducibility. The availability of the code on GitHub further supports this aspect. However, the paper could benefit from including more detailed hyperparameter settings and training configurations.
While the paper presents a strong case for the proposed codec, it does not extensively explore the limitations of the approach, such as potential trade-offs in performance at even lower bitrates or the impact of the simplified architecture on other speech tasks. Additionally, the reliance on a single dataset for evaluation may limit the generalizability of the findings.
The proposed codec has significant implications for low-bitrate speech applications, particularly in scenarios where bandwidth is limited, such as mobile communications or real-time translation systems. The approach could inspire further research into semantic-first methodologies in speech processing, potentially leading to advancements in related areas like voice synthesis and recognition.
Electrophysiological (ExG) signals offer valuable insights into human physiology, yet building foundation models that generalize across everyday tasks remains challenging due to two key limitations: (i) insufficient data diversity, as most ExG recordings are collected in controlled labs with bulky, expensive devices; and (ii) task-specific model designs that require tailored processing (i.e., targeted frequency filters) and architectures, which limit generalization across tasks. To address these challenges, we introduce an approach for scalable, task-agnostic ExG monitoring in the wild. We collected 50 hours of unobtrusive free-living ExG data with an earphone-based hardware prototype to narrow the data diversity gap. At the core of our approach is Physiology-informed Multi-band Tokenization (PiMT), which decomposes ExG signals into 12 physiology-informed tokens, followed by a reconstruction task to learn robust representations. This enables adaptive feature recognition across the full frequency spectrum while capturing task-relevant information. Experiments on our new DailySense dataset (the first to enable ExG-based analysis across five human senses), together with four public ExG benchmarks, demonstrate that PiMT consistently outperforms state-of-the-art methods across diverse tasks.
Primary: Microsoft Research
All Institutions: Microsoft Research
The main contribution of this paper is the introduction of a novel approach for scalable, task-agnostic ExG monitoring using physiology-informed tokenization, which significantly enhances the generalization of ExG models across diverse tasks. The research addresses critical limitations in existing methodologies and presents a promising direction for future work in the field of machine learning and physiological signal processing.
The proposed methodology, Physiology-informed Multi-band Tokenization (PiMT), represents a significant advancement in processing ExG signals. By decomposing signals into physiology-informed tokens, the approach allows for adaptive feature recognition across a wide frequency spectrum, which is crucial for capturing task-relevant information. The methodology is well-justified, addressing the limitations of previous task-specific models and enhancing generalization capabilities. However, the paper could benefit from a more detailed explanation of the tokenization process and its physiological basis.
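One way to picture physiology-informed multi-band tokenization is as a bank of band-pass filters producing one stream per band, which is then tokenized and reconstructed. The sketch below decomposes a synthetic ExG trace into a handful of canonical EEG-style bands; the paper's 12 physiology-informed tokens, their band edges, and the subsequent tokenization are not specified here, so all band values and the sampling rate are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

# Illustrative physiology-motivated bands in Hz (canonical EEG-style ranges, not the paper's 12 tokens).
BANDS = [(0.5, 4), (4, 8), (8, 13), (13, 30), (30, 50), (50, 100)]

def multiband_streams(signal, fs=250):
    """Split a 1-D ExG recording into per-band streams, one stream per band."""
    streams = []
    for low, high in BANDS:
        b, a = butter(4, [low, high], btype="bandpass", fs=fs)
        streams.append(filtfilt(b, a, signal))
    return np.stack(streams)   # (n_bands, n_samples)

x = np.random.randn(5 * 250)          # 5 s of synthetic data at 250 Hz
print(multiband_streams(x).shape)
```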
The experiments conducted on the DailySense dataset and four public ExG benchmarks demonstrate the effectiveness of the proposed method. The paper reports consistent improvements over state-of-the-art methods across diverse tasks, which is a strong indicator of the technical impact. However, the evaluation metrics and statistical significance of the results could be elaborated upon to strengthen the claims made.
The paper lacks sufficient details regarding the implementation of the proposed methods, which may hinder reproducibility. While the authors mention the collection of a new dataset, they do not provide access to the data or the code used for the experiments, which is critical for other researchers to validate the findings.
One limitation noted is the reliance on a specific hardware prototype for data collection, which may not be widely accessible. Additionally, the dataset, while extensive, may still lack diversity in certain contexts, potentially affecting the generalizability of the findings. The authors should also address the computational complexity of the proposed method, as it may impact real-time applications.
The implications of this research are substantial, as it opens up new avenues for ExG signal analysis in everyday environments. The ability to monitor physiological signals unobtrusively could lead to advancements in health monitoring, human-computer interaction, and personalized applications in various fields. The task-agnostic nature of the proposed approach also suggests potential for broader applications beyond the initial scope.
The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
Primary: Natural Science Foundation of China
All Institutions: Natural Science Foundation of China
This paper introduces EchoFake, a novel dataset designed to enhance the detection of audio deepfakes in real-world scenarios by integrating zero-shot TTS deepfakes with diverse replay recordings. The comprehensive methodology and experimental evaluation highlight critical weaknesses in existing models, paving the way for improved anti-spoofing systems.
The methodology presented in this paper is robust, focusing on the construction of the EchoFake dataset that integrates both zero-shot TTS deepfakes and physical replay recordings. The authors systematically varied playback and recording conditions to simulate real-world scenarios, which is a significant advancement over previous datasets that primarily relied on lab-generated samples. The use of diverse TTS models and the careful design of the dataset subsets (training, development, closed-set, and open-set evaluations) enhance the dataset's applicability for real-world detection tasks. However, the paper could benefit from more detailed descriptions of the TTS models used and their specific configurations.
The experimental evaluation is thorough, demonstrating the performance of baseline models on the EchoFake dataset and comparing them against existing benchmarks. The results clearly indicate that models trained on EchoFake show improved generalization capabilities, particularly in open-set scenarios, which is critical for real-world applications. The use of EER and F1-score as evaluation metrics is appropriate, but the paper could improve by providing more detailed statistical analyses of the results, such as confidence intervals or significance testing.
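For readers unfamiliar with the headline metric, the equal error rate used above can be computed directly from detector scores as sketched below; the labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-acceptance and false-rejection rates meet.
    labels: 1 = bona fide, 0 = spoof/replay; scores: higher means more likely bona fide."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2

labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.5, 0.3, 0.1])   # hypothetical detector outputs
print(f"EER = {equal_error_rate(labels, scores):.2%}")
```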
The authors have taken steps to ensure reproducibility by open-sourcing the dataset and providing scripts for dataset construction. However, the paper lacks detailed information on the training configurations for the baseline models, which could hinder full reproducibility of the results. Including hyperparameter settings and specific training procedures would enhance the reproducibility aspect.
One limitation of the study is the potential bias introduced by the specific TTS models chosen for generating synthetic speech. While the authors selected popular and state-of-the-art models, the generalizability of the findings may be limited to the characteristics of these models. Additionally, the dataset, while comprehensive, may not cover all possible replay scenarios, which could affect the robustness of the detection systems in even more diverse real-world conditions.
The implications of this research are significant, given the increasing prevalence of speech deepfakes and the associated risks of fraud and identity theft. By providing a more realistic dataset for training and evaluating anti-spoofing systems, this work could lead to the development of more effective detection methods, ultimately enhancing security in applications such as telecommunications and online identity verification. The open-source nature of the dataset also encourages further research and development in the field.
Recent foundation models such as SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA$^2$) to improve inference reliability. Experiments on five public benchmarks (AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene) show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. AMAuT thus presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of AMAuT, a flexible and efficient multiview audio transformer framework that eliminates the dependency on pre-trained weights while achieving state-of-the-art performance on audio classification tasks. This work significantly advances the field by addressing the limitations of existing models and providing a practical solution for diverse audio processing scenarios.
The proposed AMAuT framework introduces a novel architecture that integrates augmentation-driven multiview learning, a flexible 1D CNN bottleneck, and dual CLS + TAL tokens for improved contextual representation. The methodology is well-structured, allowing for arbitrary sample rates and audio lengths, which is a significant advancement over existing models that rely on fixed input constraints. The use of test-time adaptation and augmentation ($TTA^2$) enhances robustness, making the approach comprehensive and innovative.
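A minimal reading of the conv1 + conv7 + conv1 bottleneck is a pointwise squeeze, a kernel-7 temporal convolution, and a pointwise expansion over the time axis. The sketch below implements that reading in PyTorch; the channel widths, normalization, activation choices, and residual connection are assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn

class Conv1d171Bottleneck(nn.Module):
    """Illustrative conv1 + conv7 + conv1 temporal bottleneck; widths are assumptions."""
    def __init__(self, in_ch=128, mid_ch=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_ch, mid_ch, kernel_size=1),              # pointwise squeeze
            nn.BatchNorm1d(mid_ch), nn.GELU(),
            nn.Conv1d(mid_ch, mid_ch, kernel_size=7, padding=3),  # local temporal context
            nn.BatchNorm1d(mid_ch), nn.GELU(),
            nn.Conv1d(mid_ch, in_ch, kernel_size=1),              # pointwise expand
        )

    def forward(self, x):          # x: (batch, channels, time)
        return x + self.block(x)   # residual connection is an assumption

out = Conv1d171Bottleneck()(torch.randn(4, 128, 200))
print(out.shape)
```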
The experiments conducted on five public benchmarks demonstrate that AMAuT achieves high accuracy (up to 99.8%) while significantly reducing computational costs compared to pre-trained models. The results are compelling, showcasing the framework's efficiency and flexibility across different datasets. However, the paper could benefit from more detailed comparisons with other state-of-the-art models beyond just accuracy metrics.
The authors provide a GitHub repository for the AMAuT implementation, which supports reproducibility. However, the paper lacks detailed hyperparameter settings and training configurations, which are crucial for ensuring that other researchers can replicate the results accurately.
The paper acknowledges limitations such as sensitivity to dataset size and hyperparameter dependence, particularly for smaller datasets. Additionally, the inference latency introduced by the test-time adaptation process could hinder real-time applications.
The AMAuT framework has the potential to democratize access to high-performance audio classification by enabling researchers and practitioners with limited computational resources to train state-of-the-art models from scratch. Its flexibility and efficiency could lead to broader applications in real-time audio processing, embedded systems, and low-resource environments.
Spoken Question-Answering (SQA) is a core capability for useful and interactive artificial intelligence systems. Recently, several speech-language models (SpeechLMs) have been released with a specific focus on improving their SQA performance. However, a lack of controlled ablations of pretraining data processing and curation makes it challenging to understand what factors account for performance, despite substantial gains from similar studies in other data modalities. In this work, we address this gap by conducting a data-centric exploration for pretraining SpeechLMs. We focus on three research questions fundamental to speech-language pretraining data: (1) how to process raw web-crawled audio content for speech-text pretraining, (2) how to construct synthetic pretraining datasets to augment web-crawled data and (3) how to interleave (text, audio) segments into training sequences. We apply the insights from our controlled data-centric ablations to pretrain a 3.8B-parameter SpeechLM, called SpeLangy, that outperforms models up to 3x larger by 10.2% in absolute performance. We hope our findings highlight the impact of effective data curation for speech-language pretraining and guide future data-centric exploration in SpeechLMs.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a data-centric framework for pretraining SpeechLMs, demonstrating that careful data processing and augmentation can lead to substantial improvements in model performance. This work is significant as it addresses a gap in the understanding of how data quality affects the performance of speech-language models, paving the way for future advancements in the field.
The paper presents a data-centric approach to pretraining SpeechLMs, focusing on three critical aspects: processing raw web-crawled audio, constructing synthetic datasets, and interleaving audio-text segments. The methodology is well-structured, with controlled ablations that allow for a clear understanding of the impact of each component on model performance. However, the paper could benefit from more detailed descriptions of the data processing techniques and the synthetic dataset construction process to enhance reproducibility.
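The third research question, how to interleave (text, audio) segments into training sequences, can be illustrated with a toy packing routine that alternates modality-tagged token spans in temporal order. The markers and segmentation policy below are invented for illustration and do not reflect the paper's actual scheme.

```python
# Toy illustration of interleaving (text, audio) segments into one training sequence.
def interleave(segments):
    """segments: list of ("text", token_list) or ("audio", token_list) pairs, in temporal order."""
    sequence = []
    for modality, tokens in segments:
        sequence.append(f"<{modality}>")    # modality markers are hypothetical
        sequence.extend(tokens)
        sequence.append(f"</{modality}>")
    return sequence

example = [("text", ["what", "is", "the", "capital"]),
           ("audio", ["a512", "a031", "a877"]),
           ("text", ["paris"])]
print(interleave(example))
```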
The experiments are robust, with a clear comparison between the proposed model, SpeLangy, and larger baseline models. The reported absolute performance improvement of 10.2% over models up to three times larger is significant and demonstrates the effectiveness of the data-centric approach. However, the paper lacks detailed information on the datasets used for evaluation, which could help contextualize the results further.
While the paper outlines the methodology and results, it does not provide sufficient implementation details or access to the datasets used, which may hinder reproducibility. Providing links to the datasets or a code repository would greatly enhance the ability of other researchers to replicate the findings.
One limitation is the lack of a comprehensive discussion on the potential biases introduced by the web-crawled data and synthetic datasets. Additionally, the paper does not explore the generalizability of the findings across different languages or dialects, which is critical for speech-language models.
The findings have significant implications for the development of interactive AI systems that rely on spoken question-answering capabilities. By emphasizing the importance of data curation, the paper encourages future research in data-centric methodologies, which could lead to more efficient and effective speech-language models.
Online speech enhancement has mainly been the domain of predictive models. A key advantage of these models is that, for an incoming signal frame from a stream of data, the model is called only once for enhancement. In contrast, generative speech enhancement models often require multiple calls, resulting in a computational complexity that is too high for many online speech enhancement applications. This work presents the Diffusion Buffer, a generative diffusion-based speech enhancement model that requires only one neural network call per incoming signal frame and performs enhancement in an online fashion on a consumer-grade GPU. The key idea of the Diffusion Buffer is to align physical time with diffusion time-steps. The approach progressively denoises frames through physical time, where past frames have more noise removed. Consequently, an enhanced frame is output to the listener with a delay defined by the Diffusion Buffer, and the output frame has a corresponding look-ahead. In this work, we extend our previous work by carefully designing a 2D convolutional UNet architecture that specifically aligns with the Diffusion Buffer's look-ahead. We observe that the proposed UNet improves performance, particularly when the algorithmic latency is low. Moreover, we show that using a Data Prediction loss instead of a Denoising Score Matching loss enables flexible control over the trade-off between algorithmic latency and quality during inference. The extended Diffusion Buffer, equipped with the novel network and loss function, drastically reduces the algorithmic latency from 320-960 ms to 32-176 ms while even increasing performance. While it has been shown before that offline generative diffusion models outperform predictive approaches on unseen noisy speech data, we confirm that the online Diffusion Buffer also outperforms its predictive counterpart on unseen noisy speech data.
Primary: Universität Hamburg
All Institutions: Universität Hamburg
The main contribution of this paper is the introduction of the Diffusion Buffer, a novel approach to online generative speech enhancement that effectively reduces algorithmic latency while improving performance on unseen noisy speech data. This work represents a significant advancement in the field of speech processing, combining innovative methodologies with practical applications.
The paper introduces the Diffusion Buffer, a novel generative diffusion-based model for online speech enhancement that significantly reduces computational latency while maintaining performance. The methodology is well-structured, focusing on aligning physical time with diffusion time-steps, and employs a carefully designed 2D convolutional UNet architecture. The use of a Data Prediction loss function instead of the traditional Denoising Score Matching loss is a notable innovation, allowing for flexible control over latency and quality during inference. The proposed architecture and training strategy effectively address the challenges of real-time processing in speech enhancement.
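The buffer mechanism described above can be caricatured as a fixed-length queue whose slots correspond to diffusion time-steps: each incoming frame enters at the noisiest slot, a single network call advances every buffered frame by one denoising step, and the oldest (cleanest) frame is released with a delay set by the buffer length. The sketch below captures only this scheduling logic with a dummy denoiser; it is not the paper's UNet, loss, or noise schedule.

```python
from collections import deque
import numpy as np

def run_diffusion_buffer(frames, denoise_step, buffer_len=8):
    """Stream frames through a buffer whose slots correspond to diffusion time-steps.
    `denoise_step(stack, levels)` stands in for the real score/data-prediction network;
    it is called exactly once per incoming frame."""
    buffer = deque(maxlen=buffer_len)
    outputs = []
    for frame in frames:
        buffer.append(frame)                              # newest frame = highest noise level
        stack = np.stack(list(buffer))
        levels = np.arange(len(buffer))[::-1]             # per-slot diffusion step indices
        stack = denoise_step(stack, levels)               # one network call per incoming frame
        buffer = deque(stack, maxlen=buffer_len)
        if len(buffer) == buffer_len:
            outputs.append(buffer[0])                     # oldest frame is (nearly) clean
    return outputs

frames = [np.random.randn(256) for _ in range(20)]
outs = run_diffusion_buffer(frames, denoise_step=lambda s, lv: 0.9 * s)  # dummy denoiser
print(len(outs))
```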
The experiments are comprehensive, utilizing a well-defined dataset (EARS-WHAM-v2) and a robust evaluation framework. The results demonstrate that the proposed model outperforms traditional predictive models in various metrics, particularly under unseen noisy conditions. The paper provides a thorough analysis of the performance improvements achieved through the Diffusion Buffer, with clear comparisons to baseline models. However, the paper could benefit from additional ablation studies to further dissect the contributions of individual components.
The authors mention that code will be released upon acceptance, which is a positive aspect for reproducibility. The methodology is described in sufficient detail, enabling other researchers to replicate the experiments. However, the lack of specific hyperparameter settings and training configurations in the main text may hinder complete reproducibility.
One limitation is the reliance on a specific hardware configuration (consumer-grade GPUs), which may not generalize to all environments. Additionally, while the model shows promise in reducing latency, the trade-off between latency and quality could be further explored, particularly in real-world applications. The paper also does not address potential scalability issues when applied to larger datasets or more complex noise environments.
The proposed method has significant implications for real-time communication applications, such as video conferencing and VoIP, where clear audio is crucial. By improving speech enhancement capabilities on consumer-grade hardware, this research could enhance the user experience in various interactive platforms. The findings may also inspire further research into generative models for other audio processing tasks. The main contribution of this paper is the introduction of the Diffusion Buffer, a novel approach to online generative speech enhancement that effectively reduces algorithmic latency while improving performance on unseen noisy speech data. This work represents a significant advancement in the field of speech processing, combining innovative methodologies with practical applications.
Estimating piano dynamics from audio recordings is a fundamental challenge in computational music analysis. In this paper, we propose an efficient multi-task network that jointly predicts dynamic levels, change points, beats, and downbeats from a shared latent representation. These four targets form the metrical structure of dynamics in the music score. Inspired by recent vocal dynamics research, we use a multi-scale network as the backbone, which takes Bark-scale specific loudness as the input feature. Compared to a log-Mel input, this reduces the model size from 14.7 M to 0.5 M parameters, enabling long sequential input. We segment audio into 60-second inputs, double the length commonly used in beat tracking. Evaluated on the public MazurkaBL dataset, our model achieves state-of-the-art results across all tasks. This work sets a new benchmark for piano dynamics estimation and delivers a powerful and compact tool, paving the way for large-scale, resource-efficient analysis of musical expression.
Primary: The University of Western Australia
All Institutions: The University of Western Australia
This paper presents a novel multi-task learning framework for estimating piano dynamics and metrical structure from audio, demonstrating significant advancements in model efficiency and performance. The integration of Bark-scale specific loudness and the innovative architecture contribute to its potential impact in the field of music analysis and machine learning.
The paper introduces a multi-task multi-scale network that effectively integrates the estimation of piano dynamics, change points, beats, and downbeats from audio recordings. The use of Bark-scale specific loudness as input is a notable innovation that addresses the limitations of traditional log-Mel spectrograms, particularly in terms of model size and efficiency. The architecture's design, which includes a Multi-gate Mixture-of-Experts (MMoE) layer, allows for specialized processing of different tasks while sharing a latent representation, showcasing a sophisticated understanding of multi-task learning. The methodology is well-justified and builds upon existing research in vocal dynamics, providing a solid theoretical foundation for the proposed approach.
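As a rough illustration of why the Bark-scale front-end is so compact compared with log-Mel, the sketch below pools an STFT power spectrogram into a small number of Bark bands. It is a simplification, assuming a Zwicker-style Bark mapping and omitting the loudness transforms of true specific loudness; all names are placeholders.

```python
# Simplified Bark-band energy front-end (illustrative; the paper uses Bark-scale
# *specific loudness*, which applies additional loudness transforms).
import numpy as np

def hz_to_bark(f):
    # Zwicker-style approximation of the Bark scale.
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_band_energies(power_spec, sr, n_fft, n_bands=24):
    # power_spec: (frames, n_fft // 2 + 1) magnitude-squared STFT.
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bark = hz_to_bark(freqs)
    edges = np.linspace(0.0, bark[-1] + 1e-6, n_bands + 1)
    band_idx = np.digitize(bark, edges) - 1   # Bark band index of each FFT bin
    bands = np.stack([power_spec[:, band_idx == b].sum(axis=1) for b in range(n_bands)], axis=1)
    return np.log1p(bands)                    # crude dynamic-range compression
```

A couple of dozen bands per frame, rather than hundreds of Mel bins, is what allows the much smaller backbone and the long 60-second inputs.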
The experiments are rigorously conducted using the MazurkaBL dataset, which is a well-suited choice for the tasks at hand. The authors employ a 5-fold cross-validation protocol, ensuring robust evaluation of their model's performance. The reported state-of-the-art results across all tasks, particularly in dynamics and change point detection, validate the effectiveness of their approach. The comparison with baseline methods is thorough, and the ablation studies provide valuable insights into the contributions of different components of the model.
The paper includes sufficient implementation details, including the architecture, feature extraction process, and training configurations, which enhance reproducibility. The availability of code and pre-trained models on GitHub further supports the potential for other researchers to replicate the work. However, the absence of a demo URL limits immediate accessibility for practical application.
One limitation is the reliance on the MazurkaBL dataset, which may not fully represent the diversity of piano performances across different genres and styles. Additionally, while the model achieves state-of-the-art results, the performance on downbeat detection is noted to be less optimal compared to other tasks, indicating potential areas for improvement. The paper also does not address the computational cost of the model in real-time applications, which could be a concern for practical deployment.
This work has significant implications for computational music analysis, particularly in enhancing music education, performance analysis, and automated music transcription systems. By providing a compact and efficient tool for estimating musical dynamics, it opens avenues for large-scale analyses of musical expression, potentially benefiting musicians, educators, and researchers alike. This paper presents a novel multi-task learning framework for estimating piano dynamics and metrical structure from audio, demonstrating significant advancements in model efficiency and performance. The integration of Bark-scale specific loudness and the innovative architecture contribute to its potential impact in the field of music analysis and machine learning.
Controlling speaking style in text-to-speech (TTS) systems has become a growing focus in both academia and industry. While many existing approaches rely on reference audio to guide style generation, such methods are often impractical due to privacy concerns and limited accessibility. More recently, large language models (LLMs) have been used to control speaking style through natural language prompts; however, their high computational cost, lack of interpretability, and sensitivity to prompt phrasing limit their applicability in real-time and resource-constrained environments. In this work, we propose ParaStyleTTS, a lightweight and interpretable TTS framework that enables expressive style control from text prompts alone. ParaStyleTTS features a novel two-level style adaptation architecture that separates prosodic and paralinguistic speech style modeling. It allows fine-grained and robust control over factors such as emotion, gender, and age. Unlike LLM-based methods, ParaStyleTTS maintains consistent style realization across varied prompt formulations and is well-suited for real-world applications, including on-device and low-resource deployment. Experimental results show that ParaStyleTTS generates high-quality speech with performance comparable to state-of-the-art LLM-based systems while being 30x faster, using 8x fewer parameters, and requiring 2.5x less CUDA memory. Moreover, ParaStyleTTS exhibits superior robustness and controllability over paralinguistic speaking styles, providing a practical and efficient solution for style-controllable text-to-speech generation. Demo can be found at https://parastyletts.github.io/ParaStyleTTS_Demo/. Code can be found at https://github.com/haoweilou/ParaStyleTTS.
Primary: University of New South Wales
All Institutions: University of New South Wales, CSIRO's Data61
ParaStyleTTS represents a significant contribution to the field of text-to-speech generation, offering an innovative approach to style control that balances efficiency and expressiveness. The methodology's emphasis on separating prosodic and paralinguistic features, combined with a robust experimental evaluation, positions this work as a valuable advancement in the ongoing development of TTS technologies.
The proposed methodology of ParaStyleTTS is innovative, utilizing a two-level style adaptation architecture that effectively separates prosodic and paralinguistic features, allowing for fine-grained control over speech generation. The model's end-to-end design is particularly noteworthy, as it eliminates the need for external vocoders and achieves high-quality speech synthesis directly from text prompts. The use of a lightweight architecture that is computationally efficient is a significant advancement over existing LLM-based methods, which are often resource-intensive and less interpretable.
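A hypothetical sketch of the two-level style conditioning is shown below, with a prosodic style vector and a paralinguistic style vector applied as FiLM-style adapters at separate stages. The module names, dimensions, and adapter form are assumptions for illustration, not the paper's code.

```python
# Illustrative two-level style conditioning via FiLM-style adapters (names/sizes assumed).
import torch
import torch.nn as nn

class StyleAdapter(nn.Module):
    def __init__(self, style_dim, hidden_dim):
        super().__init__()
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_shift = nn.Linear(style_dim, hidden_dim)

    def forward(self, h, style):
        # h: (batch, time, hidden); style: (batch, style_dim)
        scale = self.to_scale(style).unsqueeze(1)
        shift = self.to_shift(style).unsqueeze(1)
        return h * (1 + scale) + shift

class TwoLevelStyle(nn.Module):
    def __init__(self, style_dim=128, hidden_dim=256):
        super().__init__()
        self.prosodic = StyleAdapter(style_dim, hidden_dim)        # pitch/energy/duration level
        self.paralinguistic = StyleAdapter(style_dim, hidden_dim)  # emotion/gender/age level

    def forward(self, h, prosody_style, para_style):
        return self.paralinguistic(self.prosodic(h, prosody_style), para_style)
```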
The experimental evaluation is robust, employing a comprehensive multilingual dataset that includes diverse speech samples across different styles and emotions. The authors conducted both objective metrics and subjective evaluations (MOS tests) to assess intelligibility and naturalness, providing a well-rounded view of the model's performance. The results demonstrate that ParaStyleTTS achieves competitive performance compared to state-of-the-art models while significantly improving efficiency, which is crucial for real-world applications.
The paper includes sufficient details regarding the architecture, training procedures, and datasets used, which enhances reproducibility. The authors provide links to the demo and code repositories, allowing other researchers to replicate their findings and build upon their work. However, the lack of specific hyperparameters and training configurations could be a minor barrier to complete reproducibility.
While ParaStyleTTS shows strong performance, it still lags slightly behind LLM-based models in overall intelligibility and subjective naturalness. The model currently supports only three paralinguistic styles, which limits its applicability. Additionally, the authors acknowledge that expanding the training dataset and the range of controllable styles is a future direction, indicating that there is room for improvement.
The advancements presented in ParaStyleTTS have significant implications for the development of more efficient and expressive TTS systems. Its ability to operate effectively in resource-constrained environments makes it suitable for a wide range of applications, including virtual assistants, accessibility tools, and interactive storytelling. The model's robustness to prompt variations also enhances its usability in real-world scenarios, where user input may vary significantly. ParaStyleTTS represents a significant contribution to the field of text-to-speech generation, offering an innovative approach to style control that balances efficiency and expressiveness. The methodology's emphasis on separating prosodic and paralinguistic features, combined with a robust experimental evaluation, positions this work as a valuable advancement in the ongoing development of TTS technologies.
Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
Primary: Millburn High School
All Institutions: Millburn High School
This paper presents a significant advancement in the field of speech processing by introducing novel end-to-end models for stutter correction. The innovative methodology and promising results highlight its potential impact on accessibility and inclusivity in speech technology.
The paper introduces two novel end-to-end models, StutterZero and StutterFormer, which directly convert stuttered speech into fluent speech while simultaneously predicting transcription. The methodology is well-structured, employing a convolutional-bidirectional LSTM for StutterZero and a dual-stream Transformer for StutterFormer. The use of multitask learning to jointly optimize both speech and transcription tasks is innovative and addresses a significant gap in existing literature. The training on synthesized datasets and the subsequent evaluation on unseen speakers demonstrate a thoughtful approach to model training and validation.
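The joint objective can be pictured as a weighted sum of a fluent-waveform reconstruction term and a transcription term; the specific losses (L1 plus CTC) and the weighting are assumptions chosen for illustration, not necessarily those used in the papers.

```python
# Hedged sketch of a joint conversion-plus-transcription objective (loss choices assumed).
import torch.nn.functional as F

def multitask_loss(pred_wave, target_wave, log_probs, targets,
                   input_lengths, target_lengths, alpha=0.5):
    # log_probs: (T, batch, classes) log-softmax outputs of the transcription branch.
    recon = F.l1_loss(pred_wave, target_wave)                            # fluent-waveform term
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)  # transcription term
    return recon + alpha * ctc
```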
The experiments are comprehensive, utilizing multiple metrics (WER, CER, BERTScore) to evaluate model performance against state-of-the-art ASR systems. The results show significant improvements in transcription accuracy, with StutterFormer outperforming Whisper-Medium by a notable margin. The statistical significance of the results is well-supported by the Wilcoxon Signed-Rank Test, adding credibility to the findings. However, the reliance on TTS-generated data for training raises questions about the generalizability of the results.
The paper provides detailed implementation specifics, including data preprocessing, model architecture, and training configurations. However, the lack of publicly available code or datasets limits reproducibility. Future work could benefit from making the models and datasets accessible to the research community.
Key limitations include the reliance on TTS-generated data, which may not fully capture the nuances of natural speech, potentially affecting model performance in real-world scenarios. Additionally, the datasets used for training are limited in diversity, which could hinder the generalization of the models across different speaker demographics and accents. The hardware constraints faced during training also suggest that further exploration with more powerful resources could yield better results.
The research has significant implications for improving accessibility in human-computer interaction and speech therapy for individuals who stutter. By providing a tool that can convert stuttered speech into fluent speech in real-time, it opens avenues for enhancing communication experiences for millions of people. The potential applications in clinical settings and real-time communication systems could greatly benefit individuals with speech disorders. This paper presents a significant advancement in the field of speech processing by introducing novel end-to-end models for stutter correction. The innovative methodology and promising results highlight its potential impact on accessibility and inclusivity in speech technology.
In audio signal processing, learnable front-ends have shown strong performance across diverse tasks by optimizing task-specific representations. However, their parameters remain fixed once trained, offering no flexibility during inference and limiting robustness in dynamic, complex acoustic environments. In this paper, we introduce a novel adaptive paradigm for audio front-ends that replaces static parameterization with a closed-loop neural controller. Specifically, we simplify the learnable front-end LEAF architecture and integrate a neural controller that adapts the representation by dynamically tuning Per-Channel Energy Normalization (PCEN). The neural controller leverages both the current and the buffered past subband energies to enable input-dependent adaptation during inference. Experimental results on multiple audio classification tasks demonstrate that the proposed adaptive front-end consistently outperforms prior fixed and learnable front-ends under both clean and complex acoustic conditions. These results highlight neural adaptability as a promising direction for the next generation of audio front-ends.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of LEAF-APCEN, an adaptive audio front-end that dynamically tunes parameters for robust audio representation, significantly outperforming traditional fixed and learnable front-ends. This work represents a meaningful advancement in the field of audio signal processing, addressing critical challenges in adaptability and robustness.
The proposed methodology introduces a novel adaptive front-end architecture, LEAF-APCEN, which integrates a neural controller to dynamically adjust parameters of a simplified Per-Channel Energy Normalization (PCEN) module. This approach effectively addresses the limitations of static parameterization in traditional audio front-ends by enabling real-time adaptation to varying acoustic conditions. The simplification of PCEN from four to two parameters enhances efficiency without sacrificing performance, showcasing a thoughtful balance between complexity and effectiveness.
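To fix ideas, standard PCEN has a gain exponent, a root-compression exponent, a bias, and a smoothing coefficient; the sketch below keeps the full PCEN form but lets a controller supply two of them per channel. Which two parameters the authors keep adaptive is treated here as an assumption (alpha and r), and the controller itself is abstracted away.

```python
# Minimal PCEN sketch with two controller-predicted per-channel parameters (choice assumed).
import numpy as np

def pcen(energy, alpha, r, s=0.04, eps=1e-6, delta=2.0):
    # energy: (time, channels) subband energies; alpha, r: (channels,) from a neural controller.
    m = np.zeros_like(energy)
    m[0] = energy[0]
    for t in range(1, len(energy)):              # first-order IIR smoother of past energies
        m[t] = (1 - s) * m[t - 1] + s * energy[t]
    return (energy / (eps + m) ** alpha + delta) ** r - delta ** r
```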
The experimental evaluation is comprehensive, covering multiple audio classification tasks under both clean and complex acoustic conditions. The results demonstrate significant improvements in accuracy for the proposed LEAF-APCEN compared to fixed and learnable front-ends, particularly in challenging environments. The use of diverse datasets strengthens the validation of the method, although more detailed statistical analysis could enhance the robustness of the claims.
The paper provides sufficient detail regarding the experimental setup, model configurations, and training procedures, which aids in reproducibility. However, the absence of a public repository or demo URL limits the ability for others to directly replicate the results. Including code and data access would significantly improve the reproducibility of the findings.
One notable limitation is the focus on single-channel audio inputs, which may restrict the applicability of the method in multi-channel scenarios. Additionally, while the results are promising, the performance in music genre classification under complex conditions did not show the same level of improvement, indicating potential areas for further exploration.
The proposed adaptive audio front-end has significant implications for various applications, including speech recognition, environmental sound classification, and music analysis. By enhancing robustness in dynamic acoustic environments, this work could lead to improved performance in real-world applications, particularly in settings with noise interference. The main contribution of this paper is the introduction of LEAF-APCEN, an adaptive audio front-end that dynamically tunes parameters for robust audio representation, significantly outperforming traditional fixed and learnable front-ends. This work represents a meaningful advancement in the field of audio signal processing, addressing critical challenges in adaptability and robustness.
Robust speaker verification under noisy conditions remains an open challenge. Conventional deep learning methods learn a unified speaker representation space that is robust against diverse background noise and achieve significant improvements. In contrast, this paper presents a noise-conditioned mixture-of-experts framework that decomposes the feature space into specialized noise-aware subspaces for speaker verification. Specifically, we propose a noise-conditioned expert routing mechanism, a universal model-based expert specialization strategy, and an SNR-decaying curriculum learning protocol, collectively improving model robustness and generalization under diverse noise conditions. The proposed method automatically routes inputs to expert networks based on noise information derived from the inputs, where each expert targets distinct noise characteristics while preserving speaker identity information. Comprehensive experiments demonstrate consistent superiority over baselines, confirming that explicit noise-dependent feature modeling significantly enhances robustness without sacrificing verification accuracy.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a noise-conditioned mixture-of-experts framework that effectively improves speaker verification robustness in noisy environments. This work presents a novel approach to feature modeling that could significantly advance the field of speaker verification, addressing a critical challenge in real-world applications.
The proposed noise-conditioned mixture-of-experts (NCMoE) framework is innovative in its approach to speaker verification under noisy conditions. By decomposing the feature space into noise-specific subspaces, the authors address the limitations of traditional unified feature modeling. The routing mechanism that selects expert networks based on noise characteristics is a significant contribution, as is the universal model-based expert specialization strategy, which allows for efficient training and specialization. The SNR-decaying curriculum learning protocol is also a thoughtful addition that enhances training stability and model robustness.
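The routing mechanism can be sketched as a softmax gate over expert outputs, conditioned only on a noise embedding derived from the input; the expert count, shapes, and gating form below are illustrative assumptions.

```python
# Illustrative noise-conditioned expert routing (shapes and gating form are assumptions).
import torch
import torch.nn as nn

class NoiseConditionedMoE(nn.Module):
    def __init__(self, feat_dim=256, noise_dim=64, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
             for _ in range(n_experts)]
        )
        self.gate = nn.Linear(noise_dim, n_experts)  # routes on noise information only

    def forward(self, x, noise_emb):
        # x: (batch, feat_dim) speaker features; noise_emb: (batch, noise_dim)
        weights = torch.softmax(self.gate(noise_emb), dim=-1)          # (batch, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, feat_dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)
```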
The experiments are comprehensive, utilizing the VoxCeleb1 dataset with various noise types and SNR levels. The results show clear improvements over baseline methods, demonstrating the effectiveness of the proposed framework. The ablation studies provide valuable insights into the importance of each component of the model, reinforcing the contributions made by the noise classification and expert specialization strategies. However, the paper could benefit from more extensive comparisons with state-of-the-art methods in real-world scenarios.
The implementation details are well-documented, including the architecture, training protocols, and data augmentation strategies. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Providing access to the code would significantly enhance the paper's impact and allow for further validation of the findings.
One limitation is the reliance on simulated noise conditions, which may not fully capture the complexities of real-world environments. Additionally, while the method shows improvements in noisy conditions, it is unclear how it performs in extreme noise scenarios or with unseen noise types. The paper could also explore the computational efficiency of the proposed method in more detail, as the mixture-of-experts approach may introduce overhead.
The proposed framework has significant implications for speaker verification systems, particularly in applications requiring robust performance in noisy environments, such as security and smart devices. By improving the accuracy of speaker verification under diverse conditions, this research could enhance user experience and security in various applications. The main contribution of this paper is the introduction of a noise-conditioned mixture-of-experts framework that effectively improves speaker verification robustness in noisy environments. This work presents a novel approach to feature modeling that could significantly advance the field of speaker verification, addressing a critical challenge in real-world applications.
Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of DELULU, a speaker-aware self-supervised foundational model that significantly enhances speaker-discriminative feature extraction in speech processing tasks. The innovative integration of external supervision and a dual training objective positions this work as a substantial advancement in the field of self-supervised learning for audio applications.
The methodology presented in DELULU is innovative, particularly in its integration of external supervision into the pseudo-label generation process. By utilizing frame-level embeddings from ReDimNet, the authors effectively introduce a speaker-discriminative inductive bias that enhances representation learning. The dual objective of masked prediction and denoising is a thoughtful approach that likely contributes to the model's robustness and generalization capabilities. However, the paper could benefit from a more detailed explanation of the k-means clustering step and how it interacts with the overall training process.
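The speaker-aware pseudo-labeling step reduces to clustering frame-level speaker embeddings and using the cluster ids as masked-prediction targets; the sketch below uses scikit-learn's MiniBatchKMeans, with the embedding extractor and cluster count left as placeholders.

```python
# Sketch of speaker-aware pseudo-label generation (extractor and cluster count are placeholders).
from sklearn.cluster import MiniBatchKMeans

def make_pseudo_labels(frame_embeddings, n_clusters=500, seed=0):
    # frame_embeddings: (n_frames, dim) from a pre-trained speaker-verification model.
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    return km.fit_predict(frame_embeddings)  # cluster ids serve as HuBERT-style targets
```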
The experimental setup is comprehensive, with a clear focus on speaker-centric tasks. The reported results, including a 62% relative improvement in EER for speaker verification, are impressive and demonstrate the effectiveness of DELULU. The inclusion of zero-shot profiling tasks adds depth to the evaluation, showcasing the model's versatility. However, the paper lacks a comparison with a broader range of existing models, which would strengthen the claims of superiority.
The paper does not provide sufficient details on implementation, such as hyperparameters, training duration, or dataset specifics, which are crucial for reproducibility. While the results are promising, the absence of a code repository or supplementary materials limits the ability for other researchers to replicate the findings.
The paper acknowledges limitations, particularly in the reliance on external supervision, which may not be available in all scenarios. Additionally, the model's performance on diverse datasets beyond the ones tested could be a concern, as generalizability is critical in real-world applications.
DELULU has significant potential applications in speaker verification, diarization, and profiling, which are increasingly relevant in various sectors, including security and customer service. The model's ability to operate effectively without task-specific fine-tuning is particularly noteworthy, as it could facilitate broader adoption in practical applications. The main contribution of this paper is the introduction of DELULU, a speaker-aware self-supervised foundational model that significantly enhances speaker-discriminative feature extraction in speech processing tasks. The innovative integration of external supervision and a dual training objective positions this work as a substantial advancement in the field of self-supervised learning for audio applications.
The growing demand for home healthcare calls for tools that can support care delivery. In this study, we explore automatic health assessment from voice using real-world home care visit data, leveraging the diverse patient information it contains. First, we utilize Large Language Models (LLMs) to integrate Subjective, Objective, Assessment, and Plan (SOAP) notes derived from unstructured audio transcripts and structured vital signs into a holistic illness score that reflects a patient's overall health. This compact representation facilitates cross-visit health status comparisons and downstream analysis. Next, we design a multi-stage preprocessing pipeline to extract short speech segments from target speakers in home care recordings for acoustic analysis. We then employ an Audio Language Model (ALM) to produce plain-language descriptions of vocal biomarkers and examine their association with individuals' health status. Our experimental results benchmark both commercial and open-source LLMs in estimating illness scores, demonstrating their alignment with actual clinical outcomes, and revealing that SOAP notes are substantially more informative than vital signs. Building on the illness scores, we provide the first evidence that ALMs can identify health-related acoustic patterns from home care recordings and present them in a human-readable form. Together, these findings highlight the potential of LLMs and ALMs to harness heterogeneous in-home visit data for better patient monitoring and care.
Primary: Columbia University
All Institutions: Columbia University, Department of Computer Science, Department of Electrical Engineering, School of Nursing, The Fu Foundation School of Engineering and Applied Science
This study presents a pioneering approach to health assessment by leveraging LLMs and ALMs to analyze vocal biomarkers from home healthcare data. The innovative methodology and promising results indicate a significant step forward in the application of AI in healthcare, although further work is needed to address reproducibility and implementation challenges.
The methodology presented in this paper is robust, combining LLMs and ALMs to create a novel framework for health assessment based on vocal biomarkers. The integration of SOAP notes and vital signs into a unified illness score is innovative, allowing for a more holistic view of patient health. The multi-stage preprocessing pipeline for acoustic analysis is well-designed, addressing challenges inherent in real-world data collection. However, the reliance on LLMs for generating SOAP notes and illness scores raises questions about potential biases in the model outputs and the interpretability of the generated scores.
The experimental evaluation is thorough, with a clear focus on benchmarking various LLMs and ALMs. The use of real-world home care visit data adds significant value, as it reflects authentic patient-clinician interactions. The results demonstrate that LLM-generated illness scores align well with clinical outcomes, providing evidence of the method's effectiveness. However, the paper could benefit from more detailed statistical analysis and comparisons with traditional health assessment methods to strengthen the claims made.
The paper provides a comprehensive overview of the models and methods used, including specific LLMs and ALMs. However, it lacks detailed information on the implementation and access to the datasets, which may hinder reproducibility. The absence of a publicly accessible code repository or demo further limits the ability for others to replicate the study.
The study acknowledges several limitations, including the challenges of background noise and speaker overlap in real-world recordings. Additionally, the focus on the first 30 seconds of speech may overlook important acoustic cues that could emerge later in the conversation. The potential for LLMs to rely on contextual information rather than purely acoustic signals is another concern that warrants further investigation.
This research has significant implications for the future of home healthcare, particularly in enhancing patient monitoring through voice analysis. The findings suggest that vocal biomarkers can serve as valuable supplementary indicators of health status, which could lead to more timely interventions and improved patient outcomes. The approach also highlights the potential for integrating AI technologies into clinical practice, paving the way for more personalized and efficient healthcare solutions. This study presents a pioneering approach to health assessment by leveraging LLMs and ALMs to analyze vocal biomarkers from home healthcare data. The innovative methodology and promising results indicate a significant step forward in the application of AI in healthcare, although further work is needed to address reproducibility and implementation challenges.
We address the problem of estimating room impulse responses (RIRs) in noisy, uncontrolled environments where non-stationary sounds such as speech or footsteps corrupt conventional deconvolution. We propose AnyRIR, a non-intrusive method that uses music as the excitation signal instead of a dedicated test signal, and formulate RIR estimation as an L1-norm regression in the time-frequency domain. Solved efficiently with Iterative Reweighted Least Squares (IRLS) and Least-Squares Minimal Residual (LSMR) methods, this approach exploits the sparsity of non-stationary noise to suppress its influence. Experiments on simulated and measured data show that AnyRIR outperforms L2-based and frequency-domain deconvolution, under in-the-wild noisy scenarios and codec mismatch, enabling robust RIR estimation for AR/VR and related applications.
Primary: Aalto University
All Institutions: Aalto University, University of York, Friedrich-Alexander-Universität Erlangen-Nürnberg
The paper presents AnyRIR, a novel method for robust RIR estimation using music as an excitation signal, significantly advancing the field of acoustic measurement in noisy environments. The combination of innovative methodology and thorough experimental validation positions this work as a meaningful contribution to audio signal processing and machine learning applications in acoustics.
The proposed AnyRIR method innovatively utilizes music as an excitation signal for RIR estimation, which is a significant departure from traditional methods that rely on controlled signals. The formulation of the problem as an L1-norm regression in the time-frequency domain is well-justified, particularly for handling non-stationary noise. The use of IRLS and LSMR for efficient computation demonstrates a solid understanding of optimization techniques suitable for large-scale problems. However, the paper could benefit from a deeper exploration of the theoretical underpinnings of the L1-norm approach compared to other methods.
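The L1 regression itself is standard enough to sketch: iteratively reweighted least squares turns the L1 objective into a sequence of weighted L2 problems, each solvable with LSMR. The dense convolution matrix, iteration count, and reweighting floor below are assumptions, not the authors' exact solver.

```python
# Generic IRLS + LSMR sketch for min_x ||y - Hx||_1 (H plays the role of the music-excitation
# convolution operator; here it is just a dense matrix for illustration).
import numpy as np
from scipy.sparse.linalg import lsmr

def irls_l1(H, y, n_iter=10, eps=1e-4):
    x = lsmr(H, y)[0]                                   # plain L2 warm start
    for _ in range(n_iter):
        r = y - H @ x
        w = 1.0 / np.sqrt(np.maximum(np.abs(r), eps))   # row weights approximating the L1 loss
        x = lsmr(H * w[:, None], y * w)[0]              # weighted least squares via LSMR
    return x
```

Large residuals caused by non-stationary noise receive small weights, which is how the sparsity of such disturbances is exploited.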
The experiments conducted on both simulated and real-world data are robust, showcasing the method's effectiveness in various scenarios, including codec mismatches and non-stationary noise. The comparison with baseline methods (L2-norm and frequency-domain deconvolution) is well-executed, providing clear evidence of AnyRIR's advantages. However, the paper could improve by including more diverse real-world environments to further validate the method's generalizability.
The authors provide a clear implementation of their method, with links to a GitHub repository and a demo page. The detailed description of the algorithms and the preprocessing steps enhances reproducibility. However, the paper lacks specific details on the datasets used, which could hinder full reproducibility for external researchers.
One limitation is the reliance on the assumption that non-stationary noise can be effectively modeled as outliers in the L1-norm framework. In highly dynamic environments with complex noise patterns, this assumption may not hold. Additionally, while the method shows promise in noisy environments, its performance in extremely noisy or chaotic settings remains to be tested.
The AnyRIR method has significant implications for applications in AR/VR, smart speakers, and other audio technologies, where accurate acoustic modeling is crucial. By enabling robust RIR estimation in uncontrolled environments, it opens avenues for more immersive audio experiences in public spaces. The paper presents AnyRIR, a novel method for robust RIR estimation using music as an excitation signal, significantly advancing the field of acoustic measurement in noisy environments. The combination of innovative methodology and thorough experimental validation positions this work as a meaningful contribution to audio signal processing and machine learning applications in acoustics.
Acoustic scene classification (ASC) suffers from device-induced domain shift, especially when labels are limited. Prior work focuses on curriculum-based training schedules that structure data presentation by ordering or reweighting training examples from easy-to-hard to facilitate learning; however, existing curricula are static, fixing the ordering or the weights before training and ignoring that example difficulty and marginal utility evolve with the learned representation. To overcome this limitation, we propose the Dynamic Dual-Signal Curriculum (DDSC), a training schedule that adapts the curriculum online by combining two signals computed each epoch: a domain-invariance signal and a learning-progress signal. A time-varying scheduler fuses these signals into per-example weights that prioritize domain-invariant examples in early epochs and progressively emphasize device-specific cases. DDSC is lightweight, architecture-agnostic, and introduces no additional inference overhead. Under the official DCASE 2024 Task 1 protocol, DDSC consistently improves cross-device performance across diverse ASC baselines and label budgets, with the largest gains on unseen-device splits.
Primary: Xi’an Jiaotong-Liverpool University
All Institutions: Xi’an Jiaotong-Liverpool University
The paper presents DDSC, a dynamic curriculum learning method that effectively addresses domain shift in acoustic scene classification, showcasing substantial improvements in model performance under low-label conditions. The innovative methodology and rigorous experimental validation position this work as a valuable contribution to the field of machine learning and audio processing.
The proposed Dynamic Dual-Signal Curriculum (DDSC) introduces a novel approach to curriculum learning by dynamically adjusting the weights of training examples based on two signals: domain-invariance and learning-progress. This methodology is innovative as it addresses the static nature of previous curriculum learning methods, allowing for a more adaptive learning process that evolves with the model's understanding. The use of prototype entropy for domain-invariance and smoothed loss change for learning progress is a thoughtful integration that enhances the model's ability to generalize across devices.
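A compact way to picture the fusion is a time-varying convex combination of the two per-example signals, normalized and turned into training weights; the min-max normalization and linear schedule below are assumptions, not the paper's exact scheduler.

```python
# Illustrative per-example weight fusion for a dynamic curriculum (schedule and
# normalization are assumptions, not the paper's exact scheduler).
import numpy as np

def ddsc_weights(invariance, progress, epoch, total_epochs):
    # invariance: higher = more domain-invariant; progress: higher = larger recent learning gain.
    inv = (invariance - invariance.min()) / (np.ptp(invariance) + 1e-8)
    prog = (progress - progress.min()) / (np.ptp(progress) + 1e-8)
    lam = 1.0 - epoch / max(total_epochs - 1, 1)  # emphasize invariant examples early on
    w = lam * inv + (1.0 - lam) * prog
    return w / (w.sum() + 1e-8)                   # per-example weights for the training loss
```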
The experiments conducted on the DCASE 2024 Task 1 dataset demonstrate the effectiveness of DDSC across various architectures and label budgets, particularly under low-label conditions. The results indicate significant improvements in accuracy, especially on unseen-device splits, showcasing the method's robustness. The paper provides a comprehensive comparison with existing curriculum learning strategies, reinforcing the advantages of DDSC.
The paper mentions that code will be released upon acceptance, which is a positive step towards reproducibility. However, the absence of a direct link to a repository or demo limits immediate access to the implementation details. The methodology is described in sufficient detail to allow for replication, but the lack of a public codebase at this stage is a drawback.
One limitation is the reliance on the DCASE 2024 dataset, which may not fully represent all real-world scenarios in acoustic scene classification. Additionally, while the method is architecture-agnostic, its performance may vary with different model architectures not tested in this work. The paper could also benefit from a discussion on the computational efficiency of the proposed method in practical applications.
The DDSC methodology has significant implications for improving acoustic scene classification systems, particularly in environments with limited labeled data and varying device characteristics. This approach could enhance applications in smart devices, urban sound monitoring, and assistive technologies for hearing-impaired individuals, leading to more robust and adaptable audio recognition systems. The paper presents DDSC, a dynamic curriculum learning method that effectively addresses domain shift in acoustic scene classification, showcasing substantial improvements in model performance under low-label conditions. The innovative methodology and rigorous experimental validation position this work as a valuable contribution to the field of machine learning and audio processing.
Deploying emotion recognition systems in real-world environments where devices must be small, low-power, and private remains a significant challenge. This is especially relevant for applications such as tension monitoring, conflict de-escalation, and responsive wearables, where cloud-based solutions are impractical. Multimodal emotion recognition has advanced through deep learning, but most systems remain unsuitable for deployment on ultra-constrained edge devices. Prior work typically relies on powerful hardware, lacks real-time performance, or uses unimodal input. This paper addresses that gap by presenting a hardware-aware emotion recognition system that combines acoustic and linguistic features using a late-fusion architecture optimised for Edge TPU. The design integrates a quantised transformer-based acoustic model with frozen keyword embeddings from a DSResNet-SE network, enabling real-time inference within a 1.8MB memory budget and 21-23ms latency. The pipeline ensures spectrogram alignment between training and deployment using MicroFrontend and MLTK. Evaluation on re-recorded, segmented IEMOCAP samples captured through the Coral Dev Board Micro microphone shows a 6.3% macro F1 improvement over unimodal baselines. This work demonstrates that accurate, real-time multimodal emotion inference is achievable on microcontroller-class edge platforms through task-specific fusion and hardware-guided model design.
Primary: Imperial College London
All Institutions: Imperial College London, Nottingham Trent University
This paper presents a novel approach to multimodal emotion recognition on ultra-low-power edge devices, combining acoustic and linguistic features through a late-fusion architecture. The methodology is well-defined, and the results demonstrate a meaningful advancement in the field, although further validation and resource sharing would enhance its impact.
The paper presents a well-structured methodology for late fusion of audio and text features tailored for ultra-low-power edge devices. The integration of a quantised transformer-based acoustic model with a lightweight keyword spotting model is innovative, particularly in the context of real-time emotion recognition. The use of MicroFrontend and MLTK for spectrogram alignment demonstrates a thoughtful approach to ensuring consistency between training and deployment, which is critical for performance in edge environments. The late-fusion architecture is a significant contribution as it allows for the combination of multimodal data while adhering to strict resource constraints.
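Schematically, the late fusion amounts to concatenating an utterance-level acoustic embedding with the frozen keyword embedding and passing the result through a small classifier; the dimensions and fusion head below are assumptions (the deployed models are additionally quantized for the Edge TPU).

```python
# Schematic late-fusion head (dimensions are assumptions; on-device models are quantized).
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    def __init__(self, acoustic_dim=128, keyword_dim=64, n_emotions=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(acoustic_dim + keyword_dim, 64), nn.ReLU(), nn.Linear(64, n_emotions)
        )

    def forward(self, acoustic_emb, keyword_emb):
        # acoustic_emb from the quantized transformer branch; keyword_emb frozen from DSResNet-SE.
        return self.classifier(torch.cat([acoustic_emb, keyword_emb], dim=-1))
```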
The experimental setup is robust, utilizing the IEMOCAP dataset for evaluation, which is appropriate for emotion recognition tasks. The reported 6.3% macro F1 improvement over unimodal baselines indicates a meaningful enhancement in performance, showcasing the effectiveness of the proposed architecture. However, the paper could benefit from a more extensive evaluation across diverse datasets and real-world conditions to fully validate the model's robustness and generalizability.
The paper provides a clear description of the model architecture and training procedures, which aids in reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for other researchers to replicate the results fully. Including such resources would enhance the paper's impact and utility in the research community.
The paper acknowledges limitations regarding robustness to varied environmental conditions and speaker diversity, which are critical factors in real-world applications. The focus on a specific dataset (IEMOCAP) may also limit the generalizability of the findings. Future work is suggested to address these limitations, but they remain a significant concern in the current study.
The proposed system has significant implications for privacy-preserving emotion recognition in wearable technology and other edge applications. By enabling real-time processing on low-power devices, the research opens avenues for practical applications in mental health monitoring, conflict resolution, and responsive devices, contributing to the growing field of ethical AI. This paper presents a novel approach to multimodal emotion recognition on ultra-low-power edge devices, combining acoustic and linguistic features through a late-fusion architecture. The methodology is well-defined, and the results demonstrate a meaningful advancement in the field, although further validation and resource sharing would enhance its impact.
Deep learning approaches for heart-sound (PCG) segmentation built on time-frequency features can be accurate but often rely on large expert-labeled datasets, limiting robustness and deployment. We present TopSeg, a topological representation-centric framework that encodes PCG dynamics with multi-scale topological features and decodes them using a lightweight temporal convolutional network (TCN) with an order- and duration-constrained inference step. To evaluate data efficiency and generalization, we train exclusively on the PhysioNet 2016 dataset with subject-level subsampling and perform external validation on the CirCor dataset. Under matched-capacity decoders, the topological features consistently outperform spectrogram and envelope inputs, with the largest margins at low data budgets; as a full system, TopSeg surpasses representative end-to-end baselines trained on their native inputs under the same budgets while remaining competitive at full data. Ablations at 10% training confirm that all scales contribute and that combining H_0 and H_1 yields more reliable S1/S2 localization and boundary stability. These results indicate that topology-aware representations provide a strong inductive bias for data-efficient, cross-dataset PCG segmentation, supporting practical use when labeled data are limited.
Primary: Xi’an Jiaotong-Liverpool University
All Institutions: Xi’an Jiaotong-Liverpool University
The paper presents TopSeg, a topological framework for data-efficient heart sound segmentation, which integrates multi-scale topological features with a lightweight TCN to achieve superior performance in low-data scenarios. This work represents a meaningful advancement in the application of topological data analysis in medical signal processing, with significant potential for real-world impact.
The proposed TopSeg framework utilizes a novel approach by integrating multi-scale topological features derived from persistent homology into a lightweight temporal convolutional network (TCN). This methodology is innovative as it operationalizes topological data analysis (TDA) specifically for phonocardiogram (PCG) segmentation, which has not been previously explored in this context. The extraction of topological features at multiple scales (global, meso, and fine) is well-justified, and the use of a convex refinement step at inference enhances the physiological consistency of the segmentation output. The methodology is robust and addresses the challenges of data efficiency in medical applications, making it a significant advancement in the field.
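One common way to obtain such features from a 1-D signal is a sliding-window (Takens) delay embedding followed by persistent homology on each window; the sketch below uses the ripser package for H_0/H_1 and summarizes each diagram by total persistence. The window and embedding sizes are placeholders, and the paper's multi-scale construction may differ.

```python
# Hedged sketch: delay-embed a PCG window and summarize H_0/H_1 persistence
# (window/embedding sizes are placeholders; the paper's multi-scale pipeline may differ).
import numpy as np
from ripser import ripser

def delay_embed(x, dim=3, tau=4):
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(dim)], axis=1)

def persistence_summary(window, dim=3, tau=4):
    cloud = delay_embed(window, dim, tau)
    dgms = ripser(cloud, maxdim=1)["dgms"]   # [H_0 diagram, H_1 diagram]
    totals = [np.sum(d[np.isfinite(d[:, 1]), 1] - d[np.isfinite(d[:, 1]), 0]) for d in dgms]
    return np.array(totals)                  # total persistence per homology dimension
```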
The experiments are comprehensive, utilizing the PhysioNet 2016 dataset for training and the CirCor dataset for external validation. The authors demonstrate the effectiveness of their approach through rigorous comparisons with baseline models, showing consistent performance improvements across various data budgets. The ablation studies effectively isolate the contributions of each component of the multi-scale topological features, providing clear evidence of their importance in achieving data-efficient segmentation. The results are statistically significant and support the claims made in the paper.
The paper provides sufficient detail regarding the methodology, including data preprocessing steps, model architecture, and training protocols, which enhances reproducibility. However, the absence of a public repository for the code or data limits the ability for others to directly replicate the results. Including a link to a project repository would significantly improve this aspect.
One limitation is the reliance on a single dataset for training and validation, which may affect the generalizability of the results to other datasets or clinical scenarios. Additionally, while the framework shows promise in low-data regimes, the performance in high-data scenarios is less emphasized, which could be a point of concern for practical applications. The absence of a demo or project URL also limits accessibility for further exploration of the framework.
The implications of this work are significant, particularly in the field of cardiac diagnostics where accurate and efficient segmentation of heart sounds can lead to better patient outcomes. The methodology could potentially be adapted for other biomedical signal processing tasks, thereby broadening its impact beyond just PCG segmentation. The focus on data efficiency is particularly relevant in clinical settings where labeled data is scarce, making this research highly applicable in real-world scenarios. The paper presents TopSeg, a topological framework for data-efficient heart sound segmentation, which integrates multi-scale topological features with a lightweight TCN to achieve superior performance in low-data scenarios. This work represents a meaningful advancement in the application of topological data analysis in medical signal processing, with significant potential for real-world impact.
Prevailing practice in learning-based audio watermarking is to pursue robustness by expanding the set of simulated distortions during training. However, such surrogates are narrow and prone to overfitting. This paper presents AWARE (Audio Watermarking with Adversarial Resistance to Edits), an alternative approach that avoids reliance on attack-simulation stacks and handcrafted differentiable distortions. Embedding is obtained via adversarial optimization in the time-frequency domain under a level-proportional perceptual budget. Detection employs a time-order-agnostic detector with a Bitwise Readout Head (BRH) that aggregates temporal evidence into one score per watermark bit, enabling reliable watermark decoding even under desynchronization and temporal cuts. Empirically, AWARE attains high audio quality and speech intelligibility (PESQ/STOI) and consistently low BER across various audio edits, often surpassing representative state-of-the-art learning-based audio watermarking systems.
Primary: unknown
All Institutions: unknown
This paper presents AWARE, a novel approach to audio watermarking that emphasizes robustness through adversarial optimization and a unique detection architecture. The technical contributions are substantial, providing a new direction for research in audio watermarking that could have lasting impacts on the field.
The methodology presented in AWARE is innovative as it shifts away from traditional attack-simulation stacks and handcrafted distortions, opting instead for an adversarial optimization approach in the time-frequency domain. The use of a Bitwise Readout Head (BRH) for detection is particularly noteworthy, as it aggregates temporal evidence in a manner that is agnostic to time-order, enhancing robustness against common audio edits. The embedding process is well-structured, utilizing perceptual budgets effectively and demonstrating a clear understanding of psychoacoustic principles. The paper also provides a comprehensive algorithmic breakdown, which aids in understanding the proposed methods.
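The readout can be pictured as per-frame, per-bit logits pooled over time into one score per bit, which is what makes it agnostic to temporal order; the pooling choice and shapes below are illustrative assumptions rather than the paper's architecture.

```python
# Illustrative bitwise readout: pool per-frame bit logits over time into one score per bit.
import torch
import torch.nn as nn

class BitwiseReadoutHead(nn.Module):
    def __init__(self, feat_dim=256, n_bits=32):
        super().__init__()
        self.to_bit_logits = nn.Linear(feat_dim, n_bits)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, feat_dim); pooling ignores frame order, so the decision
        # survives desynchronization and temporal cuts.
        logits = self.to_bit_logits(frame_feats)  # (batch, time, n_bits)
        scores = logits.mean(dim=1)               # one aggregated score per watermark bit
        return (scores > 0).long(), scores        # decoded bits and soft evidence
```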
The experimental evaluation is robust, comparing AWARE against strong baselines like WavMark and AudioSeal across multiple datasets (VCTK and LibriSpeech). The metrics used, including PESQ and STOI for audio quality and BER for robustness, are appropriate for the domain. The results indicate that AWARE achieves high audio quality and low BER across various audio edits, often outperforming state-of-the-art methods. The ablation studies further strengthen the findings by isolating key architectural components and their impacts on performance.
The paper lacks explicit details on the implementation, such as code availability or a public repository, which would facilitate reproducibility. While the methodology is described in detail, the absence of a project URL or demo limits the ability of other researchers to replicate the results independently.
One limitation noted is the slight compromise in audio quality compared to some baselines, which may be a trade-off for the increased robustness. Additionally, the paper does not address the performance of AWARE under all possible audio edits, and the robustness against more complex or novel distortions remains to be evaluated.
The implications of this research are significant, particularly in the context of digital rights management and content provenance in an era increasingly dominated by generative AI. The ability to watermark audio effectively and robustly against various edits could enhance trust in digital content and support efforts to combat misinformation and fraud. This paper presents AWARE, a novel approach to audio watermarking that emphasizes robustness through adversarial optimization and a unique detection architecture. The technical contributions are substantial, providing a new direction for research in audio watermarking that could have lasting impacts on the field.
Large Audio-Language Models (LALMs) are becoming essential as a powerful multimodal backbone for real-world applications. However, recent studies show that audio inputs can more easily elicit harmful responses than text, exposing new risks toward deployment. While safety alignment has made initial advances in LLMs and Large Vision-Language Models (LVLMs), we find that vanilla adaptation of these approaches to LALMs faces two key limitations: 1) LLM-based steering fails under audio input due to the large distributional gap between activations, and 2) prompt-based defenses induce over-refusals on benign-speech queries. To address these challenges, we propose Safe-Ablated Refusal Steering (SARSteer), the first inference-time defense framework for LALMs. Specifically, SARSteer leverages text-derived refusal steering to enforce rejection without manipulating audio inputs and introduces decomposed safe-space ablation to mitigate over-refusal. Extensive experiments demonstrate that SARSteer significantly improves harmful-query refusal while preserving benign responses, establishing a principled step toward safety alignment in LALMs.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of SARSteer, a novel inference-time defense framework for Large Audio-Language Models that aims to improve safety alignment by effectively managing harmful-query refusals while preserving benign responses. This work is significant as it addresses a critical gap in the safety of multimodal AI systems, particularly in the context of audio inputs, and proposes a unique methodology that could influence future research in the field.
The proposed Safe-Ablated Refusal Steering (SARSteer) framework is innovative in its approach to addressing the unique challenges posed by Large Audio-Language Models (LALMs). The methodology effectively combines text-derived refusal steering with a decomposed safe-space ablation technique, which is a novel contribution to the field. The authors provide a clear rationale for their approach, highlighting the limitations of existing safety alignment methods when applied to audio inputs. However, the paper could benefit from a more detailed explanation of the ablation process and its implementation.
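At the activation level, refusal steering and subspace ablation are both simple linear operations on hidden states; the sketch below shows the generic form (add a refusal direction, then project out a set of safe-space directions), with the directions, layer choice, and scale treated as assumptions rather than the paper's exact procedure.

```python
# Generic activation steering + subspace ablation (directions and scale are assumptions).
import torch

def steer_and_ablate(hidden, refusal_dir, safe_dirs, alpha=1.0):
    # hidden: (batch, seq, dim); refusal_dir: (dim,); safe_dirs: (k, dim), assumed orthonormal.
    h = hidden + alpha * refusal_dir                         # push activations toward refusal
    coeffs = torch.einsum("bsd,kd->bsk", h, safe_dirs)       # components along safe-space directions
    h = h - torch.einsum("bsk,kd->bsd", coeffs, safe_dirs)   # ablate those components
    return h
```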
The experiments conducted are extensive and demonstrate a significant improvement in harmful-query refusal rates while maintaining benign responses. The evaluation metrics used are appropriate, and the results are clearly presented. However, the paper lacks a comprehensive comparison with baseline methods, which would strengthen the claims of superiority for SARSteer. The datasets used for testing are not described in detail, which raises questions about the generalizability of the results.
The paper does not provide sufficient implementation details or code availability, which could hinder reproducibility. While the methodology is described, the absence of a clear protocol for replicating the experiments is a notable limitation. Including a link to a code repository or supplementary materials would enhance reproducibility.
One of the key limitations identified is the potential for over-refusal in benign-speech queries, which the authors attempt to mitigate through their proposed method. However, the effectiveness of this mitigation is not thoroughly validated across diverse audio inputs. Additionally, the paper does not address the scalability of the SARSteer framework in real-world applications, which is crucial for practical deployment.
The implications of this research are significant, as it addresses safety concerns in deploying LALMs, which are increasingly being integrated into various applications. The proposed framework could pave the way for safer audio interactions in AI systems, potentially impacting sectors such as customer service, healthcare, and education. However, the effectiveness of the method in real-world scenarios remains to be validated.
Large audio-language models (LALMs) extend text-based LLMs with auditory understanding, offering new opportunities for multimodal applications. While their perception, reasoning, and task performance have been widely studied, their safety alignment under paralinguistic variation remains underexplored. This work systematically investigates the role of speaker emotion. We construct a dataset of malicious speech instructions expressed across multiple emotions and intensities, and evaluate several state-of-the-art LALMs. Our results reveal substantial safety inconsistencies: different emotions elicit varying levels of unsafe responses, and the effect of intensity is non-monotonic, with medium expressions often posing the greatest risk. These findings highlight an overlooked vulnerability in LALMs and call for alignment strategies explicitly designed to ensure robustness under emotional variation, a prerequisite for trustworthy deployment in real-world settings.
Primary: National Taiwan University
All Institutions: National Taiwan University
This paper presents a pioneering investigation into the safety vulnerabilities of large audio-language models under emotional variations, revealing critical insights that could influence the design of more robust AI systems. The comprehensive methodology and significant findings underscore its importance in advancing the field of multimodal AI and safety alignment.
The methodology employed in this study is robust and systematic, involving the construction of a dataset specifically designed to test the safety alignment of LALMs under emotional variations. The authors utilize a well-defined process for synthesizing emotional speech instructions and ensure the quality of their dataset through human annotation. The use of established metrics (NRR and UR) to evaluate safety alignment is appropriate, although the reliance on pattern matching for NRR could be seen as a limitation. Overall, the methodology is sound and contributes significantly to the study's findings.
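To make the pattern-matching concern concrete, a non-refusal rate computed this way reduces to scanning model outputs for a fixed list of refusal phrases, so paraphrased refusals are counted as non-refusals. The snippet below is a hypothetical illustration; the phrase list and function name are ours, not the paper's.

REFUSAL_PATTERNS = [            # illustrative phrases, not the paper's actual list
    "i cannot", "i can't", "i am unable", "i'm sorry", "as an ai",
]

def non_refusal_rate(responses):
    """Fraction of responses that do NOT match any refusal pattern."""
    def is_refusal(text):
        t = text.lower()
        return any(p in t for p in REFUSAL_PATTERNS)
    non_refusals = [r for r in responses if not is_refusal(r)]
    return len(non_refusals) / max(len(responses), 1)

# A paraphrased refusal ("That request isn't something I'll help with.")
# contains none of the listed phrases, so it is counted as a non-refusal,
# which is exactly the limitation noted above.
print(non_refusal_rate([
    "I cannot help with that.",
    "That request isn't something I'll help with.",
]))  # -> 0.5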
The experimental evaluation is comprehensive, covering a range of state-of-the-art LALMs and providing detailed results on their safety performance across different emotional expressions and intensities. The results clearly demonstrate the variability in safety alignment, highlighting the non-monotonic relationship between emotional intensity and unsafe responses. The analysis of multiple models adds depth to the findings, although the paper could benefit from a more detailed discussion of the statistical significance of the results.
The paper provides sufficient detail regarding the dataset construction and experimental setup, allowing for reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the experiments. The dataset is made available, which is a positive aspect for reproducibility.
One limitation is the potential bias in the dataset construction, as the emotional expressions are synthesized rather than recorded from real speakers, which may not fully capture the nuances of human emotion. Additionally, the study does not explore the underlying causes of the observed safety vulnerabilities, which could be critical for developing effective mitigation strategies. The reliance on specific models may also limit the generalizability of the findings.
The findings of this study have significant implications for the deployment of LALMs in real-world applications, particularly in sensitive areas where safety is paramount. By highlighting the vulnerabilities introduced by emotional variations, the research calls for improved safety alignment strategies, which could enhance the trustworthiness of AI systems in human-AI interactions. This work lays the groundwork for future research aimed at addressing these vulnerabilities.
Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models (SLMs). However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, demonstrating superior perceptual quality and intelligibility. Moreover, SAC substantially outperforms state-of-the-art codecs in semantic representation, achieving a level comparable to that of self-supervised learning (SSL) continuous embeddings. Finally, our analysis of speech disentanglement highlights the effectiveness of the dual-stream design, offering new potential for controllable speech applications.
Primary: Soul AI Lab
All Institutions: Soul AI Lab
The paper presents SAC, a novel neural speech codec that effectively disentangles semantic and acoustic information, achieving state-of-the-art performance in speech reconstruction and semantic representation. This work is a significant contribution to the field of audio processing, offering innovative methodologies and robust experimental validation that could shape future research and applications in speech technology.
The proposed SAC architecture introduces a dual-stream quantization approach that effectively disentangles semantic and acoustic representations, allowing for specialized optimization of each stream. This is a significant methodological advancement over existing codecs that typically fuse these representations, leading to potential improvements in both semantic fidelity and reconstruction quality. The use of a pre-trained semantic tokenizer and the incorporation of speaker feature supervision are innovative strategies that enhance the model's performance in capturing linguistic content and timbre, respectively. The architecture is built on a VQ-GAN framework, which is well-suited for the task of speech reconstruction.
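As a rough mental model of the dual-stream design (a minimal sketch under our own assumptions, not the actual SAC architecture, which adds a pre-trained semantic tokenizer, residual quantization, speaker supervision, and a VQ-GAN decoder), each stream can be quantized against its own codebook and the two code embeddings fused before decoding:

import numpy as np

def quantize(frames, codebook):
    """Nearest-neighbour vector quantization: return code indices and embeddings."""
    # frames: (T, d), codebook: (K, d)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def dual_stream_encode(semantic_feats, acoustic_feats, sem_codebook, ac_codebook):
    """Quantize the two streams separately and fuse them for the decoder input.

    semantic_feats : (T, d_s) features from a (pre-trained) semantic encoder
    acoustic_feats : (T, d_a) features from the acoustic encoder
    """
    sem_idx, sem_q = quantize(semantic_feats, sem_codebook)
    ac_idx, ac_q = quantize(acoustic_feats, ac_codebook)
    decoder_input = np.concatenate([sem_q, ac_q], axis=-1)  # (T, d_s + d_a)
    return (sem_idx, ac_idx), decoder_input

# Toy usage with random features and codebooks (sizes are illustrative).
T, d_s, d_a, K = 10, 16, 32, 64
tokens, dec_in = dual_stream_encode(
    np.random.randn(T, d_s), np.random.randn(T, d_a),
    np.random.randn(K, d_s), np.random.randn(K, d_a),
)

Keeping the two codebooks separate is what lets each stream specialize, semantic codes for linguistic content and acoustic codes for timbre and detail, which is the disentanglement the paper argues fused codecs lack.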
The experimental evaluation is comprehensive, utilizing diverse datasets and metrics to assess both reconstruction quality and semantic representation. The paper reports strong performance across various bitrates, demonstrating the robustness of SAC in both clean and noisy conditions. The results indicate that SAC outperforms state-of-the-art codecs in both speech intelligibility and semantic representation, which is validated through rigorous ablation studies. The use of benchmarks like UTMOS and WER provides a solid foundation for the claims made regarding performance improvements.
The paper includes detailed descriptions of the training setup, datasets, and evaluation metrics, which are essential for reproducibility. The authors provide links to the code and pre-trained models, facilitating further research and validation of their findings. However, the paper could benefit from additional details on hyperparameter tuning and specific training configurations to enhance reproducibility.
While SAC shows promising results, its generalizability to other audio domains, such as music or non-speech sounds, remains untested. The reliance on a speech-specific semantic tokenizer may limit the model's applicability in broader contexts. Additionally, the performance of SAC in real-world applications, such as live speech processing, has not been evaluated, which could reveal further limitations.
The advancements presented in SAC have significant implications for applications in speech compression, synthesis, and understanding, particularly in scenarios requiring high fidelity and intelligibility. The ability to disentangle semantic and acoustic features opens new avenues for controllable speech applications, such as voice conversion and personalized text-to-speech systems. This work could influence future research directions in audio processing and machine learning, promoting further exploration of dual-stream architectures in other domains.
Generative models have shown robust performance on speech enhancement and restoration tasks, but most prior approaches operate offline with high latency, making them unsuitable for streaming applications. In this work, we investigate the feasibility of a low-latency, real-time generative speech restoration system based on flow-matching (FM). Our method tackles diverse real-world tasks, including denoising, dereverberation, and generative restoration. The proposed causal architecture without time-downsampling introduces a total latency of only 20 ms, suitable for real-time communication. In addition, we explore a broad set of architectural variations and sampling strategies to ensure effective training and efficient inference. Notably, our flow-matching model maintains high enhancement quality with only 5 function evaluations (NFEs) during sampling, achieving performance similar to using ~20 NFEs under the same conditions. Experimental results indicate that causal FM-based models favor few-step reverse sampling, and smaller backbones degrade with longer reverse trajectories. We further show a side-by-side comparison of FM to typical adversarial-loss-based training for the same model architecture.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the development of a low-latency, real-time generative speech restoration system using flow-matching. The work is significant as it addresses the critical challenge of latency in speech enhancement, providing a promising direction for future research in real-time audio processing.
The paper introduces a novel causal architecture for flow-matching in speech restoration, which is significant for real-time applications. The methodology is well-structured, leveraging flow-matching principles to achieve low-latency processing. The authors provide a comprehensive explanation of their approach, including the mathematical foundation and architectural choices, which are critical for understanding the model's performance. However, the paper could benefit from a more detailed discussion of the architectural variations explored and their specific impacts on performance.
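For readers unfamiliar with the training objective, the standard conditional flow-matching loss, which the paper presumably builds on (the exact conditioning and path construction may differ), regresses a velocity field along straight paths between noise and clean speech:
\[
x_t = (1 - t)\,x_0 + t\,x_1, \qquad
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big[\,\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \rVert_2^2\,\big],
\]
with $t \sim \mathcal{U}[0,1]$, $x_0 \sim \mathcal{N}(0, I)$, $x_1$ the clean speech, and $c$ the degraded input used as conditioning; at inference the ODE $\dot{x}_t = v_\theta(x_t, t, c)$ is integrated from $t=0$ to $t=1$, so the NFE count equals the number of solver steps.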
The experimental setup is robust, utilizing established datasets (DNS and SIG challenges) for evaluation. The authors provide a thorough analysis of their model's performance across different metrics, including SIGMOS and DistillMOS, which are relevant for assessing audio quality. The comparison with GAN-based models is particularly insightful, highlighting the strengths and weaknesses of flow-matching in real-time scenarios. However, the results could be enhanced with more extensive ablation studies to isolate the effects of various architectural choices.
The paper outlines the training and evaluation processes clearly, including data augmentation strategies and evaluation metrics. However, the lack of a publicly available code repository limits reproducibility. Providing access to the model and training scripts would significantly enhance the ability of other researchers to replicate the findings.
The paper acknowledges the limitations of the flow-matching approach, particularly in comparison to GAN-based methods. It notes that while the proposed models achieve low latency, they do not consistently outperform existing techniques, indicating room for improvement. Additionally, the performance degradation observed with smaller backbones and longer reverse trajectories suggests that further optimization is needed.
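The trade-off around reverse-trajectory length is easier to see from the sampler itself: the NFE budget is simply the number of solver steps, i.e. the number of network calls per output frame or chunk. The sketch below is a generic fixed-step Euler sampler under assumed names (euler_sample, velocity_fn) and a toy velocity field; it is not the paper's inference code.

import numpy as np

def euler_sample(velocity_fn, x0, cond, nfe=5):
    """Integrate dx/dt = v(x, t, cond) from t=0 to t=1 with `nfe` Euler steps.

    velocity_fn : callable(x, t, cond) -> velocity; one network evaluation per call
    x0          : initial noise, same shape as the clean target
    cond        : conditioning features (e.g. the degraded/noisy input)
    nfe         : number of function evaluations = number of reverse steps
    """
    x = x0
    dt = 1.0 / nfe
    for i in range(nfe):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)  # one NFE per step
    return x

# Toy velocity field standing in for the trained causal network.
target = np.ones(160)                     # pretend "clean" signal
fake_velocity = lambda x, t, c: target - x
restored = euler_sample(fake_velocity, np.random.randn(160), cond=None, nfe=5)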
This research has significant implications for real-time communication applications, such as VoIP and teleconferencing, where low-latency speech restoration is crucial. The findings could influence future work in generative models for audio processing, potentially leading to advancements in other areas of speech technology.