Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
Primary: Tsinghua University
All Institutions: Tsinghua University
The main contribution of this paper is the introduction of PACE, a novel framework for pretrained audio continual learning that effectively addresses the unique challenges posed by shifting audio data distributions. This work significantly advances the field by providing a systematic approach to continual learning in audio, demonstrating state-of-the-art performance across multiple benchmarks while offering a foundation for future research in this area.
The methodology presented in this paper is robust and innovative, addressing the unique challenges of continual learning (CL) in audio contexts, particularly the upstream-downstream misalignment that has hindered previous approaches. The introduction of PACE, which combines improved first-session adaptation (FSA) with multi-session adaptation (MSA) and boundary-aware regularization, is a significant advancement. The paper meticulously details the design choices behind each component, demonstrating a clear understanding of the audio domain's intricacies. The use of analytic classifiers and adaptive subspace-orthogonal PEFT is particularly noteworthy, as it showcases a tailored approach to audio CL that diverges from traditional vision-based methods.
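To make the analytic-classifier idea concrete, the following is a minimal sketch of a regularized (ridge-style) analytic classifier fitted in closed form over frozen backbone features; the regularization strength `lam`, the feature dimensions, and the absence of any feature expansion are illustrative assumptions rather than the exact formulation used in PACE.

```python
import numpy as np

def fit_analytic_classifier(feats, labels, num_classes, lam=1.0):
    """Closed-form ridge (analytic) classifier over frozen backbone features.

    feats:  (N, D) pooled audio embeddings from the pretrained model.
    labels: (N,) integer class labels.
    lam:    ridge regularization strength (illustrative default).
    Solves W = (X^T X + lam * I)^{-1} X^T Y with one-hot targets Y.
    """
    X = feats
    Y = np.eye(num_classes)[labels]               # one-hot targets, (N, C)
    gram = X.T @ X + lam * np.eye(X.shape[1])     # regularized Gram matrix, (D, D)
    return np.linalg.solve(gram, X.T @ Y)         # weight matrix, (D, C)

def predict(feats, W):
    return (feats @ W).argmax(axis=1)

# Toy usage: 100 clips, 256-dim frozen features, 10 classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 256))
labels = rng.integers(0, 10, size=100)
W = fit_analytic_classifier(feats, labels, num_classes=10)
print(predict(feats[:5], W))
```

Because the weights have a closed-form solution, later sessions can accumulate the Gram matrix and the cross-term X^T Y and re-solve without revisiting earlier data, which is what makes analytic classifiers attractive in class-incremental settings.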
The experimental evaluation is thorough, employing six diverse audio CL benchmarks that effectively highlight the strengths and weaknesses of the proposed method. The results consistently demonstrate that PACE outperforms state-of-the-art methods, providing strong empirical evidence for its effectiveness. The ablation studies further reinforce the validity of the proposed components, illustrating how each contributes to the overall performance. However, the paper could benefit from additional comparisons with more recent methods in the audio domain, if available.
The authors commit to releasing their code and benchmarks, which is a positive aspect for reproducibility. The detailed descriptions of the experimental setup, including hyperparameters and dataset configurations, enhance the likelihood that other researchers can replicate the results. However, the absence of a demo or interactive component limits immediate accessibility for broader audiences.
One limitation is the potential for overfitting in fine-grained tasks, as indicated by the authors. The paper also acknowledges that while PACE narrows the gap to joint training, it does not completely eliminate it, suggesting that further improvements could be made. Additionally, the reliance on specific pretrained models may limit the generalizability of the findings across different audio tasks.
The implications of this work are significant, particularly for applications in speech recognition, audio event detection, and environmental sound understanding. By addressing the challenges of continual learning in audio, the proposed methods could enhance the robustness and adaptability of audio models in real-world scenarios, leading to more effective and reliable systems.
Recent advances in speech synthesis and editing have made speech spoofing increasingly difficult to detect. However, most existing methods treat spoofing detection as a binary classification task, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semantic effects. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti-spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic-level analysis, we introduce DailyTalkEdit, a new anti-spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in-context learning further improves out-of-domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.
Primary: Shanghai Artificial Intelligence Laboratory
All Institutions: Shanghai Artificial Intelligence Laboratory, Nanjing University
The paper presents HoliAntiSpoof, a pioneering framework that integrates holistic speech spoofing analysis with ALLMs, significantly advancing the field of audio anti-spoofing. The innovative approach and comprehensive evaluation demonstrate its potential to enhance speech security and understanding of spoofing behaviors.
The paper introduces HoliAntiSpoof, a novel framework that reformulates speech anti-spoofing as a unified text generation task using an audio large language model (ALLM). This approach allows for holistic analysis of spoofing techniques, integrating authenticity classification, spoofing method identification, and semantic influence analysis. The methodology is innovative as it combines traditional signal-level detection with semantic reasoning, addressing a gap in existing research that primarily focuses on binary classification. The introduction of the DailyTalkEdit dataset to support semantic analysis is a significant contribution, allowing for more realistic evaluations of spoofing impacts in conversational contexts.
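As an illustration of how spoofing analysis can be cast as unified text generation, the sketch below assembles one instruction-tuning example; the prompt wording and the JSON field names (authenticity, spoofing_method, affected_attributes, semantic_impact) are hypothetical placeholders, not the schema actually used by HoliAntiSpoof or DailyTalkEdit.

```python
import json

def build_antispoof_sample(audio_path, label, method=None, attributes=None, semantic_impact=None):
    """Assemble one instruction-tuning example that casts spoofing analysis as text generation.

    The prompt wording and field names below are illustrative assumptions,
    not the annotation schema actually used by HoliAntiSpoof.
    """
    prompt = ("Analyze the attached speech clip. State whether it is bona fide or spoofed, "
              "identify the spoofing method, list the manipulated speech attributes, "
              "and describe the semantic impact of the manipulation.")
    target = {
        "authenticity": label,                        # e.g. "spoofed" or "bona_fide"
        "spoofing_method": method or "none",          # e.g. "voice_conversion"
        "affected_attributes": attributes or [],      # e.g. ["timbre", "prosody"]
        "semantic_impact": semantic_impact or "none"  # free-text description
    }
    return {"audio": audio_path, "instruction": prompt,
            "response": json.dumps(target, ensure_ascii=False)}

example = build_antispoof_sample(
    "clips/dialogue_042_turn3.wav", "spoofed",
    method="text_based_speech_editing",
    attributes=["lexical_content"],
    semantic_impact="Changes the speaker's stated meeting time from 3pm to 8pm.")
print(example["response"])
```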
The experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across various settings, including in-domain and out-of-domain evaluations. The authors provide extensive results that validate the effectiveness of their model, particularly in terms of robustness to domain shifts. The use of multiple datasets, including their newly proposed ones, strengthens the experimental design. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The authors have made their data and code publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training procedures, which could hinder full reproducibility for other researchers.
One limitation is the reliance on the quality of the datasets, particularly the DailyTalkEdit, which may not cover all possible spoofing scenarios. Additionally, while the model shows promise in generalization, the performance on truly unseen spoofing methods and languages remains to be fully validated. The paper also does not address potential adversarial uses of the methodology, which could be a concern given the nature of the research.
The research has significant implications for speech security, particularly in combating the rising threats posed by speech deepfakes. By providing a more nuanced understanding of spoofing techniques and their semantic impacts, the framework could enhance the development of more robust detection systems. However, there is a risk that the methodologies developed could also be exploited by malicious actors to improve spoofing techniques.
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation (KD) is effective for LALM compression, existing methods leave distillation of the cross-modal projection module (Projector) underexplored and often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment (AwCKA), a novel approach that highlights important time steps and addresses dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher to a compact 1.1B-parameter student, consistently outperforming the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
Primary: Harbin Institute of Technology
All Institutions: Harbin Institute of Technology, Ping An Technology (Shenzhen) Co
The paper presents a novel knowledge distillation framework, PL-Distill, that effectively compresses large audio-language models for speech emotion recognition while maintaining high performance. The innovative methodologies and comprehensive experimental evaluations contribute significantly to the advancement of knowledge distillation techniques in multimodal machine learning.
The proposed PL-Distill framework introduces a dual-level knowledge distillation approach that effectively addresses the challenges of distilling large audio-language models for speech emotion recognition. The incorporation of Attention-weighted Centered Kernel Alignment (AwCKA) is particularly innovative, as it dynamically prioritizes important audio tokens based on attention scores, thereby enhancing the alignment of audio embeddings despite dimensional mismatches. This methodological advancement is well-justified in the context of previous work and represents a significant contribution to the field of knowledge distillation in multimodal models.
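For readers unfamiliar with CKA-based alignment, the sketch below combines a plain linear CKA term with optional per-time-step attention weights and a temperature-scaled KL term over logits; treating the attention scores as per-step scaling is an assumption about how AwCKA might weight frames, and the shapes and temperature are illustrative rather than taken from PL-Distill.

```python
import torch
import torch.nn.functional as F

def linear_cka(x, y, weights=None, eps=1e-8):
    """Linear CKA between student and teacher embeddings of different widths.

    x: (T, d_s) student audio embeddings; y: (T, d_t) teacher embeddings.
    weights: optional (T,) attention scores used to emphasize important time steps.
    Applying them as per-step scaling is an assumption, not the exact AwCKA formulation.
    """
    if weights is not None:
        w = weights / (weights.sum() + eps)
        x = x * w.sqrt().unsqueeze(-1)
        y = y * w.sqrt().unsqueeze(-1)
    x = x - x.mean(dim=0, keepdim=True)  # center over time steps
    y = y - y.mean(dim=0, keepdim=True)
    num = (y.T @ x).norm(p="fro") ** 2
    den = (x.T @ x).norm(p="fro") * (y.T @ y).norm(p="fro")
    return num / (den + eps)

def logits_kl(student_logits, teacher_logits, tau=2.0):
    """Temperature-scaled KL divergence between teacher and student logits."""
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

# Toy usage: 50 time steps, teacher width 1024, student width 512, 4 emotion classes.
t_emb, s_emb, attn = torch.randn(50, 1024), torch.randn(50, 512), torch.rand(50)
loss = (1.0 - linear_cka(s_emb, t_emb, attn)) + logits_kl(torch.randn(8, 4), torch.randn(8, 4))
print(loss.item())
```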
The experimental evaluation is robust, utilizing three widely recognized datasets (IEMOCAP, RAVDESS, and SAVEE) to validate the effectiveness of the proposed method. The results demonstrate that PL-Distill not only compresses the teacher model significantly but also outperforms both the teacher and state-of-the-art models across all metrics. The ablation studies further substantiate the contributions of each component of the framework, providing a clear understanding of the impact of the proposed methods.
The paper provides detailed descriptions of the model architecture, training strategies, and evaluation metrics, which are essential for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully generalize to other SER tasks or datasets. Additionally, while the method shows promise, the computational efficiency of the distillation process itself could be further explored to ensure practical applicability in real-world scenarios.
The implications of this research extend beyond speech emotion recognition, as the PL-Distill framework could be adapted for various audio-language tasks, potentially improving the efficiency of deploying large models in resource-constrained environments. The focus on effective knowledge transfer in multimodal contexts may also inspire future research in related areas.
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS, which features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To reduce inference cost, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh, with high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of ARCHI-TTS, a novel non-autoregressive text-to-speech model that effectively addresses the challenges of text-speech alignment and computational efficiency through innovative architectural components. The comprehensive analysis of its technical contributions, methodology, and results positions it as a significant advancement in the TTS domain, with potential for impactful applications in various audio synthesis tasks.
The methodology proposed in ARCHI-TTS is innovative, combining a semantic aligner with a flow-matching decoder to address the challenges of text-speech alignment and inference efficiency in TTS systems. The use of a low-token-rate representation derived from a Variational Autoencoder (VAE) is a significant advancement, allowing for a more compact representation of audio data while maintaining quality. The architecture's reliance on a transformer-based semantic aligner to create self-supervised text-aligned semantic representations is a novel approach that enhances the model's ability to generate coherent and contextually relevant speech. The integration of an auxiliary CTC loss to bolster semantic understanding further demonstrates a thoughtful approach to improving the model's performance.
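The inference-efficiency claim rests on computing the condition-encoder output once and reusing it at every denoising step; the sketch below shows that pattern with stand-in `encoder` and `decoder_step` callables, which are assumptions standing in for the actual ARCHI-TTS modules.

```python
import torch

class CachedConditionSampler:
    """Minimal sketch of reusing condition-encoder features across denoising steps.

    `encoder` and `decoder_step` are placeholders for a condition encoder and one
    flow-matching (or diffusion) update; their interfaces are assumptions, not the
    actual ARCHI-TTS modules.
    """
    def __init__(self, encoder, decoder_step, num_steps=32):
        self.encoder, self.decoder_step, self.num_steps = encoder, decoder_step, num_steps

    @torch.no_grad()
    def sample(self, text_tokens, latent_shape):
        cond = self.encoder(text_tokens)          # computed once, reused every step
        x = torch.randn(latent_shape)             # start from Gaussian noise
        for i in range(self.num_steps):
            t = torch.full((latent_shape[0],), i / self.num_steps)
            x = self.decoder_step(x, t, cond)     # only the decoder runs per step
        return x

# Toy usage with stand-in modules.
enc = lambda tokens: tokens.float().unsqueeze(-1).repeat(1, 1, 8)              # (B, L, 8)
step = lambda x, t, cond: x + 0.1 * (cond.mean(dim=(1, 2), keepdim=True) - x)  # toy update
sampler = CachedConditionSampler(enc, step, num_steps=4)
out = sampler.sample(torch.randint(0, 100, (2, 16)), (2, 1, 8))
print(out.shape)
```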
The experimental evaluation is robust, utilizing a large-scale multilingual dataset (100k hours) for training and multiple established benchmarks for testing. The reported results, including a WER of 1.98% on the LibriSpeech-PC test-clean and competitive performance on the SeedTTS test set, indicate that ARCHI-TTS outperforms several state-of-the-art models while using fewer computational resources. The inclusion of ablation studies adds depth to the evaluation, providing insights into the contributions of various architectural components. However, the paper could benefit from more extensive subjective evaluations to further validate the quality of the generated speech.
The paper provides sufficient details regarding the model configuration, training process, and evaluation metrics, which should facilitate reproducibility. The authors mention the use of specific hardware (8 RTX 5090 GPUs) and training duration, which are valuable for replicating the experiments. However, the lack of a direct link to the code repository limits accessibility for other researchers wishing to reproduce the results.
While the proposed model shows promising results, it does exhibit some limitations, such as slightly lagging behind other state-of-the-art models in subjective quality evaluations. The reliance on a specific dataset (Emilia) may also limit the generalizability of the findings. Additionally, the computational efficiency improvements come at the cost of some performance degradation, which may need further exploration.
The advancements presented in ARCHI-TTS have significant implications for the field of TTS and audio synthesis, particularly in enhancing the efficiency and quality of speech generation. The model's ability to perform zero-shot synthesis with high fidelity could lead to broader applications in voice cloning, audiobooks, and interactive voice response systems. As TTS technology continues to evolve, the methodologies introduced in this paper could influence future research directions and commercial applications.
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments show that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
Primary: Zhejiang University
All Institutions: Zhejiang University
The main contribution of this paper is the introduction of HyperPotter, a hypergraph-based framework for audio deepfake detection that effectively captures high-order interactions, demonstrating substantial improvements over existing methods. This work represents a meaningful advancement in the field of audio deepfake detection, with the potential to influence future research directions and applications.
The proposed HyperPotter framework introduces a novel approach to audio deepfake detection by leveraging hypergraphs to model high-order interactions (HOIs). This is a significant departure from traditional methods that focus primarily on local features or pairwise relations. The use of clustering-based hyperedges with class-aware prototype initialization is innovative and suggests a deeper understanding of the relationships between features. However, the paper could benefit from a more detailed explanation of the hypergraph construction process and the specific clustering techniques employed.
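One plausible reading of clustering-based hyperedges with class-aware prototype initialization is sketched below: frame-level features are clustered with k-means whose initial centroids include per-class prototypes, and cluster membership defines the hypergraph incidence matrix. The number of extra clusters and the seeding strategy are assumptions for illustration, not HyperPotter's exact recipe.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_hyperedges(frame_feats, class_prototypes, extra_clusters=4):
    """Cluster frame-level features into hyperedges, seeding k-means with class prototypes.

    frame_feats:      (N, D) per-frame features from a batch of utterances.
    class_prototypes: (C, D) mean features per class (here: bona fide vs. spoofed),
                      used as class-aware initial centroids; this seeding is an
                      illustrative reading, not HyperPotter's exact recipe.
    Returns a binary incidence matrix H of shape (N, C + extra_clusters), where
    H[i, e] = 1 if frame i belongs to hyperedge e.
    """
    n_edges = class_prototypes.shape[0] + extra_clusters
    rng = np.random.default_rng(0)
    extra = frame_feats[rng.choice(len(frame_feats), extra_clusters, replace=False)]
    init = np.vstack([class_prototypes, extra])
    km = KMeans(n_clusters=n_edges, init=init, n_init=1).fit(frame_feats)
    H = np.zeros((len(frame_feats), n_edges))
    H[np.arange(len(frame_feats)), km.labels_] = 1.0
    return H

# Toy usage: 200 frames, 64-dim features, 2 classes.
rng = np.random.default_rng(1)
feats = rng.normal(size=(200, 64))
protos = np.stack([feats[:100].mean(axis=0), feats[100:].mean(axis=0)])
print(build_hyperedges(feats, protos).sum(axis=0))  # number of frames per hyperedge
```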
The experiments are extensive, covering 11 datasets and demonstrating a relative gain of 22.15% over baseline methods, as well as a 13.96% improvement over state-of-the-art methods on challenging cross-domain datasets. This breadth of evaluation is commendable and indicates robust performance across various scenarios. However, the paper lacks a detailed comparison with other recent methodologies in the field, which could provide further context for the results.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Clear guidelines on how to replicate the experiments, including hyperparameter settings and dataset access, would enhance the paper's impact.
One limitation is the potential complexity of the hypergraph model, which may require significant computational resources and expertise to implement. Additionally, while the results are promising, the paper does not address the scalability of the approach or its performance in real-time applications.
The implications of this research are significant, particularly in the context of increasing audio deepfake threats. The ability to detect sophisticated audio manipulations could enhance security in various applications, including media verification, cybersecurity, and content authenticity. The methodology could also inspire further research into high-order interactions in other domains beyond audio.
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprint required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy in which the SST learns to compress information progressively, advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AudioMarathon. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
Primary: Nankai University
All Institutions: Nankai University, Alibaba International Digital Commerce, University of Exeter
Speech-XL presents a significant advancement in long-form speech understanding through its innovative use of Speech Summarization Tokens and curriculum learning strategies. This work not only addresses critical limitations in existing models but also sets the stage for future developments in efficient audio processing methodologies.
The methodology presented in Speech-XL is innovative, particularly with the introduction of the Speech Summarization Token (SST) for compressing long-form audio data. The model effectively addresses the limitations of existing Large Speech Language Models (LSLMs) by leveraging a curriculum learning approach to progressively train the SST for varying compression ratios. This structured training strategy enhances the model's ability to maintain semantic integrity while reducing memory usage. The dual-adapter bridge architecture is also a notable contribution, allowing for effective integration of acoustic and semantic features into the LLM's framework.
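The sketch below illustrates the two ingredients described above under stated assumptions: a stage-wise curriculum that moves from low to high compression ratios, and the insertion of one SST after every speech interval so that only the SST positions' KV pairs need to be retained. The specific ratios, stage split, and interval layout are illustrative, not Speech-XL's published configuration.

```python
def compression_ratio_schedule(step, total_steps, ratios=(4, 8, 16, 32, 64)):
    """Curriculum over KV-compression ratios: early steps use low (easy) ratios,
    later steps use high (hard) ones. The ratio values and the equal-sized stage
    split are illustrative assumptions."""
    stage = min(int(step / total_steps * len(ratios)), len(ratios) - 1)
    return ratios[stage]

def insert_summary_tokens(speech_token_ids, sst_id, interval_len):
    """Append one Speech Summarization Token (SST) after every interval of
    `interval_len` speech tokens; keeping only the SST positions' KV pairs at
    inference would compress the speech context by roughly `interval_len`-to-1."""
    out = []
    for start in range(0, len(speech_token_ids), interval_len):
        out.extend(speech_token_ids[start:start + interval_len])
        out.append(sst_id)
    return out

# Toy usage: 10k training steps, 1200 speech tokens, SST id = -1.
print([compression_ratio_schedule(s, 10_000) for s in (0, 3_000, 9_999)])
ratio = compression_ratio_schedule(5_000, 10_000)
seq = insert_summary_tokens(list(range(1200)), sst_id=-1, interval_len=ratio)
print(ratio, len(seq), seq.count(-1))  # number of SSTs = number of compressed intervals
```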
The experimental setup is robust, utilizing significant datasets like LongSpeech and AudioMarathon to evaluate the model's performance across various tasks. The results indicate that Speech-XL outperforms existing models in several benchmarks, demonstrating its effectiveness in long-form audio understanding. The comparative analysis with upper-bound models and other state-of-the-art systems provides a clear picture of its capabilities, although the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper outlines a clear training process and provides details on the datasets used, model architecture, and training parameters. However, the absence of a publicly accessible code repository or demo limits reproducibility. Future work should consider releasing the model and training scripts to enhance transparency and allow for independent verification of results.
One limitation is the reliance on a relatively small training dataset for certain tasks, which may affect the generalizability of the model across diverse audio contexts. Additionally, the model's performance in out-of-domain evaluations suggests that it may struggle with audio types not represented in the training data. The authors acknowledge the need for broader training data to fully leverage the SST mechanism's potential.
The advancements in long-form speech understanding have significant implications for various applications, including transcription services, virtual assistants, and accessibility technologies. By improving the efficiency and accuracy of processing long audio sequences, Speech-XL could enhance user experiences in these domains. The work also opens avenues for future research into more sophisticated audio processing techniques that could benefit from the SST framework.
We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre-trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A-ControlNet and T2A-Adapter, and show that the T2A-Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on the AudioSet-Strong benchmark in both event-level and segment-level F1 scores. We further extend this framework to audio editing, proposing T2A-Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The paper presents the Audio ControlNet framework, which enhances text-to-audio generation and editing capabilities through lightweight auxiliary networks, achieving state-of-the-art performance with efficient parameter usage. The methodology and results indicate a meaningful contribution to the field of audio generation, with significant implications for creative industries.
The paper introduces the Audio ControlNet framework, which innovatively builds on pre-trained text-to-audio (T2A) models by integrating lightweight auxiliary networks for fine-grained control over audio attributes such as loudness, pitch, and sound events. The two proposed architectures, T2A-ControlNet and T2A-Adapter, are well-structured, with T2A-Adapter demonstrating efficiency through fewer parameters while maintaining high performance. The methodology is sound, leveraging established techniques from the ControlNet paradigm and adapting them to the audio domain, thus showcasing a thoughtful approach to enhancing existing models without extensive retraining.
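A minimal sketch of the adapter-style conditioning discussed above follows: a small, zero-initialized projection maps per-frame control curves (e.g., loudness, pitch, event roll) into the hidden space of a frozen backbone block and adds them residually. The layer sizes and zero-init choice follow common ControlNet/adapter practice and are assumptions, not the exact T2A-Adapter design.

```python
import torch
import torch.nn as nn

class ControlAdapter(nn.Module):
    """Lightweight adapter that injects frame-level control signals into a frozen
    text-to-audio backbone block via a zero-initialized projection, in the spirit
    of ControlNet/adapter conditioning. Shapes and initialization are illustrative."""
    def __init__(self, control_dim, hidden_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(control_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim))
        nn.init.zeros_(self.proj[-1].weight)   # contributes nothing at initialization
        nn.init.zeros_(self.proj[-1].bias)

    def forward(self, backbone_hidden, controls):
        # backbone_hidden: (B, T, hidden_dim) features from the frozen T2A block
        # controls:        (B, T, control_dim), e.g. [loudness, f0, event roll]
        return backbone_hidden + self.proj(controls)

adapter = ControlAdapter(control_dim=3, hidden_dim=512)
h = torch.randn(2, 250, 512)     # frozen backbone features
ctrl = torch.randn(2, 250, 3)    # per-frame control curves
print(adapter(h, ctrl).shape)    # torch.Size([2, 250, 512])
```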
The experiments are comprehensive, utilizing the AudioSet-Strong dataset for both training and evaluation, which is appropriate given the task. The results indicate that T2A-Adapter achieves state-of-the-art performance in sound event detection metrics, outperforming existing models while using significantly fewer parameters. The paper includes both objective metrics (F1 scores) and subjective evaluations (MOS), providing a well-rounded assessment of model performance. However, the paper could benefit from more detailed comparisons with a broader range of baseline models to further validate its claims.
The authors mention plans to release models, code, dataset pipelines, and benchmarks, which is a positive step towards reproducibility. However, specific implementation details, such as hyperparameter settings and training configurations, could be more explicitly stated to enhance clarity and facilitate replication by other researchers.
The paper acknowledges limitations, such as the computational constraints that prevented exhaustive hyperparameter searches and the focus on a limited set of control conditions. Additionally, the reliance on generalization for multi-condition control at inference time may not be robust across all scenarios. Future work is suggested to explore richer control signals and more comprehensive multi-condition training.
The framework has significant potential applications in sound design, music creation, and video production, where precise audio generation and editing are crucial. The ability to manipulate audio attributes with fine granularity can enhance creative workflows and enable new forms of audio content generation. However, ethical considerations regarding the misuse of generated audio, such as impersonation or disinformation, must be addressed to ensure responsible deployment.
Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
Primary: Huawei Noah's Ark Lab
All Institutions: Huawei Noah's Ark Lab
This paper presents a comprehensive investigation into the application of Sparse Autoencoders for interpreting audio models, significantly advancing the understanding of audio representations and their alignment with human cognitive processes. The innovative methodology and rigorous experimental evaluation contribute valuable insights to the field of machine learning in audio processing.
The paper employs Sparse Autoencoders (SAEs) to analyze the activations of Whisper and HuBERT models, providing a systematic approach to feature extraction and interpretability in audio processing. The methodology includes a comprehensive evaluation of feature stability, interpretability, and practical applications, which is a significant advancement in the field. The use of various metrics for validation and the introduction of novel techniques for feature steering and EEG correlation analysis enhance the robustness of the methodology.
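For concreteness, the following is a minimal top-k sparse autoencoder over frozen encoder activations, together with a simple feature-steering step that shifts activations along one decoder direction; the dictionary size, top-k sparsity, and steering coefficient are illustrative assumptions, and the paper's SAE variant may differ.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal top-k sparse autoencoder over frozen encoder activations.
    The top-k sparsity and dictionary size are illustrative assumptions."""
    def __init__(self, d_model, n_features, k=32):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)
        self.k = k

    def forward(self, acts):
        z = torch.relu(self.enc(acts))
        topk = torch.topk(z, self.k, dim=-1)
        mask = torch.zeros_like(z).scatter_(-1, topk.indices, 1.0)
        z_sparse = z * mask                      # keep only the k largest features
        return self.dec(z_sparse), z_sparse

def steer(acts, sae, feature_idx, alpha=-5.0):
    """Feature steering: shift activations along one decoder direction, e.g. to
    suppress a hypothetical 'non-speech noise' feature."""
    direction = sae.dec.weight[:, feature_idx]   # (d_model,) dictionary atom
    return acts + alpha * direction

sae = TopKSAE(d_model=768, n_features=8192, k=32)
acts = torch.randn(10, 768)                      # e.g. Whisper encoder frames
recon, codes = sae(acts)
print(recon.shape, codes.count_nonzero(dim=-1))  # at most 32 active features per frame
print(steer(acts, sae, feature_idx=123).shape)
```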
The experiments are well-structured, utilizing a diverse corpus of audio data for training and evaluation. The authors demonstrate the effectiveness of SAEs in capturing semantic and paralinguistic information, with results showing a substantial reduction in false detections when steering Whisper's features. The correlation with EEG activity adds a neuroscientific dimension to the findings, indicating a deeper understanding of audio processing in relation to human cognition.
The paper provides detailed implementation information, including model architectures, training setups, and hyperparameters, which supports reproducibility. The availability of code and checkpoints on GitHub further enhances the potential for other researchers to replicate the study and build upon its findings.
The paper acknowledges limitations in its scope, including a focus on specific classification tasks and the exclusion of larger model variants due to computational constraints. Additionally, the auto-interpretation method's reliance on a captioning model trained primarily on music and sound data may lead to generic interpretations of speech-related features.
The findings have significant implications for audio processing applications, particularly in improving speech recognition systems and understanding human auditory processing. The techniques developed could be applied to various domains, including speech enhancement, emotion recognition, and environmental sound classification, potentially leading to advancements in human-computer interaction and accessibility technologies.
Transformer-based models have shown strong performance in speech deepfake detection, largely due to the effectiveness of the multi-head self-attention (MHSA) mechanism. MHSA provides frame-level attention scores, which are particularly valuable because deepfake artifacts often occur in small, localized regions along the temporal dimension of speech. This makes fine-grained frame modeling essential for accurately detecting subtle spoofing cues. In this work, we propose fine-grained frame modeling (FGFM) for MHSA-based speech deepfake detection, where the most informative frames are first selected through a multi-head voting (MHV) module. These selected frames are then refined via a cross-layer refinement (CLR) module to enhance the model's ability to learn subtle spoofing cues. Experimental results demonstrate that our method outperforms the baseline model and achieves Equal Error Rates (EER) of 0.90%, 1.88%, and 6.64% on the LA21, DF21, and ITW datasets, respectively. These consistent improvements across multiple benchmarks highlight the effectiveness of our fine-grained modeling for robust speech deepfake detection.
Primary: Hanoi University of Science and Technology
All Institutions: Hanoi University of Science and Technology, Nanyang Technological University
The paper presents a novel approach to speech deepfake detection through fine-grained frame modeling, significantly improving the ability to capture subtle artifacts. This work is a meaningful contribution to the field of audio processing and machine learning, addressing critical challenges in the detection of synthetic speech.
The proposed methodology introduces a novel fine-grained frame modeling (FGFM) approach that effectively enhances the multi-head self-attention (MHSA) mechanism for speech deepfake detection. The integration of the multi-head voting (MHV) module to select salient frames and the cross-layer refinement (CLR) module to aggregate information across layers is innovative. This dual approach addresses the limitations of conventional MHSA by focusing on localized artifacts, which are critical for detecting subtle spoofing cues. The methodology is well-structured and builds upon existing transformer architectures, demonstrating a clear understanding of the challenges in deepfake detection.
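A simple way to picture the multi-head voting step is sketched below: each attention head votes for the frames that receive the most attention mass, and the frames with the most votes across heads are kept for refinement. This per-head top-k voting is an illustrative reading of the MHV module, not its exact formulation.

```python
import torch

def multi_head_vote(attn, top_k=32):
    """Select the most informative frames by letting attention heads vote.

    attn: (n_heads, T, T) self-attention weights from one transformer layer.
    Each head votes for the frames receiving the most attention (column mass);
    the frames with the most votes across heads are kept.
    """
    n_heads, T, _ = attn.shape
    per_head_scores = attn.sum(dim=1)                     # (n_heads, T) attention received
    per_head_picks = per_head_scores.topk(top_k, dim=-1).indices
    votes = torch.zeros(T)
    for h in range(n_heads):
        votes[per_head_picks[h]] += 1.0
    return votes.topk(top_k).indices.sort().values        # indices of selected frames

attn = torch.softmax(torch.randn(8, 200, 200), dim=-1)    # 8 heads, 200 frames
print(multi_head_vote(attn, top_k=16))
```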
The experimental evaluation is robust, utilizing multiple datasets (ASVspoof 2021 LA, DF, and ITW) to validate the effectiveness of the proposed method. The reported Equal Error Rates (EER) indicate significant improvements over baseline models, showcasing the method's effectiveness across diverse conditions. The inclusion of ablation studies further strengthens the evaluation, providing insights into the contributions of individual components of the proposed framework.
The paper provides sufficient detail regarding the experimental setup, including model configurations and training procedures, which supports reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings. Future work should consider making the implementation accessible to enhance reproducibility.
While the proposed method shows promising results, it may still be sensitive to variations in the quality of the input audio, such as background noise or recording conditions. Additionally, the reliance on specific datasets may limit the generalizability of the findings to real-world applications. The paper could benefit from a discussion on how the model performs under such conditions.
The implications of this research are significant, particularly in the context of biometric security and misinformation. As deepfake technology becomes more sophisticated, effective detection methods are crucial for safeguarding against potential abuses in various sectors, including finance and communication. The proposed FGFM approach could contribute to the development of more reliable detection systems, thereby enhancing trust in voice-based interactions.
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 synthetic unsafe spoken dialogues in English, each consisting of 3-10 turns, in which a single dialogue turn contains content from one of 8 harmful categories (e.g., violence) at one of 5 severity grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, and MERaLiON) as zero-shot judges that output a scalar safety score in [0,1] across audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure the judges' sensitivity in detecting unsafe content, their specificity in ordering severity levels, and the stability of their scores across dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large may significantly reduce sensitivity for transcription-only modes, while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.
Primary: Technion – Israel Institute of Technology
All Institutions: Technion – Israel Institute of Technology, Carnegie Mellon University
The main contribution of this paper is the introduction of a controlled benchmark and systematic study of large audio-language models (LALMs) as automated safety judges for multi-turn spoken dialogues. This work addresses a critical gap in the evaluation of spoken dialogue systems, highlighting the importance of audio-specific cues and transcription fidelity in assessing socially harmful content. The comprehensive analysis of model performance across various configurations provides valuable insights for practitioners in the field.
The methodology presented in this paper is robust and innovative, focusing on the generation of unsafe spoken dialogues and the evaluation of large audio-language models (LALMs) as safety judges. The controlled generation of unsafe dialogue variants, along with the systematic benchmarking of LALMs across different modalities, is a significant contribution to the field. The use of human raters to validate the generated unsafe dialogues and the severity scale adds credibility to the findings. The paper also effectively addresses the challenges of audio-specific cues and transcription errors, which are often overlooked in text-centric assessments.
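To make the three judge properties measurable in code, the toy evaluation below computes detection sensitivity at a fixed threshold, a Spearman correlation between judge scores and severity grades, and score variance as the unsafe turn moves across positions; the threshold, correlation choice, and stability definition are assumptions for illustration rather than the paper's exact metrics.

```python
import numpy as np
from scipy.stats import spearmanr

def judge_metrics(scores, is_unsafe, severity, per_turn_scores, threshold=0.5):
    """Toy computation of the three judge properties studied in the paper.

    scores:          (N,) scalar safety scores in [0, 1] emitted by the LALM judge.
    is_unsafe:       (N,) binary ground-truth labels for the dialogues.
    severity:        (N,) severity grades 1-5 of the unsafe turn.
    per_turn_scores: (N, T) scores re-computed as the unsafe turn moves across T positions.
    """
    pred_unsafe = scores >= threshold
    sensitivity = (pred_unsafe & (is_unsafe == 1)).sum() / max(int((is_unsafe == 1).sum()), 1)
    rho, _ = spearmanr(scores[is_unsafe == 1], severity[is_unsafe == 1])  # severity ordering
    turn_stability = per_turn_scores.std(axis=1).mean()   # lower means more stable across turns
    return sensitivity, rho, turn_stability

# Toy usage with synthetic judge outputs.
rng = np.random.default_rng(0)
n = 200
is_unsafe = rng.integers(0, 2, n)
severity = rng.integers(1, 6, n)
scores = np.clip(0.2 * is_unsafe + 0.1 * severity * is_unsafe + rng.normal(0, 0.1, n), 0, 1)
per_turn = np.clip(scores[:, None] + rng.normal(0, 0.05, (n, 8)), 0, 1)
print(judge_metrics(scores, is_unsafe, severity, per_turn))
```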
The experimental evaluation is thorough, with a well-defined dataset of 24,000 dialogues and a clear methodology for assessing the performance of the LALMs. The results reveal important trade-offs between sensitivity, specificity, and stability across different models and modalities. The use of various prompting strategies to optimize performance further demonstrates a comprehensive approach to evaluating the models. However, the paper could benefit from more detailed statistical analysis and comparisons with existing benchmarks in the field.
The paper mentions plans to release the dataset and code, which is crucial for reproducibility. However, specific implementation details, such as the exact configurations used for the LALMs and the human raters' instructions, should be more explicitly stated to facilitate replication of the study. The inclusion of supplementary materials or appendices would enhance reproducibility.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world dialogues. Additionally, the potential for bias in the generated unsafe dialogues and the subjective nature of human ratings could impact the validity of the findings. The paper also acknowledges the risk of misuse of the benchmark data, which is an important ethical consideration.
The findings of this research have significant implications for the development of safer spoken dialogue systems and voice agents. By providing a systematic approach to evaluating harmful content in multi-turn dialogues, the work aims to improve the safety and reliability of voice interfaces. However, the potential for misuse of the generated data and the reliance on automated judges without human oversight could lead to unintended consequences in real-world applications.
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Independent Researcher
The main contribution of this paper is the introduction of UniAudio 2.0, a unified audio language model that leverages a novel tokenization strategy and specialized architecture to achieve strong performance in both understanding and generation tasks. This work represents a meaningful advancement in the field of audio language modeling, addressing key challenges and setting the stage for future research in audio processing and generation.
The paper proposes a novel audio tokenizer, ReasoningCodec, which effectively separates audio representations into reasoning and reconstruction tokens. This dual-token approach allows for higher-level abstractions while maintaining fidelity in audio reconstruction. The architecture's functional layer specialization is a significant methodological advancement, optimizing the processing of audio and text tokens across different transformer layers, which is a departure from the traditional uniform approach. The introduction of auditory sentences as a means to unify task construction is innovative and enhances the model's ability to handle complex audio tasks.
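One way such auditory sentences could be serialized for a single autoregressive model is sketched below, interleaving text tokens, reasoning tokens, and reconstruction tokens between special markers; the ordering and marker vocabulary are purely hypothetical and are not taken from UniAudio 2.0.

```python
def build_auditory_sentence(text_ids, reasoning_ids, recon_ids, special):
    """Serialize one training example into a single autoregressive token stream:
    text first, then reasoning tokens (high-level plan), then reconstruction tokens
    (acoustic detail). The layout and marker tokens are illustrative assumptions."""
    return (
        [special["<text>"]] + text_ids + [special["</text>"]]
        + [special["<reason>"]] + reasoning_ids + [special["</reason>"]]
        + [special["<audio>"]] + recon_ids + [special["</audio>"]]
    )

# Hypothetical special-token ids appended after the base vocabulary.
special = {tok: i for i, tok in enumerate(
    ["<text>", "</text>", "<reason>", "</reason>", "<audio>", "</audio>"], start=50_000)}
seq = build_auditory_sentence(list(range(10)), list(range(100, 108)), list(range(200, 232)), special)
print(len(seq), seq[:4])
```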
The authors conducted extensive experiments across various speech, sound, and music tasks, demonstrating competitive performance on in-domain evaluations. The model's ability to generalize to unseen tasks in few-shot and zero-shot settings is particularly noteworthy, showcasing its robustness and versatility. However, the paper could benefit from more detailed quantitative results and comparisons with state-of-the-art models to better contextualize its performance.
The authors commit to providing demo, code, and checkpoints, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics and hyperparameter settings that would facilitate full reproducibility by other researchers.
The paper acknowledges potential risks associated with misuse of the technology, such as impersonation and copyright issues. However, it does not delve deeply into the technical limitations of the model itself, such as potential biases in the training data or the scalability of the approach to more complex audio tasks.
The proposed model has significant implications for applications in creative assistance, human-computer interaction, and audio generation. However, the authors rightly caution against potential misuse, emphasizing the need for responsible deployment practices to mitigate risks associated with audio generation technologies.
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at \href{https://dongchaoyang.top/UniAudio2Demo/}{https://dongchaoyang.top/UniAudio2Demo/}.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Independent Researcher
The main contribution of this work is the development of UniAudio 2.0, a unified audio language model that effectively integrates understanding and generation tasks through innovative tokenization and architecture strategies. This paper represents a meaningful advancement in the field of audio language models, addressing key challenges and providing a robust framework for future research and applications.
The paper introduces a novel audio tokenizer, ReasoningCodec, which effectively separates audio into reasoning and reconstruction tokens, addressing the limitations of existing discrete tokenizers. The proposed unified autoregressive architecture with functional layer specialization enhances the model's ability to process both audio and text, allowing for improved understanding and generation. The introduction of auditory sentences as a method for multi-task training is particularly innovative, as it facilitates the integration of diverse audio tasks without the need for extensive manual task design.
The authors report extensive experiments on a large dataset comprising 100B text tokens and 60B audio tokens, demonstrating competitive performance on various tasks. The few-shot and zero-shot generalization capabilities are particularly noteworthy, indicating the model's robustness and versatility across different audio-related tasks. However, specific metrics and comparisons with baseline models could be more thoroughly detailed to strengthen the claims of performance.
The paper mentions that demo, code, and checkpoints will be made available, which is a positive aspect for reproducibility. However, the absence of a detailed description of the experimental setup, hyperparameters, and model training procedures limits the ease with which others can replicate the results.
The paper acknowledges potential risks associated with audio generation, such as misuse and copyright issues, but it could benefit from a more in-depth discussion of the limitations of the proposed model itself, including any biases in the training data or challenges in the generalization to highly diverse audio tasks.
The implications of this research are significant, as it opens avenues for advanced applications in creative assistance, human-computer interaction, and audio content generation. However, the authors rightly highlight the ethical considerations and potential for misuse, which need to be addressed as the technology develops.
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we in this paper present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
Primary: National Taiwan University
All Institutions: National Taiwan University
The main contribution of this paper is the introduction of URSA-GAN, a unified framework for robust speech adaptation that effectively addresses domain mismatches in ASR and SE through innovative use of dual-embedding architectures and GANs. This work significantly advances the state of the art in speech processing, providing a scalable solution for real-world applications.
The proposed URSA-GAN framework presents a novel approach to address the challenges of domain adaptation in ASR and SE by leveraging a dual-embedding architecture that captures noise and channel characteristics. This method is innovative in its use of generative adversarial networks (GANs) combined with dynamic stochastic perturbation for enhanced robustness. The architecture is well-structured, with a clear delineation of roles for the noise encoder, channel encoder, and generator, which collectively facilitate effective domain adaptation. The introduction of instance-level embeddings and the use of feature-wise linear modulation (FiLM) for conditioning the generator on noise and channel characteristics are particularly noteworthy. However, the complexity of the model may pose challenges in practical applications.
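To illustrate the conditioning pattern noted above, here is a minimal FiLM layer that scales and shifts generator features using parameters predicted from a noise/channel embedding; the module name, dimensions, and tensor shapes are assumptions rather than the URSA-GAN implementation.

```python
import torch
import torch.nn as nn

class FiLMCondition(nn.Module):
    """Feature-wise linear modulation: scale and shift features with
    parameters predicted from a conditioning embedding."""
    def __init__(self, embed_dim: int, feature_dim: int):
        super().__init__()
        # One linear layer predicts both gamma (scale) and beta (shift).
        self.to_gamma_beta = nn.Linear(embed_dim, 2 * feature_dim)

    def forward(self, features: torch.Tensor, embedding: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, feature_dim); embedding: (batch, embed_dim)
        gamma, beta = self.to_gamma_beta(embedding).chunk(2, dim=-1)
        return gamma.unsqueeze(1) * features + beta.unsqueeze(1)

# Example: condition intermediate generator features on a combined
# noise + channel embedding (hypothetical sizes).
film = FiLMCondition(embed_dim=256, feature_dim=512)
feats = torch.randn(4, 100, 512)
cond = torch.randn(4, 256)
out = film(feats, cond)
```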
The experiments conducted are extensive and cover a variety of datasets and scenarios, demonstrating the effectiveness of URSA-GAN in improving ASR and SE performance under mismatched conditions. The results show significant improvements in character error rates and perceptual metrics, validating the framework's robustness. The evaluation metrics used are appropriate, and the comparative analysis against baseline models and previous works strengthens the claims made by the authors. However, the paper could benefit from more detailed ablation studies to further clarify the contributions of individual components.
The paper provides a comprehensive description of the methodology, including the architecture, training process, and evaluation metrics, which facilitates reproducibility. However, the lack of a publicly available code repository or demo limits the ability of other researchers to replicate the experiments fully. Clearer documentation of hyperparameters and training configurations would enhance reproducibility.
One limitation is the reliance on pre-trained models for the noise and channel encoders, which may not generalize well to all domains. Additionally, the model's complexity could hinder its deployment in real-time applications, especially on resource-constrained devices. The performance gap between URSA-GAN and models trained on labeled target-domain data suggests that while the framework is effective, it may still require some labeled data for optimal performance.
The proposed framework has significant implications for real-world applications of ASR and SE, particularly in environments with varying noise and channel conditions. By improving the robustness of these systems, URSA-GAN could enhance user experiences in various domains, including telecommunications, voice assistants, and hearing aids. The approach also opens avenues for further research in domain adaptation techniques across different audio processing tasks.
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving 23% lower WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at https://braskai.github.io/pfluxtts/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PFluxTTS, a hybrid TTS system that effectively combines duration-guided and alignment-free models to improve naturalness and stability in speech synthesis. This work represents a meaningful step forward in addressing key challenges in the field of text-to-speech technology, particularly in cross-lingual applications.
The proposed methodology of PFluxTTS is innovative, combining a dual-decoder architecture that integrates both duration-guided and alignment-free models through inference-time vector-field fusion. This hybrid approach effectively addresses the stability-naturalness trade-off prevalent in existing TTS systems. The use of FLUX-based speech-prompt embeddings for robust cross-lingual voice cloning is a significant advancement, allowing the model to maintain speaker identity across languages without relying on prompt transcripts. Additionally, the integration of a modified PeriodWave vocoder with super-resolution capabilities to synthesize high-quality audio at 48 kHz from low-rate mel features is a noteworthy enhancement.
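The inference-time fusion of the two decoders can be pictured as integrating a weighted combination of their predicted vector fields. The sketch below uses plain Euler steps; the two callables, the fusion weight, and the step count are placeholders rather than the PFluxTTS configuration.

```python
import torch

def fused_euler_sampling(v_duration, v_alignfree, x, cond, steps=32, w=0.5):
    """Euler integration of a convex combination of two predicted vector fields,
    sketching inference-time fusion of a duration-guided and an alignment-free
    decoder (illustrative only)."""
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((x.size(0), 1), k * dt, device=x.device)
        v = w * v_duration(x, t, cond) + (1 - w) * v_alignfree(x, t, cond)
        x = x + dt * v  # one Euler step along the fused field
    return x

# Toy usage with stand-in vector fields.
f1 = lambda x, t, c: -x
f2 = lambda x, t, c: torch.zeros_like(x)
mel = fused_euler_sampling(f1, f2, torch.randn(2, 80), cond=None)
```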
The experimental evaluation is comprehensive, utilizing a variety of datasets that reflect real-world challenges in TTS, particularly in cross-lingual scenarios. The authors provide both subjective and objective metrics to assess performance, demonstrating that PFluxTTS outperforms several state-of-the-art systems in terms of naturalness and speaker similarity. The use of statistical significance tests to validate the results adds rigor to the findings. However, the reliance on a limited number of baselines may restrict the generalizability of the conclusions.
The paper includes detailed descriptions of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to replicate the results fully. The authors could improve reproducibility by providing access to their training data and model checkpoints.
One limitation of the study is the potential overfitting to the specific datasets used for training and evaluation, which may not represent the full diversity of real-world speech. Additionally, while the system shows robustness in challenging conditions, the performance on extremely noisy or low-quality inputs is not thoroughly explored. The authors also note that the model's performance may vary with different languages, which could limit its applicability in multilingual contexts.
The advancements presented in PFluxTTS have significant implications for applications in AI dubbing, virtual assistants, and accessibility technologies. By improving cross-lingual voice cloning and audio quality, the system can enhance user experience in multilingual environments, making technology more inclusive. Furthermore, the research contributes to the ongoing development of high-fidelity TTS systems, which can benefit various industries, including entertainment, education, and customer service.
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
Primary: Tsinghua University
All Institutions: Tsinghua University
The main contribution of this paper is the introduction of PACE, a novel framework for pretrained audio continual learning that effectively addresses the unique challenges posed by audio data distributions. This work significantly advances the field by providing a systematic approach to continual learning in audio, demonstrating state-of-the-art performance across multiple benchmarks while offering a foundation for future research in this area.
The methodology presented in this paper is robust and innovative, addressing the unique challenges of continual learning (CL) in audio contexts, particularly the upstream-downstream misalignment that has hindered previous approaches. The introduction of PACE, which combines improved first-session adaptation (FSA) with multi-session adaptation (MSA) and boundary-aware regularization, is a significant advancement. The paper meticulously details the design choices behind each component, demonstrating a clear understanding of the audio domain's intricacies. The use of analytic classifiers and adaptive subspace-orthogonal PEFT is particularly noteworthy, as it showcases a tailored approach to audio CL that diverges from traditional vision-based methods.
The experimental evaluation is thorough, employing six diverse audio CL benchmarks that effectively highlight the strengths and weaknesses of the proposed method. The results consistently demonstrate that PACE outperforms state-of-the-art methods, providing strong empirical evidence for its effectiveness. The ablation studies further reinforce the validity of the proposed components, illustrating how each contributes to the overall performance. However, the paper could benefit from additional comparisons with more recent methods in the audio domain, if available.
The authors commit to releasing their code and benchmarks, which is a positive aspect for reproducibility. The detailed descriptions of the experimental setup, including hyperparameters and dataset configurations, enhance the likelihood that other researchers can replicate the results. However, the absence of a demo or interactive component limits immediate accessibility for broader audiences.
One limitation is the potential for overfitting in fine-grained tasks, as indicated by the authors. The paper also acknowledges that while PACE narrows the gap to joint training, it does not completely eliminate it, suggesting that further improvements could be made. Additionally, the reliance on specific pretrained models may limit the generalizability of the findings across different audio tasks.
The implications of this work are significant, particularly for applications in speech recognition, audio event detection, and environmental sound understanding. By addressing the challenges of continual learning in audio, the proposed methods could enhance the robustness and adaptability of audio models in real-world scenarios, leading to more effective and reliable systems.
We propose a data-driven sparse recovery framework for hybrid spherical linear microphone arrays using singular value decomposition (SVD) of the transfer operator. The SVD yields orthogonal microphone and field modes, reducing to spherical harmonics (SH) in the SMA-only case, while incorporating LMAs introduces complementary modes beyond SH. Modal analysis reveals consistent divergence from SH across frequency, confirming the improved spatial selectivity. Experiments in reverberant conditions show reduced energy-map mismatch and angular error across frequency, distance, and source count, outperforming SMA-only and direct concatenation. The results demonstrate that SVD-modal processing provides a principled and unified treatment of hybrid arrays for robust sparse sound-field reconstruction.
Primary: The University of Sydney
All Institutions: The University of Sydney
The main contribution of this paper is the introduction of a unified SVD-modal framework for sparse sound field reconstruction using hybrid microphone arrays, which significantly improves spatial selectivity and robustness in reverberant environments. This work provides a principled approach that advances the state of the art in audio processing and sound field analysis, addressing key limitations of existing methods.
The proposed methodology leverages singular value decomposition (SVD) to derive a unified modal solution for sound field reconstruction using hybrid spherical-linear microphone arrays. This approach is innovative as it generalizes existing spherical harmonic (SH) processing while introducing complementary modes from linear microphone arrays (LMAs). The paper effectively integrates theoretical foundations with practical applications, demonstrating a clear understanding of the challenges posed by reverberant environments and the limitations of previous methods. The modal analysis and the use of a well-conditioned dictionary for sparse recovery are particularly noteworthy, as they provide a robust framework for addressing the underdetermined nature of the problem.
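As a rough numerical sketch of the modal processing (with a random complex matrix standing in for the actual hybrid-array transfer operator, an arbitrary mode truncation, and the paper's sparse-recovery step over the modal dictionary omitted):

```python
import numpy as np

# H maps candidate field coefficients to microphone signals at one frequency
# (illustrative shapes only; not the paper's transfer operator).
rng = np.random.default_rng(0)
num_mics, num_field_coeffs = 36, 64
H = (rng.standard_normal((num_mics, num_field_coeffs))
     + 1j * rng.standard_normal((num_mics, num_field_coeffs)))

U, s, Vh = np.linalg.svd(H, full_matrices=False)   # U: microphone modes, Vh: field modes

p = rng.standard_normal(num_mics) + 1j * rng.standard_normal(num_mics)  # measured pressures
K = 20                                             # retained (well-conditioned) modes
modal_coeffs = (U[:, :K].conj().T @ p) / s[:K]     # project measurements onto leading modes
field_estimate = Vh[:K].conj().T @ modal_coeffs    # reconstruct field coefficients
```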
The experimental evaluation is comprehensive, utilizing simulations in reverberant conditions to assess the performance of the proposed method against baseline techniques such as SMA-only and residue refinement. The metrics employed, including energy map mismatch and angular error, are appropriate for the task and provide a clear indication of the method's effectiveness. The results consistently demonstrate the advantages of the SVD-modal framework, particularly in terms of spatial accuracy and robustness under varying conditions, which strengthens the paper's claims.
The paper lacks specific implementation details that would facilitate reproducibility, such as access to the datasets used for training and testing, or the code for the proposed algorithm. While the methodology is well described, the absence of a project URL or demo limits the ability of other researchers to replicate the findings. Clearer documentation and sharing of resources would enhance reproducibility.
One limitation of the study is the reliance on simulated environments, which may not fully capture the complexities of real-world acoustic conditions. Additionally, the trade-off between energy-map fidelity and localization accuracy when varying the number of modes could be further explored. The paper suggests future work on optimal mode selection, indicating that the current approach may not be universally applicable across all scenarios.
The proposed framework has significant implications for audio processing applications, particularly in environments where accurate sound field reconstruction is critical, such as in virtual reality, augmented reality, and advanced audio capture technologies. By improving the spatial resolution and robustness of sound field reconstruction, this work could enhance user experiences in immersive audio applications and contribute to advancements in spatial audio technologies.
Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.
Primary: The University of Melbourne
All Institutions: The University of Melbourne, Wuhan University, The Hong Kong University of Science and Technology (Guangzhou), The University of Auckland
The paper presents a pioneering approach to emotional TTS through activation steering, significantly advancing the field by enabling composable emotional expression and challenging existing paradigms in TTS architecture. The methodology is innovative, and while the experimental results are promising, further validation and implementation details would strengthen the contributions to the field.
The paper introduces a novel framework for emotional TTS that leverages activation steering via latent direction vectors. This approach is significant as it allows for composable and controllable emotional expression, addressing the limitations of existing TTS systems that typically enforce a single emotion per utterance. The methodology is well-structured, systematically analyzing the linear steerability of emotion representations and proposing a quantitative steering framework. The introduction of multi-rater evaluation protocols is particularly noteworthy, as it enhances the assessment of emotional synthesis quality.
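A lightweight way to picture activation steering is a forward hook that adds a scaled emotion direction to a chosen layer's output at inference time; the layer, the direction estimate, and the scale below are illustrative assumptions, not the paper's setup.

```python
import torch
import torch.nn as nn

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, alpha: float):
    """Register a hook that adds a scaled, normalized latent direction to the layer output."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        return output + alpha * direction.to(output.dtype)
    return layer.register_forward_hook(hook)

# Toy stand-in for one block of a TTS language module.
block = nn.Linear(512, 512)
# e.g. mean(happy activations) - mean(neutral activations), hypothetical here.
happy_direction = torch.randn(512)
handle = add_steering_hook(block, happy_direction, alpha=2.0)
steered = block(torch.randn(1, 512))
handle.remove()  # detach the hook when done
```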
The experiments conducted are robust, demonstrating the effectiveness of the proposed method in generating mixed-emotion synthesis and addressing text-emotion mismatches. The results indicate that emotional prosody is primarily synthesized by the TTS language module, which is a significant finding that challenges previous assumptions about TTS architecture. However, the paper could benefit from more extensive datasets and comparisons with state-of-the-art systems to further validate the claims.
The paper lacks detailed implementation information that would facilitate reproducibility. While the methodology is described, the absence of specific parameters, datasets, and code availability limits the ability of other researchers to replicate the results. Including a supplementary material section with these details would enhance the paper's reproducibility.
One limitation of the study is the potential overfitting to the datasets used for training and evaluation, which may not generalize well to all types of emotional speech. Additionally, the paper does not thoroughly address the computational efficiency of the proposed method, which is crucial for real-time applications.
The implications of this research are significant for various applications, including virtual assistants, gaming, and mental health support systems, where nuanced emotional expression can enhance user experience. The ability to generate human-like emotional speech can lead to more engaging and relatable interactions in AI systems.
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
Primary: Meta
All Institutions: Meta, Institut Polytechnique de Paris
The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative approaches. This work significantly advances the field by providing a more coherent and integrated method for audio-visual alignment, with promising applications across multiple domains.
The proposed Conditional Flow Matching (CFM) framework represents a significant methodological advancement by reframing visually-guided acoustic highlighting as a generative problem rather than a discriminative one. This shift allows for a more nuanced approach to audio remixing, addressing the inherent ambiguities present in the task. The introduction of a rollout loss to mitigate prediction errors during iterative flow-based generation is a clever solution to the problem of trajectory drift, enhancing the stability of the model. The conditioning module that integrates audio and visual cues is also a noteworthy innovation that enables more effective cross-modal source selection.
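One way to read the training objective is a standard conditional flow-matching loss plus an endpoint penalty from a short deterministic rollout. The sketch below assumes a linear interpolation path and a generic `model(x, t, cond)` vector-field predictor, so it should be taken as an illustration rather than the authors' loss.

```python
import torch

def cfm_with_rollout_loss(model, x0, x1, cond, rollout_steps=4, lam=0.1):
    """Flow-matching regression plus an endpoint ('rollout') penalty.
    Path, step count, and weighting are illustrative choices."""
    b = x0.size(0)
    t = torch.rand(b, 1, device=x0.device)
    xt = (1 - t) * x0 + t * x1                       # point on the straight path
    v_target = x1 - x0                               # constant target velocity
    fm_loss = ((model(xt, t, cond) - v_target) ** 2).mean()

    # Short Euler rollout from x0 toward t=1; penalize drift at the final step.
    x = x0
    dt = 1.0 / rollout_steps
    for k in range(rollout_steps):
        tk = torch.full((b, 1), k * dt, device=x0.device)
        x = x + dt * model(x, tk, cond)
    rollout_loss = ((x - x1) ** 2).mean()
    return fm_loss + lam * rollout_loss
```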
The paper provides extensive quantitative and qualitative evaluations, demonstrating that the CFM framework consistently outperforms existing state-of-the-art methods. The experimental design appears robust, utilizing a variety of datasets to validate the effectiveness of the proposed approach. However, specific details regarding the datasets used and the metrics for evaluation could be elaborated upon to strengthen the findings.
The paper lacks detailed implementation specifics that would facilitate reproducibility. While the methodology is described, there are no links to code repositories or supplementary materials that would allow other researchers to replicate the experiments. Providing such resources would significantly enhance the paper's impact and utility in the research community.
One limitation is the potential for the model to overfit to the training data, especially given the complexity of the generative task. Additionally, the paper does not address the computational efficiency of the proposed method, which could be a concern for real-time applications. The reliance on visual cues may also limit the model's applicability in scenarios where visual information is not available or is of low quality.
The implications of this research are substantial, particularly in fields such as multimedia content creation, virtual reality, and assistive technologies for the hearing impaired. By improving the alignment of audio and visual elements, the proposed framework could enhance user experiences in various applications, making it a valuable contribution to the intersection of audio processing and machine learning.
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
Primary: Institut Polytechnique de Paris
All Institutions: Meta, Institut Polytechnique de Paris
The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative models. The innovative methodology, combined with promising experimental results, positions this work as a significant advancement in the intersection of audio and visual machine learning.
The proposed Conditional Flow Matching (CFM) framework represents a significant methodological shift from traditional discriminative models to a generative approach for visually-guided acoustic highlighting. The introduction of a rollout loss to mitigate error propagation in iterative flow-based generation is an innovative solution to a common problem in generative modeling. Additionally, the conditioning module that integrates audio and visual cues before vector field regression is a thoughtful enhancement that allows for explicit cross-modal source selection, which is crucial for the task at hand.
The authors conducted extensive quantitative and qualitative evaluations, demonstrating that their method consistently outperforms the previous state-of-the-art discriminative approach. However, the paper would benefit from a more detailed description of the datasets used, including their size, diversity, and relevance to the task. The evaluation metrics employed should also be clearly defined to allow for reproducibility and comparison with future work.
The paper lacks sufficient implementation details that would allow other researchers to reproduce the results. While the methodology is described, specifics regarding hyperparameters, training procedures, and the computational resources used are not provided. Including a supplementary material section with this information or a link to a code repository would significantly enhance reproducibility.
One limitation of the proposed method is its reliance on the quality of the visual input, which may not always be reliable in real-world scenarios. Additionally, the complexity of the model may lead to longer inference times, which could be a drawback for real-time applications. The authors should also address potential overfitting issues, especially given the generative nature of the approach.
The implications of this research extend beyond audio-visual alignment, potentially influencing fields such as multimedia content creation, augmented reality, and assistive technologies for the hearing impaired. By improving the coherence between audio and visual stimuli, this work could enhance user experiences in various applications, making it a valuable contribution to the field.
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of physiological sounds caused by the occlusion effect; however, existing approaches often fail under real-world noise or rely on computationally expensive models. We present EarResp-ANS, the first system enabling fully on-device, real-time RR estimation on commercial earphones. The system employs LMS-based adaptive noise suppression (ANS) to attenuate ambient noise while preserving respiration-related acoustic components, without requiring neural networks or audio streaming, thereby explicitly addressing the energy and privacy constraints of wearable devices. We evaluate EarResp-ANS in a study with 18 participants under realistic acoustic conditions, including music, cafeteria noise, and white noise up to 80 dB SPL. EarResp-ANS achieves robust performance with a global MAE of 0.84 CPM, reduced to 0.47 CPM via automatic outlier rejection, while operating with less than 2% processor load directly on the earphone.
Primary: Karlsruhe Institute of Technology
All Institutions: Karlsruhe Institute of Technology
The main contribution of this paper is the development of EarResp-ANS, a novel system for real-time respiration rate estimation using in-ear audio sensing, which effectively addresses noise interference and energy constraints in wearable devices. This work represents a meaningful advancement in the field of unobtrusive health monitoring technologies, combining innovative signal processing techniques with practical applications in everyday life.
The methodology presented in EarResp-ANS is innovative, leveraging LMS-based adaptive noise suppression to enhance the accuracy of respiration rate estimation from in-ear audio signals. The decision to avoid neural networks and audio streaming is commendable, as it addresses energy efficiency and privacy concerns, which are critical in wearable technology. The paper provides a clear description of the signal processing techniques used, although further details on the implementation specifics would enhance understanding.
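For readers unfamiliar with LMS-based cancellation, the sketch below shows the textbook adaptive-filter loop: an ambient reference is filtered to predict its leakage into the in-ear signal and subtracted, with the filter weights updated from the residual. Signal names, tap count, and step size are illustrative, not the EarResp-ANS parameters.

```python
import numpy as np

def lms_noise_suppression(primary, reference, num_taps=64, mu=1e-3):
    """Minimal LMS adaptive noise canceller: estimate the noise component of
    `primary` (in-ear mic) from `reference` (ambient mic) and subtract it."""
    w = np.zeros(num_taps)            # adaptive filter weights
    buf = np.zeros(num_taps)          # sliding window over the reference signal
    out = np.zeros_like(primary)
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        noise_est = w @ buf           # predicted ambient leakage
        e = primary[n] - noise_est    # error = cleaned sample
        w += mu * e * buf             # LMS weight update
        out[n] = e
    return out

# Toy usage: a slow respiration-like component buried in ambient noise.
fs = 16000
t = np.arange(fs) / fs
ambient = np.random.randn(fs)
breathing = 0.2 * np.sin(2 * np.pi * 0.3 * t)
cleaned = lms_noise_suppression(breathing + 0.5 * ambient, ambient)
```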
The experimental setup is robust, involving 18 participants and testing under various realistic acoustic conditions. The reported results, including a global MAE of 0.84 CPM and improved performance with outlier rejection, demonstrate the system's effectiveness. However, the sample size could be considered limited for broader generalizability, and additional metrics could provide a more comprehensive performance evaluation.
The paper lacks sufficient detail regarding the implementation of the system, which could hinder reproducibility. While the methodology is described, specific parameters, configurations, and the dataset used for training and validation are not thoroughly detailed, making it challenging for other researchers to replicate the study.
One limitation is the relatively small participant pool, which may not capture the variability in respiration rates across different demographics. Additionally, the performance under extreme noise conditions could be further explored, as the current evaluation focuses on a limited range of acoustic environments.
The potential applications of this technology are significant, particularly in health monitoring and wellness, as it allows for unobtrusive and continuous monitoring of a vital sign that is often overlooked. The system's design prioritizes user privacy and energy efficiency, making it suitable for widespread adoption in consumer devices.
Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluated GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as recordings of real-world environments and release these two complementary benchmark task suites: NatHEAR and RealSELD. Our results demonstrate that GRAM outperforms all state-of-the-art self-supervised audio foundation models on NatHEAR and the clean, single-channel version HEAR, while using only a fraction of the training data. GRAM also shows state-of-the-art localization performance in simulated environments and generalizes efficiently to real-world recordings in RealSELD. Taken together, GRAM presents a significant advance toward robust spatial audio foundation models for real-world environments.
Primary: Donders Institute, Radboud University
All Institutions: Donders Institute, Radboud University, Mortimer B Zuckerman Institute, Columbia University
The paper presents GRAM, a significant advancement in spatial audio representation, demonstrating state-of-the-art performance in real-world environments while addressing the limitations of existing audio foundation models. The comprehensive methodology and rigorous evaluation contribute to its potential impact on the field of machine learning and audio processing.
The paper presents GRAM, a multi-channel masked autoencoder designed to learn spatial audio representations. The methodology is well-structured, employing a novel training pipeline that utilizes high-quality simulations of real-world sound environments. The use of a masked autoencoder to reconstruct spatial audio features is innovative, particularly in the context of audio foundation models, which typically overlook spatial dimensions. The introduction of two benchmark suites, NatHEAR and RealSELD, adds significant value by providing standardized evaluation metrics for audio models in complex environments.
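The pretraining idea can be illustrated with a simple routine that masks the same random time-frequency patches across all microphone channels before the encoder sees them; the patch size, mask ratio, and shared-mask choice are assumptions, not GRAM's actual patching scheme.

```python
import torch

def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75, generator=None):
    """Randomly zero out time-frequency patches of a multi-channel spectrogram,
    applying the same mask to every channel (illustrative sketch)."""
    c, f, t = spec.shape
    pf, pt = patch
    nf, nt = f // pf, t // pt
    num_patches = nf * nt
    keep = int(num_patches * (1 - mask_ratio))
    perm = torch.randperm(num_patches, generator=generator)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[keep:]] = True                          # True = masked
    mask2d = mask.view(nf, nt)
    masked = spec.clone()
    for i in range(nf):
        for j in range(nt):
            if mask2d[i, j]:
                masked[:, i*pf:(i+1)*pf, j*pt:(j+1)*pt] = 0.0
    return masked, mask2d

spec = torch.randn(4, 128, 256)   # e.g. 4-channel spectrogram (channels, freq, time)
masked_spec, mask = random_patch_mask(spec)
```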
The experiments are comprehensive, comparing GRAM against state-of-the-art models across various tasks in both simulated and real-world environments. The results demonstrate GRAM's superior performance in sound localization and general-purpose audio representation tasks, achieving state-of-the-art results while requiring less training data. The inclusion of ablation studies further strengthens the evaluation by providing insights into the impact of different model components and training strategies.
The paper provides sufficient details regarding the training process, model architecture, and evaluation metrics, which enhances reproducibility. The authors have made their code and datasets available, which is a positive aspect for the community. However, some specific hyperparameter settings and configurations could be more explicitly detailed to facilitate easier replication of results.
One limitation noted is the inadequate resolution of mel-spectrograms for binaural inputs, which may have impacted localization performance. Additionally, while the model shows promise in real-world applications, its performance in highly complex acoustic environments with significant noise interference remains to be fully explored.
The advancements made by GRAM could significantly impact various applications, including audio-visual scene understanding, robotics, and ambient intelligence systems. By improving the robustness of audio models in real-world environments, this work could enhance user experiences in smart environments and contribute to the development of more sophisticated auditory perception systems.
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale Mr.HiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this work is the introduction of a dual-pathway audio encoder that effectively captures both semantic and dynamic audio features for improved video highlight detection. This innovative approach not only sets a new benchmark in performance but also addresses critical limitations in existing methodologies, paving the way for future research in audio-visual learning.
The proposed methodology, DAViHD, introduces a dual-pathway audio encoder that effectively disentangles audio signals into semantic and dynamic components. This innovative approach allows for a more nuanced understanding of audio features, addressing a significant gap in existing models that often overlook the dynamic characteristics of sound. The use of frequency-adaptive mechanisms and the integration of self-attention in the audio feature fusion process are notable advancements that enhance the model's ability to capture salient moments in videos.
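As an illustration of the dual-pathway idea (not the DAViHD architecture; all dimensions and the fusion scheme are assumed), a toy encoder might project a semantic stream and a frame-level dynamic stream into a shared space, contextualize the fused features with self-attention, and score each frame:

```python
import torch
import torch.nn as nn

class DualPathwayAudioEncoder(nn.Module):
    """Toy two-branch audio encoder: a 'semantic' branch over pretrained embeddings
    and a 'dynamic' branch over frame-level spectro-temporal features."""
    def __init__(self, sem_dim=768, dyn_dim=128, d_model=256):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, d_model)
        self.dyn_proj = nn.Linear(dyn_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.score = nn.Linear(d_model, 1)

    def forward(self, sem_feats, dyn_feats):
        # sem_feats: (B, T, sem_dim); dyn_feats: (B, T, dyn_dim)
        fused = self.sem_proj(sem_feats) + self.dyn_proj(dyn_feats)
        fused, _ = self.attn(fused, fused, fused)     # contextualize before scoring
        return self.score(fused).squeeze(-1)          # per-frame highlight scores

model = DualPathwayAudioEncoder()
scores = model(torch.randn(2, 200, 768), torch.randn(2, 200, 128))
```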
The experimental setup is robust, utilizing large-scale datasets (TVSum and Mr.HiSum) to validate the proposed model. The results demonstrate significant improvements over baseline models, achieving state-of-the-art performance metrics. The thorough comparison against various existing methods, including both audio-visual and visual-only models, strengthens the credibility of the findings. Additionally, the ablation studies provide clear insights into the contributions of different components of the model.
The paper provides detailed implementation details, including the architecture of the model, training parameters, and the datasets used. This level of transparency is crucial for reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results.
While the paper presents a compelling case for the dual-pathway approach, it does not extensively discuss potential limitations or scenarios where the model may underperform. Additionally, the reliance on pre-trained models for feature extraction could introduce biases from those models, which should be acknowledged.
The advancements in audio-visual highlight detection have significant implications for various applications, including content summarization, video retrieval, and recommendation systems. By improving the understanding of audio dynamics, this research could enhance user experiences in multimedia applications, making it a valuable contribution to the field.
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale Mr.HiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
Primary: Seoul National University
All Institutions: Seoul National University
The paper presents a novel approach to audio-visual video highlight detection through the DAViHD framework, which effectively models both semantic and dynamic audio features, achieving state-of-the-art performance and demonstrating the importance of sophisticated audio representations in multimedia understanding.
The proposed Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD) framework effectively disentangles audio signals into semantic and dynamic pathways, addressing a significant gap in existing models that often overlook the rich spectro-temporal dynamics of audio. The architecture employs a frequency-adaptive mechanism, allowing it to capture transient acoustic events, which is a notable advancement over traditional methods that rely on high-level semantic features. The integration of self-attention mechanisms before fusion enhances the model's ability to contextualize audio features, making the methodology both innovative and robust.
The experiments are conducted on well-established benchmarks (TVSum and Mr.HiSum), with the model achieving state-of-the-art results. The use of rigorous evaluation metrics, including F1-score and mean Average Precision (mAP), demonstrates the model's effectiveness in accurately identifying video highlights. The ablation studies provide strong evidence for the contributions of each component, particularly the dual-pathway approach, which significantly outperforms baseline models.
The paper provides detailed implementation details, including the architecture, training protocols, and hyperparameters, which are essential for reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results.
While the model shows impressive performance, it may still be sensitive to the quality of audio input and may not generalize well to videos with poor audio quality or significant background noise. Additionally, the reliance on pre-trained models for audio semantic encoding may introduce biases based on the training data of those models.
The advancements in audio-visual highlight detection have significant implications for content summarization, retrieval, and recommendation systems, enhancing user experiences in various applications such as video streaming platforms and educational content delivery. The methodology could also inspire further research into multi-modal learning frameworks that leverage diverse data types for improved understanding.
Designing front-ends for speech deepfake detectors primarily focuses on two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), integrating wavelets with nonlinearities analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high-frequency and directional resolutions ($Q, L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland, Université PSL, Université de Paris, University of Chinese Academy of Sciences, University of Toronto
The WST-X series presents a novel and effective approach to speech deepfake detection by leveraging wavelet scattering transforms and self-supervised learning features. This work significantly advances the field by addressing the critical need for interpretable and robust detection methods in audio forensics.
The paper introduces the WST-X series, a novel approach that effectively combines wavelet scattering transforms with self-supervised learning features for speech deepfake detection. The methodology is well-structured, detailing the theoretical foundations of the wavelet scattering transform and its integration with SSL features. The dual-branch architecture (WST-X1 and WST-X2) is innovative, allowing for both parallel and cascaded processing of features, which enhances the model's ability to capture subtle acoustic artifacts. The careful selection of parameters (J, Q, M for 1D and J, L, M for 2D) demonstrates a thorough understanding of the underlying signal characteristics and their relevance to deepfake detection.
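Since the front-end is built on the wavelet scattering transform (and the review notes the use of the Kymatio library), a minimal 1D scattering extraction might look like the following; the parameter values are illustrative stand-ins for the paper's (J, Q, M) settings rather than its reported configuration.

```python
import numpy as np
from kymatio.numpy import Scattering1D

# Small averaging scale J with relatively high Q, in the spirit of the analysis;
# max_order corresponds to the order M of scattering coefficients (assumed values).
T = 2 ** 14                                   # number of samples in the excerpt
scattering = Scattering1D(J=4, shape=T, Q=16, max_order=2)

x = np.random.randn(T).astype(np.float32)     # placeholder waveform
Sx = scattering(x)                            # (num_scattering_coeffs, time) array
print(Sx.shape)
```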
The experimental setup is robust, utilizing the challenging Deepfake-Eval-2024 dataset, which is representative of real-world scenarios. The performance metrics chosen (minDCF, EER, F1-score, AUC) are appropriate for evaluating the effectiveness of the proposed methods. The results indicate significant performance improvements over traditional feature extraction methods, showcasing the advantages of the WST-X series in capturing fine-grained spectral anomalies. However, the paper could benefit from more extensive comparisons with other state-of-the-art methods beyond the baseline features mentioned.
The paper provides sufficient detail on the implementation of the WST-X series, including the choice of libraries (Kymatio, Librosa) and model configurations. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider making the code accessible to facilitate further research and validation.
One limitation is the reliance on the Deepfake-Eval-2024 dataset, which may not encompass all potential variations in deepfake generation techniques. Additionally, while the paper emphasizes interpretability, the complexity of the model may still pose challenges in fully understanding the decision-making process of the classifier. The paper does not address potential overfitting issues that may arise from the high-dimensional feature space.
The proposed WST-X series has significant implications for audio forensics and the detection of deepfake technologies, which are increasingly relevant in today's digital landscape. By improving the interpretability and robustness of speech deepfake detection systems, this work contributes to the ongoing efforts to combat misinformation and ensure the integrity of audio content.
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods remain underexplored in distilling the cross-modal projection module (Projector), and often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, a novel approach we propose to highlight important time steps and address dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher to a compact 1.1B-parameter student, consistently outperforming the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
Primary: Harbin Institute of Technology
All Institutions: Harbin Institute of Technology, Ping An Technology (Shenzhen) Co
The paper presents a novel knowledge distillation framework, PL-Distill, that effectively compresses large audio-language models for speech emotion recognition while maintaining high performance. The innovative methodologies and comprehensive experimental evaluations contribute significantly to the advancement of knowledge distillation techniques in multimodal machine learning.
The proposed PL-Distill framework introduces a dual-level knowledge distillation approach that effectively addresses the challenges of distilling large audio-language models for speech emotion recognition. The incorporation of Attention-weighted Centered Kernel Alignment (AwCKA) is particularly innovative, as it dynamically prioritizes important audio tokens based on attention scores, thereby enhancing the alignment of audio embeddings despite dimensional mismatches. This methodological advancement is well-justified in the context of previous work and represents a significant contribution to the field of knowledge distillation in multimodal models.
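To make the two distillation terms concrete, here is a minimal sketch, under my own assumptions, of an attention-weighted linear CKA between student and teacher embeddings of different widths (the PDist idea) and a temperature-scaled KL term on logits (the LDist idea); the weighting scheme, temperature, and shapes are illustrative, not the paper's exact formulation.

```python
# Minimal sketch of the two distillation terms: attention-weighted linear CKA on
# embeddings (PDist-style) and temperature-scaled KL divergence on logits
# (LDist-style). Weighting, temperature, and shapes are assumptions.
import torch
import torch.nn.functional as F

def linear_cka(x, y, eps=1e-8):
    # x: (T, d_s), y: (T, d_t). CKA compares Gram structure, so the student and
    # teacher feature dimensions never need to match.
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.t() @ x).norm(p="fro") ** 2
    return hsic / ((x.t() @ x).norm(p="fro") * (y.t() @ y).norm(p="fro") + eps)

def awcka_loss(student_emb, teacher_emb, attn):
    # attn: (T,) nonnegative attention scores used to emphasize salient time steps
    # (one plausible realization of "attention-weighted" CKA).
    w = (attn / (attn.sum() + 1e-8)).sqrt().unsqueeze(-1)
    return 1.0 - linear_cka(w * student_emb, w * teacher_emb)

def logits_kd_loss(student_logits, teacher_logits, tau=2.0):
    # Temperature-scaled KL(teacher || student) over emotion classes.
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau ** 2

if __name__ == "__main__":
    torch.manual_seed(0)
    s_emb, t_emb = torch.randn(200, 256), torch.randn(200, 1024)   # student vs. teacher widths
    attn = torch.rand(200)
    s_log, t_log = torch.randn(8, 4), torch.randn(8, 4)            # 4 emotion classes
    total = awcka_loss(s_emb, t_emb, attn) + logits_kd_loss(s_log, t_log)
    print(total.item())
```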
The experimental evaluation is robust, utilizing three widely recognized datasets (IEMOCAP, RAVDESS, and SAVEE) to validate the effectiveness of the proposed method. The results demonstrate that PL-Distill not only compresses the teacher model significantly but also outperforms both the teacher and state-of-the-art models across all metrics. The ablation studies further substantiate the contributions of each component of the framework, providing a clear understanding of the impact of the proposed methods.
The paper provides detailed descriptions of the model architecture, training strategies, and evaluation metrics, which are essential for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully generalize to other SER tasks or datasets. Additionally, while the method shows promise, the computational efficiency of the distillation process itself could be further explored to ensure practical applicability in real-world scenarios.
The implications of this research extend beyond speech emotion recognition, as the PL-Distill framework could be adapted for various audio-language tasks, potentially improving the efficiency of deploying large models in resource-constrained environments. The focus on effective knowledge transfer in multimodal contexts may also inspire future research in related areas.
We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetics evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and strong zero-shot transfer to 95 unseen languages. HuPER is also the first framework to enable adaptive, multi-path phonetic perception under diverse acoustic conditions. All training data, models, and code are open-sourced. Code and demo are available at https://github.com/HuPER29/HuPER.
Primary: University of California, Berkeley
All Institutions: Zhejiang University, University of California, Berkeley
HuPER presents a novel framework for phonetic perception that integrates adaptive inference with acoustic and linguistic knowledge, achieving state-of-the-art performance with limited training data. The methodology is robust, and the implications for practical applications in speech technology are substantial, marking a significant advancement in the field.
The methodology proposed in HuPER is innovative as it frames phonetic perception as adaptive inference, integrating both acoustic-phonetic evidence and linguistic knowledge. The four-stage training pipeline is well-structured, starting from a small annotated corpus and leveraging a larger transcript-only corpus for pseudo-label generation. The use of a Corrector model to learn edit operations is particularly noteworthy, as it enhances the robustness of the phonetic recognizer. This adaptive approach allows for multi-path phonetic perception under varying acoustic conditions, which is a significant advancement in the field.
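As a concrete (and purely hypothetical) illustration of the kind of supervision a Corrector model could learn from, the snippet below derives substitute/insert/delete operations by aligning a phone hypothesis against a reference with standard edit-distance traceback; HuPER's actual Corrector inputs and targets may differ.

```python
# Hypothetical illustration of deriving edit operations between a recognizer's
# phone hypothesis and a reference, the kind of signal a "Corrector" model could
# be trained to predict.
def edit_ops(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete hyp[i-1]
                          d[i][j - 1] + 1,        # insert ref[j-1]
                          d[i - 1][j - 1] + cost) # keep or substitute
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1]):
            ops.append(("keep" if hyp[i - 1] == ref[j - 1] else "sub", hyp[i - 1], ref[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", hyp[i - 1], None))
            i -= 1
        else:
            ops.append(("ins", None, ref[j - 1]))
            j -= 1
    return list(reversed(ops))

if __name__ == "__main__":
    print(edit_ops(["k", "ae", "t"], ["k", "aa", "t", "s"]))
    # [('keep', 'k', 'k'), ('sub', 'ae', 'aa'), ('keep', 't', 't'), ('ins', None, 's')]
```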
The experiments conducted are comprehensive, with the framework achieving state-of-the-art phonetic error rates across five English benchmarks and demonstrating strong zero-shot transfer capabilities to 95 unseen languages. The choice of datasets and benchmarks appears appropriate for validating the performance claims. However, more detailed comparisons with existing state-of-the-art methods would strengthen the evaluation.
The authors have made all training data, models, and code open-sourced, which is commendable and enhances reproducibility. The provided GitHub repository allows other researchers to replicate the experiments and build upon the work. However, additional documentation on the training process and hyperparameter settings would further facilitate reproducibility.
One limitation of the study is the reliance on the initial small human-annotated corpus, which may not capture the full diversity of phonetic variations across different languages. Additionally, while the zero-shot transfer to 95 languages is impressive, the paper does not provide extensive analysis on the performance across these languages, which could vary significantly in phonetic structure.
The potential applications of HuPER are vast, particularly in assistive technologies for education, healthcare, and accessibility. By improving the reliability of phonetic representations, the framework could lead to more effective communication tools for diverse populations. The work also lays a foundation for future developments in speech generation systems, making it a significant contribution to the field of speech and language technologies.
Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, providing valuable applications in situations where audio signals are unavailable or degraded. While recent diffusion-based models such as LipVoicer have demonstrated impressive performance in reconstructing linguistic content, they often lack prosodic consistency. In this work, we propose LipSody, a lip-to-speech framework enhanced for prosody consistency. LipSody introduces a prosody-guiding strategy that leverages three complementary cues: speaker identity extracted from facial images, linguistic content derived from lip movements, and emotional context inferred from face video. Experimental results demonstrate that LipSody substantially improves prosody-related metrics, including global and local pitch deviations, energy consistency, and speaker similarity, compared to prior approaches.
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this work is the introduction of LipSody, a novel lip-to-speech synthesis framework that enhances prosody consistency through a multi-faceted approach to visual input. This paper represents a meaningful advancement in the field of audio synthesis, providing a robust methodology and comprehensive evaluation that could influence future research and applications in multimodal speech synthesis.
The methodology presented in LipSody is innovative, leveraging a diffusion-based framework to enhance prosody consistency in lip-to-speech synthesis. The authors introduce a novel prosody-guiding strategy that integrates speaker identity, linguistic content, and emotional context, which is a significant advancement over previous models that primarily focused on intelligibility. The use of complementary cues for prosody estimation is a thoughtful approach that enhances the model's ability to generate more natural and expressive speech. The architecture is well-structured, utilizing established deep learning techniques while introducing new components like the Emotion Encoder to refine prosody prediction.
The experimental evaluation is thorough, utilizing a large dataset (LRS3) and employing both objective and subjective metrics to assess performance. The results demonstrate significant improvements in prosody-related metrics compared to prior models, while maintaining intelligibility. The use of statistical tests to validate the significance of improvements adds rigor to the findings. However, the paper could benefit from additional comparisons with more recent models beyond LipVoicer to contextualize its contributions further.
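For readers who want to probe such prosody metrics themselves, the sketch below computes a voiced log-F0 deviation and an RMS-energy correlation between generated and reference speech using librosa; the paper's exact global/local pitch and energy-consistency definitions are likely different, so treat this only as an approximate stand-in.

```python
# Sketch of simple prosody-consistency metrics (voiced log-F0 deviation and RMS
# energy correlation) between generated and reference speech.
import numpy as np
import librosa

def prosody_metrics(gen, ref, sr=16000):
    fmin, fmax = librosa.note_to_hz("C2"), librosa.note_to_hz("C7")
    f0_g, voiced_g, _ = librosa.pyin(gen, fmin=fmin, fmax=fmax, sr=sr)
    f0_r, voiced_r, _ = librosa.pyin(ref, fmin=fmin, fmax=fmax, sr=sr)
    n = min(len(f0_g), len(f0_r))
    mask = voiced_g[:n] & voiced_r[:n]                   # frames voiced in both
    logf0_rmse = np.sqrt(np.mean((np.log(f0_g[:n][mask]) - np.log(f0_r[:n][mask])) ** 2))
    e_g = librosa.feature.rms(y=gen)[0]
    e_r = librosa.feature.rms(y=ref)[0]
    m = min(len(e_g), len(e_r))
    energy_corr = np.corrcoef(e_g[:m], e_r[:m])[0, 1]
    return logf0_rmse, energy_corr

if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    ref = np.sin(2 * np.pi * 200 * t).astype(np.float32)   # toy 200 Hz tone
    gen = np.sin(2 * np.pi * 210 * t).astype(np.float32)   # slightly sharp copy
    print(prosody_metrics(gen, ref, sr))
```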
The paper provides detailed implementation specifics, including model architecture, training protocols, and evaluation metrics, which support reproducibility. The authors mention using publicly available codebases for components like the Emotion Encoder and vocoder, which enhances the potential for others to replicate their work. However, the lack of a publicly available code repository for the entire LipSody framework limits full reproducibility.
One limitation is the reliance on the LRS3 dataset, which may not encompass the full diversity of lip movements and emotional expressions found in real-world scenarios. Additionally, while the model shows improvements in prosody consistency, the subjective evaluations indicate that the differences in naturalness are not statistically significant, suggesting that further enhancements could be explored. The model's performance in diverse acoustic environments or with different speaker demographics remains untested.
LipSody has significant potential applications in areas such as assistive technologies for the hearing impaired, silent communication tools, and enhancing multimedia content accessibility. The ability to generate expressive and personalized speech from visual input could also benefit virtual avatars and gaming industries, where realistic character interactions are crucial. The advancements in prosody consistency could lead to more engaging and relatable AI-generated speech, fostering better human-computer interactions.
Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like $\mathrm{log1p}$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
Primary: University of Illinois, Urbana-Champaign
All Institutions: University of Illinois, Urbana-Champaign, AWS AI Labs
The main contribution of this paper is the development of a masked autoencoder framework for universal speech enhancement that effectively handles multiple distortions through self-supervised learning. This work presents a novel approach that not only advances the state of the art in speech enhancement but also opens avenues for further research in self-supervised learning applications in audio processing.
The paper introduces a masked autoencoder framework for speech enhancement that is both self-supervised and capable of handling various distortions. The methodology is well-structured, leveraging an augmentation stack to introduce additional noise, which is a clever approach to pre-training. The dual focus on denoising and dereverberation tasks demonstrates versatility. However, the paper could benefit from a more thorough comparison with existing methods beyond the baseline, as well as a clearer explanation of the specific architecture choices made in the masked autoencoder.
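A rough, assumption-laden sketch of such a pipeline is shown below: distort the waveform at a random SNR, build a log1p-compressed magnitude spectrogram, and mask random time-frequency patches for reconstruction; the SNR range, FFT settings, and mask ratio are illustrative rather than the paper's configuration.

```python
# Illustrative augmentation-plus-masking pipeline: distort the waveform, compute a
# log1p-compressed magnitude spectrogram, and mask random time-frequency patches
# for the autoencoder to reconstruct. All settings are assumptions.
import torch

def augment_and_mask(wave, noise, snr_db_range=(0.0, 15.0),
                     n_fft=512, hop=128, mask_ratio=0.5, patch=(8, 8)):
    # wave, noise: (T,) mono tensors at the same sample rate
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    scale = (wave.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
    noisy = wave + scale * noise[: wave.numel()]

    spec = torch.stft(noisy, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    feat = torch.log1p(spec.abs())                 # log1p magnitude compression

    f_bins, t_bins = feat.shape
    fp, tp = patch
    grid = (torch.rand(f_bins // fp, t_bins // tp) < mask_ratio).float()
    mask = grid.repeat_interleave(fp, 0).repeat_interleave(tp, 1)
    mask = torch.nn.functional.pad(mask, (0, t_bins - mask.shape[1], 0, f_bins - mask.shape[0]))
    return feat * (1.0 - mask), feat, mask.bool()  # masked input, target, mask

if __name__ == "__main__":
    torch.manual_seed(0)
    wave, noise = torch.randn(16000), torch.randn(16000)
    masked, target, mask = augment_and_mask(wave, noise)
    print(masked.shape, mask.float().mean().item())
```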
The experiments are comprehensive, evaluating the model on both in-domain and out-of-domain datasets, which is crucial for assessing generalizability. The results indicate that the proposed method achieves state-of-the-art performance, which is a significant contribution. However, the paper lacks detailed statistical analysis of the results, such as confidence intervals or significance testing, which would strengthen the claims made.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. While the methodology is described, the absence of a clear protocol for reproducing the results limits the ability of other researchers to validate the findings.
One limitation is the reliance on a small amount of paired data for fine-tuning, which may not be feasible in all practical scenarios. Additionally, the paper does not address potential biases in the datasets used for evaluation, which could affect the generalizability of the results.
The proposed method has significant implications for real-world applications in speech enhancement, particularly in environments with varying types of noise. The ability to enhance speech across different distortions makes it a valuable tool for improving communication in challenging acoustic settings, such as in teleconferencing or assistive technologies for the hearing impaired.
Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech enhancement (ParaGSE) framework that leverages a group vector quantization (GVQ)-based neural speech codec. The GVQ-based codec adopts separate VQs to produce mutually independent tokens, enabling efficient parallel token prediction in ParaGSE. Specifically, ParaGSE leverages the GVQ-based codec to encode degraded speech into distinct tokens, predicts the corresponding clean tokens through parallel branches conditioned on degraded spectral features, and ultimately reconstructs clean speech via the codec decoder. Experimental results demonstrate that ParaGSE consistently produces superior enhanced speech compared to both discriminative and generative baselines, under a wide range of distortions including noise, reverberation, band-limiting, and their mixtures. Furthermore, empowered by parallel computation in token prediction, ParaGSE attains about a 1.5-fold improvement in generation efficiency on CPU compared with serial generative speech enhancement approaches.
Primary: University of Science and Technology of China
All Institutions: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China
The paper presents ParaGSE, a novel framework for parallel generative speech enhancement that leverages a GVQ-based neural speech codec to achieve significant improvements in speech quality and processing efficiency. The technical contributions are substantial, addressing key challenges in the field and demonstrating the potential for practical applications in real-world scenarios.
The proposed methodology, ParaGSE, introduces a novel framework for generative speech enhancement that utilizes a group vector quantization (GVQ)-based neural speech codec. This approach is innovative in its use of separate VQs for independent token generation, which facilitates efficient parallel computation. The architecture is well-structured, employing a combination of convolutional layers, BiLSTM, and Conformer blocks to extract features and predict clean tokens. The methodology is sound, with a clear explanation of the components and their interactions, although it could benefit from more detailed comparisons with existing methods in terms of computational complexity.
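The core GVQ idea can be illustrated with a toy module that splits the latent into groups and quantizes each against its own codebook, producing independent token streams that parallel branches could predict; codebook sizes and dimensions below are arbitrary assumptions.

```python
# Toy sketch of group vector quantization (GVQ): the latent is split into groups,
# each quantized against its own codebook, yielding mutually independent token
# streams. Sizes are illustrative.
import torch
import torch.nn as nn

class GroupVQ(nn.Module):
    def __init__(self, dim=256, groups=4, codebook_size=1024):
        super().__init__()
        assert dim % groups == 0
        self.groups, self.sub = groups, dim // groups
        self.codebooks = nn.Parameter(torch.randn(groups, codebook_size, dim // groups))

    def forward(self, z):
        # z: (B, T, dim) -> tokens: (B, T, groups), z_q: (B, T, dim)
        B, T, _ = z.shape
        zg = z.view(B, T, self.groups, self.sub)
        tokens, quantized = [], []
        for g in range(self.groups):
            diff = zg[:, :, g, :].unsqueeze(2) - self.codebooks[g]   # (B, T, K, sub)
            idx = diff.pow(2).sum(-1).argmin(dim=-1)                 # nearest code per frame
            tokens.append(idx)
            quantized.append(self.codebooks[g][idx])                 # (B, T, sub)
        return torch.stack(tokens, dim=-1), torch.cat(quantized, dim=-1)

if __name__ == "__main__":
    vq = GroupVQ()
    tokens, z_q = vq(torch.randn(2, 50, 256))
    print(tokens.shape, z_q.shape)   # (2, 50, 4) (2, 50, 256)
```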
The experimental evaluation is robust, featuring a comprehensive set of experiments that assess the performance of ParaGSE against various baseline models across multiple distortion types. The paper includes both objective and subjective metrics, providing a well-rounded view of the model's effectiveness. The dataset construction is thorough, utilizing real-world noise and reverberation conditions, which enhances the relevance of the findings. However, the paper could improve by including more detailed statistical analyses of the results and discussing the significance of the findings more explicitly.
The paper provides sufficient implementation details, including the architecture, training criteria, and experimental setup, which aids in reproducibility. The availability of codes and speech samples on the provided URL is a positive aspect, although the lack of a direct GitHub repository may limit accessibility for some researchers.
One limitation is the potential complexity of the model, which may hinder deployment in real-time applications. Additionally, while the paper claims efficiency improvements, it does not provide a detailed comparison of the computational costs associated with the proposed method versus the baselines, which could be crucial for practical applications. There is also a noted performance gap in intrusive metrics like LSD compared to discriminative models, which could be a concern for certain applications.
The proposed ParaGSE framework has significant potential for real-world applications in speech enhancement, particularly in environments with various distortions. Its efficiency and ability to produce high-quality speech restoration could benefit communication technologies, assistive listening devices, and speech recognition systems. The advancements in generative models for speech enhancement also contribute to the broader field of audio processing and machine learning.
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
Primary: University of Addis Ababa
All Institutions: University of Addis Ababa, Makerere University, University of Ghana, Digital Umuganda, Media Trust
The WAXAL dataset represents a significant advancement in addressing the digital divide for Sub-Saharan African languages. The comprehensive methodology and ethical considerations underscore its potential to foster inclusive technological development and support linguistic diversity in speech technology.
The methodology for data collection is robust, involving partnerships with local institutions to ensure cultural relevance and linguistic accuracy. The use of image-prompted speech for ASR data collection is innovative, as it encourages more natural speech patterns compared to scripted readings. The detailed steps in the transcription and quality control processes further enhance the dataset's reliability. The TTS data collection is also well-structured, focusing on high-quality recordings in a controlled environment. The collaborative approach with local experts is commendable and addresses ethical considerations effectively.
The paper provides a comprehensive overview of the dataset, including the amount of data collected and the diversity of languages represented. However, it lacks specific experimental results demonstrating the performance of models trained on this dataset, which would have strengthened the technical impact. The statistical analysis of the dataset is thorough, providing valuable insights into its composition, but the absence of comparative evaluations with existing datasets limits the assessment of its relative quality.
The paper outlines a clear methodology for data collection and processing, which aids reproducibility. However, it does not provide implementation details or code for the data collection process, which could hinder others from replicating the study. The dataset is openly accessible, which is a positive aspect for reproducibility in research.
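Since the corpus is hosted on the Hugging Face Hub, a natural first step is to enumerate its configurations and load one with the `datasets` library; the snippet below assumes only the standard `datasets` API and takes config names from what the library reports rather than guessing them.

```python
# Hedged sketch: discover the dataset's configurations (e.g., per-language subsets)
# and load one as a DatasetDict. Exact config and column names should be taken from
# the dataset card at https://huggingface.co/datasets/google/WaxalNLP.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("google/WaxalNLP")
print(configs)                         # list of available subsets
ds = load_dataset("google/WaxalNLP", configs[0])
print(ds)                              # splits and columns for the first subset
```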
The paper acknowledges several limitations, including transcription coverage and dialectal representation, which are significant in the context of the linguistic diversity of the region. The potential for unintended content in the ASR dataset is also a concern, despite quality control measures. Additionally, the dataset's focus on specific languages may not fully capture the linguistic richness of Sub-Saharan Africa.
The WAXAL dataset has the potential to significantly impact the development of speech technologies for underrepresented languages, promoting inclusivity in digital communication. By providing a large-scale resource, it can catalyze research and development in ASR and TTS systems, ultimately benefiting millions of speakers of these languages. The ethical considerations addressed in the paper also highlight the importance of responsible data usage, which is crucial in the context of AI and machine learning.
The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from "garbage music". Curiously, we observe that the standard cross-entropy loss -- a core training metric -- often decreases when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce a noise injection experiment in which controlled noise signals of varying lengths are injected into musical contexts. We hypothesize that a model's loss reacting positively to these perturbations, specifically a sharp increase (the "Peak" area) for short injections, can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve -- rather than its absolute value -- encodes critical information about the quality of the generated content (i.e., model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality -- opening the door to more principled training objectives and sharper benchmarks.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a new evaluation framework for music LLMs based on the dynamics of loss curves, which reveals critical insights into model behavior and challenges existing evaluation paradigms. The findings underscore the need for a shift in how we assess musical quality in generative models, emphasizing the importance of understanding local dynamics over absolute loss values.
The paper introduces a novel methodology through the noise injection experiment, which challenges the conventional understanding of likelihood-based evaluation in music LLMs. The approach is well-structured, employing controlled noise perturbations to investigate the loss dynamics of models. The identification of the "Context Amnesia Effect" is a significant conceptual contribution, providing a new lens through which to understand model behavior in the presence of noise. The methodology is rigorous, with clear definitions and a systematic approach to analyzing loss dynamics.
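The probing logic itself is simple enough to sketch: splice a short run of random tokens into a sequence and compare per-position cross-entropy before and after the injection. In the toy example below, a small random GRU language model stands in for a music LLM such as MusicGen, so only the measurement procedure, not the resulting curve, is meaningful.

```python
# Schematic noise-injection probe: splice noise tokens into a sequence and inspect
# the per-position cross-entropy profile around the injection point.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, T, INJ_LEN, INJ_POS = 512, 256, 16, 128

class ToyLM(nn.Module):
    # Untrained stand-in for an autoregressive music LLM.
    def __init__(self, vocab=VOCAB, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # (B, T) -> logits (B, T, vocab)
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h)

def per_token_loss(model, tokens):
    logits = model(tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1),
                           reduction="none").view(tokens.size(0), -1)

if __name__ == "__main__":
    model = ToyLM().eval()
    clean = torch.randint(0, VOCAB, (1, T))
    corrupted = clean.clone()
    corrupted[:, INJ_POS:INJ_POS + INJ_LEN] = torch.randint(0, VOCAB, (1, INJ_LEN))  # noise splice
    with torch.no_grad():
        delta = per_token_loss(model, corrupted) - per_token_loss(model, clean)
    # For a trained, quality-aware model this difference is hypothesized to spike
    # ("Peak") near INJ_POS; with the random stand-in it is just noise.
    print(delta[0, INJ_POS - 4:INJ_POS + INJ_LEN + 4])
```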
The experiments are comprehensive, utilizing multiple datasets and various types of noise injections to validate the findings. The statistical analyses, including Pearson and Spearman correlation tests, lend credibility to the results, demonstrating significant trends across different models and datasets. However, the reliance on specific datasets, such as the ShutterStock corpus, may limit the generalizability of the findings to broader music contexts.
The paper provides sufficient detail regarding the experimental setup, including the parameters used for noise injection and the models evaluated. However, the lack of explicit information on the datasets and the absence of a publicly available code repository may hinder full reproducibility. The demo page offers some interactive elements, but a complete code release would enhance reproducibility.
One limitation is the focus on specific types of perturbations (noise and order shuffling), which may not encompass all forms of musical corruption. Additionally, the findings may not fully address the complexities of human judgment in music evaluation, as the study primarily relies on model behavior rather than direct comparisons with human assessments.
The implications of this work are significant for the field of music generation and evaluation. By highlighting the limitations of likelihood-based metrics, the paper paves the way for developing more robust evaluation frameworks that align better with human perceptions of musical quality. This could lead to advancements in training objectives for music LLMs and improve the overall quality of generated music.
Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator's task more challenging and improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. This framework effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in anomalous time-frequency localization.
Primary: Tsinghua University
All Institutions: Tsinghua University, Dalian Maritime University, Shenzhen International Graduate School
The main contribution of this paper is the introduction of TLDiffGAN, a novel framework that integrates latent diffusion models with GANs for improved anomalous sound detection. This work significantly advances the state of the art in the field by addressing key limitations of existing generative models and demonstrating superior performance through rigorous experimental validation.
The proposed TLDiffGAN framework innovatively combines latent diffusion models with GANs to enhance the quality of generated spectrograms for anomalous sound detection. The dual-branch architecture effectively integrates features from both raw audio and Mel spectrograms, addressing the limitations of traditional single-modality approaches. The introduction of the TMixup technique to augment temporal features is a significant methodological advancement, enhancing the model's sensitivity to subtle anomalies. However, the complexity of the model may pose challenges in terms of interpretability and practical deployment.
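The paper's exact TMixup definition is not reproduced in this review; as a loosely related illustration of time-localized spectrogram mixing, the sketch below blends two mel spectrograms only inside a randomly chosen temporal window, which is one plausible way to emphasize local temporal patterns.

```python
# Time-localized mixup of two spectrograms: a loose illustration inspired by the
# "TMixup" idea, not the paper's actual augmentation.
import torch

def time_localized_mixup(spec_a, spec_b, alpha=0.4, max_frames=32):
    # spec_a, spec_b: (F, T) log-mel spectrograms of equal shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    T = spec_a.shape[1]
    width = torch.randint(1, max_frames + 1, (1,)).item()
    start = torch.randint(0, max(T - width, 1), (1,)).item()
    mixed = spec_a.clone()
    mixed[:, start:start + width] = lam * spec_a[:, start:start + width] \
                                  + (1 - lam) * spec_b[:, start:start + width]
    return mixed, lam, (start, start + width)

if __name__ == "__main__":
    a, b = torch.randn(128, 256), torch.randn(128, 256)
    mixed, lam, window = time_localized_mixup(a, b)
    print(mixed.shape, round(lam, 3), window)
```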
The experiments conducted on the DCASE 2020 Challenge Task 2 dataset are extensive and demonstrate a clear improvement over existing methods in terms of AUC and pAUC metrics. The comparative analysis with other state-of-the-art methods provides strong evidence for the effectiveness of TLDiffGAN. The ablation studies further validate the contributions of each component, reinforcing the robustness of the proposed framework.
The paper provides detailed implementation details, including network configurations, training protocols, and evaluation metrics, which support reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which others can replicate the results.
One limitation is the reliance on a specific dataset (DCASE 2020) for evaluation, which may not fully capture the generalizability of the model across different domains or types of anomalous sounds. Additionally, the model's complexity could lead to challenges in real-time applications, particularly in resource-constrained environments.
The framework has significant implications for industrial applications, particularly in predictive maintenance and monitoring of machinery, where timely detection of anomalies can prevent failures and reduce downtime. The ability to localize anomalies in the time-frequency domain enhances interpretability, which is crucial for practitioners in the field.
We introduce and define a novel task-Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in terms of audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
Primary: Tsinghua University
All Institutions: Tsinghua University, Ant Group, Shenzhen International Graduate School
The main contribution of this paper is the introduction of a novel framework for scene-aware visually-driven speech synthesis, which significantly advances the field by addressing key challenges in multimodal alignment and data scarcity. The technical contributions, particularly the innovative dataset and alignment module, position this work as a meaningful advancement in audio synthesis research, although further detail on implementation and broader applicability is needed.
The proposed methodology, VividVoice, introduces a unified generative framework that addresses the challenges of data scarcity and modality decoupling in speech synthesis. The construction of the Vivid-210K dataset is a significant contribution, as it establishes a novel correlation between visual scenes, speaker identity, and audio. The D-MSVA alignment module is innovative, utilizing a decoupled memory bank architecture and hybrid supervision strategy, which enhances the model's ability to align visual and auditory modalities effectively. However, the paper could benefit from a more detailed description of the implementation and the specific algorithms used within the D-MSVA module.
The experimental evaluation is robust, featuring both subjective and objective assessments that demonstrate the superiority of VividVoice over existing baseline models. The results indicate improvements in audio fidelity, content clarity, and multimodal consistency, which are critical metrics in speech synthesis. However, the paper lacks a comprehensive comparison with a wider range of baseline models and does not provide enough detail on the experimental setup, such as the number of participants in subjective tests or the specific metrics used for objective evaluation.
The paper does not provide sufficient implementation details that would facilitate reproducibility. While it mentions the construction of the Vivid-210K dataset and the D-MSVA module, it lacks code availability or a clear description of the training process, hyperparameters, and evaluation protocols. This limits the ability of other researchers to replicate the results and build upon this work.
One limitation is the reliance on a single dataset (Vivid-210K), which may not generalize across different contexts or speaker demographics. Additionally, the paper does not address potential biases in the dataset or the implications of using a specific set of visual scenes. The complexity of the D-MSVA module may also pose challenges for real-time applications, which are critical in practical speech synthesis scenarios.
The implications of Scene-Aware Visually-Driven Speech Synthesis are significant, particularly in applications such as virtual reality, gaming, and assistive technologies. By creating more immersive auditory experiences that align with visual contexts, this research can enhance user engagement and accessibility. However, ethical considerations regarding the use of such technology, particularly in terms of deepfakes or misinformation, should be addressed.
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.
Primary: University of Melbourne
All Institutions: University of Melbourne
The main contribution of this paper is the introduction of HierCon, a hierarchical contrastive attention framework that significantly improves audio deepfake detection by effectively modeling temporal and inter-layer dependencies, thereby achieving state-of-the-art performance on benchmark datasets. This work represents a meaningful advancement in the field, addressing critical challenges in distinguishing between real and synthetic audio.
The paper introduces HierCon, a novel hierarchical layer attention framework that effectively captures temporal and inter-layer dependencies in audio deepfake detection. The methodology is well-structured, employing a three-stage attention mechanism that enhances the model's ability to discern subtle differences between real and synthetic audio. The integration of margin-based contrastive learning is particularly noteworthy, as it encourages the model to develop domain-invariant embeddings, thereby improving generalization across various deepfake generation techniques. The detailed explanation of the attention mechanism and the loss functions used provides a solid foundation for understanding the proposed approach.
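A condensed sketch of these two ingredients, with illustrative dimensions and margins, is given below: attention pooling first over frames within each SSL layer and then across layers, followed by a margin-based contrastive loss on the pooled embeddings; the real HierCon additionally models layer groups.

```python
# Sketch of hierarchical attention pooling over per-layer SSL features plus a
# margin-based contrastive loss. Dimensions and margin are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAttnPool(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.frame_score = nn.Linear(dim, 1)   # attention over time within each layer
        self.layer_score = nn.Linear(dim, 1)   # attention across layers

    def forward(self, feats):                  # feats: (B, L, T, D)
        a_t = self.frame_score(feats).softmax(dim=2)        # (B, L, T, 1)
        per_layer = (a_t * feats).sum(dim=2)                 # (B, L, D)
        a_l = self.layer_score(per_layer).softmax(dim=1)     # (B, L, 1)
        return (a_l * per_layer).sum(dim=1)                  # (B, D)

def margin_contrastive(emb, labels, margin=0.5):
    # Pull same-class embeddings together, push different classes beyond `margin`.
    dist = torch.cdist(emb, emb)
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=emb.device)
    pos = (same - eye).clamp(min=0) * dist.pow(2)
    neg = (1 - same) * F.relu(margin - dist).pow(2)
    denom = (1 - eye).sum().clamp(min=1.0)
    return (pos + neg).sum() / denom

if __name__ == "__main__":
    feats = torch.randn(4, 12, 50, 768)        # 12 SSL layers, 50 frames
    emb = F.normalize(HierarchicalAttnPool()(feats), dim=-1)
    labels = torch.tensor([0, 0, 1, 1])        # bona fide vs. spoof
    print(emb.shape, margin_contrastive(emb, labels).item())
```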
The authors conduct thorough experiments on multiple datasets, including ASVspoof 2021 DF and In-the-Wild, demonstrating significant improvements over existing methods. The reported results, including Equal Error Rates (EER), clearly indicate the effectiveness of HierCon, achieving state-of-the-art performance. The inclusion of ablation studies further strengthens the findings, allowing for a clear understanding of the contributions of hierarchical attention and contrastive learning to the overall performance.
While the paper provides a comprehensive description of the methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a demo or project URL also limits the ability for others to validate the findings independently.
One limitation of the study is the reliance on specific datasets for evaluation, which may not fully capture the diversity of real-world audio deepfakes. Additionally, while the hierarchical attention mechanism is promising, the complexity of the model may pose challenges in terms of computational efficiency and scalability for real-time applications.
The implications of this research are significant, particularly in the context of security and online trust, as audio deepfakes pose increasing risks in various domains, including voice authentication and digital forensics. The proposed method has the potential to enhance the robustness of detection systems, contributing to the development of more secure communication technologies.
This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/
Primary: Bar-Ilan University
All Institutions: Bar-Ilan University, OriginAI
The paper presents a novel unsupervised generative method for audio-visual speech separation that significantly advances the field. The technical contributions, particularly in leveraging diffusion models and visual cues, offer promising directions for future research and practical applications in speech processing.
The methodology proposed in SSNAPS is innovative, leveraging generative inverse sampling with diffusion models to separate speech from background noise in an unsupervised manner. The paper reformulates existing inverse sampling techniques to accommodate multiple independent signals and integrates visual cues from lip movements to enhance separation accuracy. The approach's flexibility in handling varying numbers of speakers and the introduction of a novel loss function for off-screen speaker separation are significant advancements. However, the reliance on visual data may limit applicability in scenarios without such cues.
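The update structure, if not the actual diffusion machinery, can be conveyed with a deliberately toy sketch: each source keeps its own prior "score", and every step mixes that prior with a shared data-consistency gradient toward the observed mixture; the Gaussian stand-in scores below replace the learned diffusion priors and noise schedule entirely.

```python
# Highly schematic view of jointly recovering speech and noise from y = speech + noise:
# per-source prior scores plus a shared data-consistency term. The toy scores below
# are placeholders for the actual diffusion priors.
import torch

def toy_score_speech(x):   # placeholder prior score for the speech source
    return -x

def toy_score_noise(x):    # placeholder prior score for the ambient-noise source
    return -0.5 * x

def joint_inverse_sampling(y, steps=200, step_size=5e-3, guidance=1.0):
    s = torch.randn_like(y)          # speech estimate
    n = torch.randn_like(y)          # noise estimate
    for _ in range(steps):
        residual = y - (s + n)       # data-consistency term shared by both sources
        s = s + step_size * (toy_score_speech(s) + guidance * residual)
        n = n + step_size * (toy_score_noise(n) + guidance * residual)
    return s, n

if __name__ == "__main__":
    torch.manual_seed(0)
    y = torch.randn(16000)           # stand-in for a 1 s noisy mixture
    s_hat, n_hat = joint_inverse_sampling(y)
    print(((s_hat + n_hat) - y).abs().mean().item())   # reconstruction residual
```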
The experimental evaluation is robust, with comprehensive testing on mixtures of 1, 2, and 3 speakers across different noise conditions. The results demonstrate that SSNAPS consistently outperforms leading supervised baselines in terms of word error rate (WER), showcasing the effectiveness of the unsupervised approach. The paper provides detailed metrics and comparisons, enhancing the credibility of the findings. However, the paper could benefit from additional qualitative assessments or user studies to further validate the audio quality of the separated signals.
The paper includes sufficient implementation details, including datasets, model architectures, and hyperparameters, which facilitate reproducibility. The authors provide a demo page, but the absence of a public code repository limits the ability for others to reproduce the results independently. The detailed explanation of the experimental setup and evaluation metrics is commendable, yet sharing the actual code would enhance transparency.
One key limitation is the dependency on visual data for performance, which may not be available in all real-world applications. Additionally, while the method shows promise in separating speech from noise, the computational efficiency could be improved, as indicated by the longer runtime compared to supervised methods. The paper also does not address potential challenges in scenarios with more complex noise environments or overlapping speech characteristics.
The advancements presented in this paper have significant implications for various applications, including telecommunication, assistive technologies for the hearing impaired, and audio-visual media processing. By improving speech separation in noisy environments, the method could enhance user experiences in real-world settings, making communication clearer and more effective. The unsupervised nature of the approach also suggests potential for broader adoption in diverse applications without the need for extensive labeled datasets.
Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensive and prone to overfitting, while parameter-efficient methods like LoRA apply adaptation uniformly across layers, overlooking internal representation structure and thus compromising both effectiveness and efficiency. We analyze multilingual ASR models and reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Building on this observation, we propose DAMA, a Depth-Aware Model Adaptation framework that allocates adaptation capacity according to each layer's role. DAMA also introduces Singular Value Decomposition (SVD)-based initialization to constrain adaptation and preserve the U-shaped pattern, as well as a frozen middle-layer basis for further efficiency. Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves a 29% error reduction under extreme data scarcity, and significantly improves memory, training time, and computational efficiency over baselines. These results highlight the benefits of structure-aware adaptation for efficient, scalable multilingual ASR.
Primary: unknown
All Institutions: unknown
The paper presents a novel adaptation framework for multilingual speech recognition that leverages a structured analysis of layer-wise adaptability, significantly improving efficiency and performance in low-resource language settings. The comprehensive evaluation of the proposed methodology and its implications for the field highlight its potential to advance speech technology accessibility.
The proposed Depth-Aware Model Adaptation (DAMA) framework introduces a novel approach to multilingual ASR by analyzing layer-wise adaptability and implementing a U-shaped adaptability pattern. This structured adaptation strategy effectively allocates training resources, enhancing efficiency and performance in low-resource language scenarios. The integration of SVD-based initialization and Basis-Protected Projection further solidifies the method's robustness, allowing for effective adaptation while preserving essential language-agnostic representations.
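Two of the described ideas are easy to prototype under simplifying assumptions: a U-shaped per-layer rank schedule that gives early and late layers more LoRA capacity, and SVD-based initialization of the LoRA factors from the frozen weight's leading singular directions; the ranks, depth, and scaling below are illustrative, not DAMA's reported settings.

```python
# Sketch of (1) a U-shaped per-layer LoRA rank schedule and (2) SVD-based
# initialization of LoRA factors from a frozen weight matrix. All values are
# illustrative assumptions.
import torch
import torch.nn as nn

def u_shaped_ranks(num_layers, r_max=16, r_min=4):
    mid = (num_layers - 1) / 2
    ranks = []
    for i in range(num_layers):
        frac = abs(i - mid) / mid                      # 1 at the ends, 0 in the middle
        ranks.append(int(round(r_min + frac * (r_max - r_min))))
    return ranks

class SVDInitLoRA(nn.Module):
    def __init__(self, weight, rank, alpha=1.0):
        super().__init__()
        u, s, vh = torch.linalg.svd(weight, full_matrices=False)
        root_s = s[:rank].sqrt()
        self.A = nn.Parameter(vh[:rank, :] * root_s.unsqueeze(1))   # (r, in)
        self.B = nn.Parameter(u[:, :rank] * root_s.unsqueeze(0))    # (out, r)
        self.scale = alpha / rank

    def forward(self, x):                              # low-rank update only
        return (x @ self.A.t() @ self.B.t()) * self.scale

if __name__ == "__main__":
    layers = 24
    print(u_shaped_ranks(layers))                      # higher ranks at both ends
    w = torch.randn(1024, 1024)                        # frozen projection weight
    lora = SVDInitLoRA(w, rank=u_shaped_ranks(layers)[0])
    print(lora(torch.randn(2, 1024)).shape)
```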
The experiments conducted on 18 low-resource languages using two benchmark datasets (Common Voice and FLEURS) demonstrate the effectiveness of DAMA. The results indicate that DAMA not only matches or surpasses state-of-the-art performance but also significantly reduces the number of trainable parameters and computational costs. The thorough evaluation across different languages and settings adds credibility to the findings, showcasing the framework's adaptability and efficiency.
The paper provides detailed implementation details, including the datasets used, experimental setup, and hyperparameter settings, which facilitate reproducibility. However, the lack of a publicly available code repository limits the ease of replication for external researchers.
While the study reveals significant findings, it is limited to 18 languages, and the generalizability of the U-shaped adaptability pattern across even more diverse languages remains to be tested. Additionally, the method is optimized for low-resource settings, which may not translate to high-resource scenarios without further adjustments.
The findings have the potential to significantly enhance multilingual speech recognition technologies, particularly for low-resource languages, thereby promoting inclusivity in speech technology applications. This could lead to broader accessibility and usability of speech recognition systems in diverse linguistic contexts.
Emotion recognition from human speech is a critical enabler for socially aware conversational AI. However, while most prior work frames emotion recognition as a categorical classification problem, real-world affective states are often ambiguous, overlapping, and context-dependent, posing significant challenges for both annotation and automatic modeling. Recent large-scale audio language models (ALMs) offer new opportunities for nuanced affective reasoning without explicit emotion supervision, but their capacity to handle ambiguous emotions remains underexplored. At the same time, advances in inference-time techniques such as test-time scaling (TTS) have shown promise for improving generalization and adaptability in hard NLP tasks, but their relevance to affective computing is still largely unknown. In this work, we introduce the first benchmark for ambiguous emotion recognition in speech with ALMs under test-time scaling. Our evaluation systematically compares eight state-of-the-art ALMs and five TTS strategies across three prominent speech emotion datasets. We further provide an in-depth analysis of the interaction between model capacity, TTS, and affective ambiguity, offering new insights into the computational and representational challenges of ambiguous emotion understanding. Our benchmark establishes a foundation for developing more robust, context-aware, and emotionally intelligent speech-based AI systems, and highlights key future directions for bridging the gap between model assumptions and the complexity of real-world human emotion.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a benchmark for ambiguous emotion recognition in speech using audio language models under test-time scaling. This work addresses a critical gap in the field of affective computing by exploring the complexities of real-world emotions, thereby paving the way for more nuanced and context-aware AI systems.
The methodology introduces a novel benchmark for ambiguous emotion recognition using audio language models (ALMs) and test-time scaling (TTS). The systematic comparison of eight state-of-the-art ALMs and five TTS strategies across three datasets is a significant methodological contribution, as it addresses the complexity of real-world emotional states that are often not captured in traditional categorical frameworks. The paper effectively combines existing techniques in a new context, but the lack of detailed descriptions of the TTS strategies and their implementation specifics limits reproducibility.
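Because the TTS strategies themselves are not detailed, a minimal sketch of one representative strategy may help ground the discussion: self-consistency-style voting, where the ALM is sampled repeatedly and the votes are aggregated into a label distribution rather than a single hard prediction. The `generate` callable and the label set below are illustrative assumptions, not the benchmark's actual interface.

```python
# Hedged sketch of one plausible test-time scaling strategy for ambiguous
# emotion recognition: self-consistency voting over repeated ALM samples.
# `generate` is a hypothetical callable wrapping any audio language model.
from collections import Counter
from typing import Callable, Dict, List

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # example label set

def self_consistency_emotion(
    generate: Callable[[str], str],   # returns one sampled label string
    prompt: str,
    n_samples: int = 8,
) -> Dict[str, float]:
    """Sample the model several times and return a label distribution,
    which preserves ambiguity instead of forcing a single hard label."""
    votes: List[str] = []
    for _ in range(n_samples):
        answer = generate(prompt).strip().lower()
        if answer in EMOTIONS:
            votes.append(answer)
    counts = Counter(votes)
    total = sum(counts.values()) or 1
    return {label: counts[label] / total for label in EMOTIONS}
```

Returning a distribution rather than an argmax is one natural way to expose the ambiguity the benchmark targets.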
The experiments are well-structured, utilizing multiple datasets to validate the proposed benchmark. The evaluation metrics are not explicitly detailed in the abstract, which makes their comprehensiveness difficult to judge. The paper could benefit from more extensive quantitative results and visualizations to better illustrate the performance differences between models and TTS strategies, and the inclusion of qualitative analyses or case studies could provide deeper insight into how the models handle ambiguous emotions.
The paper does not provide sufficient implementation details or access to code and data, which raises concerns about reproducibility. While it mentions the use of existing datasets, without clear guidelines or links to the datasets and the specific configurations used in experiments, it may be challenging for other researchers to replicate the findings.
The paper acknowledges the complexity of real-world emotions but does not fully address the limitations of the proposed methods. For instance, the reliance on existing ALMs may limit the generalizability of the findings. Furthermore, the interaction between TTS and model capacity could be explored more rigorously, as the current analysis may not capture all nuances of the performance variations.
The research has significant implications for the development of emotionally intelligent AI systems, particularly in conversational agents and social robotics. By providing a framework for understanding ambiguous emotions, this work could enhance user interactions in various applications, from customer service to mental health support. The establishment of a benchmark for ambiguous emotion recognition also opens avenues for future research in affective computing.
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
Primary: Inria, LIRMM, Université de Montpellier
All Institutions: Inria, LIRMM, Université de Montpellier, Earth Species Project, University of Kassel
The main contribution of this paper is the introduction of a novel contrastive distillation method for audio-to-image retrieval that effectively utilizes text as a semantic intermediary, significantly advancing the field of bioacoustic species recognition. The technical contributions are substantial, providing a practical solution to a challenging problem in a data-scarce environment, and the methodology is both innovative and well-executed, with promising experimental results.
The methodology presented in this paper is innovative as it proposes a contrastive distillation approach to bridge audio and image modalities without requiring paired data. By leveraging a pretrained image-text model (BioCLIP-2) to enhance the audio-text model (BioLingual), the authors effectively create a semantic intermediary that facilitates meaningful audio-to-image retrieval. The use of a contrastive objective for fine-tuning the audio encoder is well-justified and demonstrates a clear understanding of the underlying challenges in cross-modal representation learning. The simplicity of the approach, which avoids complex multi-objective training and direct image supervision, is a significant strength.
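Since the distillation is described as a contrastive objective pulling the fine-tuned audio encoder toward the frozen BioCLIP-2 text space, a minimal sketch of such a loss is given below; the symmetric InfoNCE form and the temperature value are assumptions, as the paper's exact objective is not reproduced here.

```python
# Minimal sketch of a text-mediated distillation objective: an InfoNCE-style
# contrastive loss aligning trainable audio embeddings with frozen text
# embeddings of the matching species description. Temperature is illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(
    audio_emb: torch.Tensor,   # (B, D) from the fine-tuned audio encoder
    text_emb: torch.Tensor,    # (B, D) from the frozen image-text model
    temperature: float = 0.07,
) -> torch.Tensor:
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Symmetric cross-entropy over audio->text and text->audio directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```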
The experiments are robust, utilizing multiple bioacoustic benchmarks to validate the effectiveness of the proposed method. The results indicate that the distilled audio encoder not only improves audio-to-image retrieval performance but also preserves the discriminative capabilities of the audio model. The comparisons against various baselines, including zero-shot and text-embedding mapping strategies, provide a comprehensive evaluation of the method's effectiveness. The use of independent datasets for validation strengthens the credibility of the findings.
The paper mentions that the code will be publicly available after review, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics, such as hyperparameter settings, training duration, and computational resources, which are essential for other researchers to replicate the experiments fully.
One limitation of the study is the reliance on the quality and representativeness of the textual descriptions used for training the audio encoder. If the textual descriptions are not sufficiently diverse or comprehensive, it may impact the generalization of the model. Additionally, while the approach demonstrates strong performance on the evaluated datasets, its applicability to other domains or species not represented in the training data remains uncertain.
The implications of this research are significant for biodiversity monitoring and conservation efforts, particularly in scenarios where paired audio-image data is scarce. By enabling effective audio-to-image retrieval, the proposed method can assist researchers and conservationists in identifying species based on audio recordings, thus enhancing ecological studies and wildlife conservation strategies.
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs of the SOTA lightweight model PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: https://anonymous.4open.science/r/dvpd_demo-E630
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology, Tsinghua University, Sun Yat-sen University
The paper presents a significant contribution to the field of speech enhancement by introducing a novel dual-view approach that balances high-fidelity speech quality with computational efficiency. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on future research and applications in audio processing.
The proposed Dual-View Predictive Diffusion (DVPD) model introduces a novel approach to speech enhancement by leveraging the dual nature of spectrograms as both visual textures and physical frequency-domain representations. The methodology includes the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which effectively preserves critical low-frequency harmonics while reducing high-frequency redundancies, and the Lightweight Image-based Spectro-Awareness (LISA) module, which captures features from a visual perspective. The Training-free Lossless Boost (TLB) strategy further enhances the model's performance during inference without additional training, showcasing a well-thought-out integration of predictive and generative paradigms.
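To make the frequency-adaptive compression idea concrete, here is an illustrative sketch of non-uniform spectral pooling in the spirit of FANC; the band split and pooling factor are assumptions, and this is not the paper's actual encoder.

```python
# Illustrative sketch (not the paper's FANC implementation): non-uniform
# compression of a magnitude spectrogram that keeps low-frequency bins at
# full resolution and average-pools high-frequency bins more aggressively.
import torch
import torch.nn.functional as F

def nonuniform_freq_compress(
    spec: torch.Tensor,        # (B, F, T) magnitude spectrogram
    low_bins: int = 64,        # bins preserved at full resolution (assumed)
    high_pool: int = 4,        # pooling factor for remaining bins (assumed)
) -> torch.Tensor:
    low = spec[:, :low_bins, :]                       # keep harmonics intact
    high = spec[:, low_bins:, :]                      # prune redundant detail
    high = F.avg_pool2d(high.unsqueeze(1), kernel_size=(high_pool, 1)).squeeze(1)
    return torch.cat([low, high], dim=1)              # (B, F', T), F' < F
```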
The experiments are extensive, covering various benchmarks including WSJ0-UNI and VBDMD, demonstrating the model's effectiveness across different distortion scenarios. The results indicate that DVPD achieves state-of-the-art performance while significantly reducing computational complexity compared to existing models. The comprehensive evaluation metrics used, such as PESQ and ESTOI, provide a robust assessment of the model's capabilities.
The paper includes detailed implementation details, including training configurations, loss functions, and evaluation metrics, which are essential for reproducibility. However, the absence of a public code repository limits the ease of reproduction for other researchers.
While the model demonstrates impressive performance, it may still struggle with certain types of distortions not covered in the training datasets. Additionally, the reliance on specific hyperparameters for the TLB strategy may introduce variability in performance across different applications.
The advancements presented in this paper have significant implications for real-world applications in speech enhancement, particularly in noisy environments. The lightweight nature of the model makes it suitable for deployment in resource-constrained settings, potentially benefiting various industries, including telecommunications and assistive technologies.
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic -- complemented by strict intelligibility and duration constraints -- we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
Primary: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences
All Institutions: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, Department of Automation, Tsinghua University, Beijing National Research Center for Information Science and Technology, Tsinghua University
The paper presents a novel framework for imperceptible text-based speech editing that effectively separates content modification from acoustic reconstruction. This approach significantly advances the state of the art, addressing key challenges in speech editing and offering promising applications across multiple domains.
The proposed methodology introduces a novel framework for text-based speech editing that effectively decouples semantic content from acoustic features, addressing the limitations of existing methods that often lead to artifacts and instability. The use of a Flow Matching decoder for acoustic reconstruction and a Self-Consistency Rewards mechanism for perceptual alignment is innovative and well-justified, leveraging a pre-trained TTS model as an implicit critic. This dual-stage approach enhances both intelligibility and naturalness, making significant strides in the field of speech editing.
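For readers unfamiliar with the optimization component, the group-relative advantage at the core of standard GRPO takes the form below; where exactly the paper's self-consistency critic and its intelligibility and duration constraints enter the per-sample reward is specific to the paper and only assumed here.

```latex
% Standard GRPO group-relative advantage; the per-sample reward r_i is where
% the described TTS critic score and constraints would plug in (an assumption).
\[
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\left(r_1,\dots,r_G\right)}
                     {\operatorname{std}\left(r_1,\dots,r_G\right)},
\qquad i = 1,\dots,G,
\]
```

Here $G$ candidate edited sequences are sampled per prompt, and the policy is updated with a clipped objective weighted by $\hat{A}_i$.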
The experiments are comprehensive, utilizing a large-scale dataset (Libriheavy) and rigorous benchmarks for evaluation. The authors provide detailed comparisons against state-of-the-art models, demonstrating clear improvements in metrics such as WER, speaker similarity, and perceptual quality. The use of both objective and subjective metrics strengthens the evaluation, although further details on the statistical significance of results would enhance the robustness of the findings.
The paper includes sufficient implementation details, including training configurations and the architecture of the models used. However, the absence of a publicly available code repository limits full reproducibility. Providing access to the code and trained models would significantly enhance the paper's impact and allow for independent verification of results.
While the proposed method shows strong performance, the paper does not address potential limitations in terms of computational efficiency or the scalability of the approach to diverse languages or dialects. Additionally, the reliance on a pre-trained TTS model may introduce biases based on the training data used for that model.
The implications of this research are significant for various applications, including media production, accessibility technologies, and real-time speech editing in communication tools. The ability to edit speech seamlessly could enhance user experience and efficiency in numerous fields, from entertainment to education.
High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform reconstruction. As a result, when operating at ultra-low bitrates, these methods degrade rapidly and often fail to preserve essential information, leading to severe acoustic artifacts and pronounced semantic distortion. To overcome these limitations, we introduce Generative Audio Compression (GAC), a novel paradigm shift from signal fidelity to task-oriented effectiveness. Implemented within the AI Flow framework, GAC is theoretically grounded in the Law of Information Capacity. These foundations posit that abundant computational power can be leveraged at the receiver to offset extreme communication bottlenecks--exemplifying the More Computation, Less Bandwidth philosophy. By integrating semantic understanding at the transmitter with scalable generative synthesis at the receiver, GAC offloads the information burden to powerful model priors. Our 1.8B-parameter model achieves high-fidelity reconstruction of 32kHz general audio at an unprecedented bitrate of 0.275kbps. Even at 0.175kbps, it still preserves a strong intelligible audio transmission capability, which represents an about 3000x compression ratio, significantly outperforming current state-of-the-art neural codecs in maintaining both perceptual quality and semantic consistency.
Primary: Institute of Artificial Intelligence, China Telecom
All Institutions: Institute of Artificial Intelligence, China Telecom
The paper introduces a novel paradigm for audio compression that prioritizes semantic understanding and generative synthesis, achieving unprecedented performance at ultra-low bitrates. This work not only advances the state-of-the-art in audio compression but also opens new avenues for research in generative models and communication theory.
The proposed Generative Audio Compression (GAC) method represents a significant shift from traditional audio compression techniques by focusing on task-oriented effectiveness rather than pure signal fidelity. The integration of semantic understanding at the transmitter and generative synthesis at the receiver is a novel approach that leverages the Law of Information Capacity to optimize the trade-off between computation and bandwidth. The methodology is well-grounded in theoretical frameworks and employs advanced techniques such as latent-variable modeling and variational objectives, showcasing a comprehensive understanding of both audio processing and machine learning principles.
The experiments are robust, covering multiple audio domains (speech, general sound, and music) and employing both objective and subjective evaluation metrics. The results demonstrate GAC's superior performance in maintaining perceptual quality and semantic consistency at extremely low bitrates, significantly outperforming existing state-of-the-art methods. The use of diverse datasets and thorough evaluation metrics strengthens the credibility of the findings.
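As a sanity check on the reported compression ratio, a back-of-the-envelope calculation is shown below, assuming the uncompressed reference is 16-bit mono PCM at 32 kHz (an assumption; the paper's reference format is not restated here).

```latex
% Worked check of the ~3000x figure under the stated assumption.
\[
R_{\text{raw}} = 32{,}000\,\tfrac{\text{samples}}{\text{s}} \times 16\,\tfrac{\text{bits}}{\text{sample}} = 512~\text{kbps},
\qquad
\frac{R_{\text{raw}}}{0.175~\text{kbps}} \approx 2926 \approx 3000\times.
\]
```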
While the paper provides a detailed description of the methodology and experimental setup, it lacks explicit implementation details or links to code repositories, which could hinder reproducibility. The absence of a demo or project URL further limits the ability for others to replicate the results.
One notable limitation is the trade-off between perceptual quality and speaker identity preservation at lower bitrates, which could affect applications requiring high fidelity in speaker recognition. Additionally, the reliance on large model sizes may limit practical deployment in resource-constrained environments.
The implications of GAC are significant for applications in low-bandwidth communication and generative audio-language modeling, potentially transforming how audio is transmitted and processed in various contexts. The approach could lead to advancements in telecommunication, streaming services, and assistive technologies, making high-quality audio accessible even in challenging bandwidth scenarios.
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter noisy reference audios, imperfect text prompts, and diverse downstream processing, which can significantly hurt robustness. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline, including input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models. Our evaluation uncovers substantial robustness gaps in VC: performance can deteriorate sharply under common input shifts and post-processing; long-context and cross-lingual scenarios further expose stability limitations; and both passive noise and proactive perturbation influence generation robustness. Collectively, these findings provide a unified picture of how current VC models fail in practice and introduce a standardized, open-source testbed to support the development of more robust and deployable VC models. We open-source our project at https://github.com/Nanboy-Ronan/RVCBench.
Primary: The University of British Columbia
All Institutions: The University of British Columbia, Vector Institute
The main contribution of this paper is the introduction of RVCBench, a comprehensive benchmark for evaluating the robustness of voice cloning models under realistic conditions. This work significantly advances the understanding of the limitations of current voice cloning technologies and provides a valuable resource for future research aimed at improving their robustness and applicability.
The paper introduces RVCBench, a benchmark designed to evaluate the robustness of voice cloning models across various challenges. The methodology is comprehensive, covering a wide range of robustness tasks and including a significant dataset of 225 speakers and over 14,000 utterances. The authors systematically assess the performance of 11 modern voice cloning models under different conditions, which is a valuable approach to understanding the limitations of current technology. However, the paper could benefit from a more detailed explanation of how the robustness tasks were selected and the specific metrics used for evaluation.
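A hypothetical sketch of how a per-model robustness gap could be computed across deployment conditions is shown below; the function and condition names are illustrative and do not reflect the released RVCBench API.

```python
# Hypothetical robustness-gap computation in the spirit of the benchmark:
# measure each shifted condition's score drop relative to the clean setting.
from typing import Callable, Dict, List

def robustness_gap(
    evaluate: Callable[[str, str], float],   # (model_name, condition) -> metric
    model_name: str,
    conditions: List[str],                   # e.g. ["clean", "noisy_ref", "mp3_64k", "cross_lingual"]
) -> Dict[str, float]:
    """Return the metric drop of every non-clean condition versus 'clean'."""
    clean_score = evaluate(model_name, "clean")
    return {
        cond: clean_score - evaluate(model_name, cond)
        for cond in conditions
        if cond != "clean"
    }
```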
The experiments are well-structured, with a clear focus on identifying performance gaps in voice cloning models under realistic conditions. The inclusion of various input variations and adversarial perturbations is a strong point, as it reflects real-world challenges. The results highlight significant robustness issues, and surfacing them is crucial for advancing the field. However, the paper lacks a comparative analysis with existing benchmarks, which would strengthen its contributions.
The paper mentions that the project is open-sourced, which is a positive aspect for reproducibility. However, it lacks detailed implementation instructions or specific configurations used during experiments, which could hinder other researchers from replicating the results effectively.
One limitation is the potential bias in the selection of speakers and utterances, which may not represent the full diversity of voice characteristics in the real world. Additionally, while the benchmark covers various robustness tasks, it may not encompass all possible deployment scenarios that could affect voice cloning performance.
The findings of this paper have significant implications for the development of more robust voice cloning technologies, which could enhance applications in personalized speech interfaces and dubbing. By identifying and addressing robustness gaps, the research can contribute to safer and more reliable deployment of voice cloning systems in real-world applications.
We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints -- scaling from short loops to 10-minute compositions -- while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities -- such as cover generation, repainting, and vocal-to-BGM conversion -- while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of ACE-Step v1.5, an efficient open-source music generation model that combines novel architectural elements with user-friendly personalization features. This work significantly advances the state of music generation technology, particularly for consumer hardware, while raising important questions regarding reproducibility and ethical implications in the field.
The methodology introduces a hybrid architecture that combines a Language Model (LM) with a Diffusion Transformer (DiT) to generate music. The use of intrinsic reinforcement learning to align the LM's planning capabilities with the DiT's synthesis process is a notable innovation. The model's ability to generate music based on simple user queries and to personalize outputs with minimal input data is a significant advancement in the field of music generation. However, the paper could benefit from a more detailed explanation of the reinforcement learning mechanism and how it mitigates biases.
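To illustrate what "train a LoRA from just a few songs" entails architecturally, here is a minimal, generic LoRA layer sketch; the rank, scaling, and module placement are assumptions rather than ACE-Step's actual adapter configuration.

```python
# Minimal LoRA sketch to illustrate lightweight personalization: a frozen
# linear layer augmented with a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # backbone stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)            # update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

Only the two small low-rank matrices are trained, which is why a handful of songs and a few gigabytes of VRAM can suffice for style personalization.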
The paper claims that ACE-Step v1.5 achieves superior performance on commonly used evaluation metrics compared to existing commercial models. The reported generation times are impressive, especially for consumer hardware, and the ability to run on low VRAM is a practical advantage. However, the paper lacks detailed experimental results, including quantitative comparisons with baseline models, which would strengthen the claims made about performance and efficiency.
The availability of code, model weights, and a demo is a positive aspect, promoting reproducibility. However, the paper does not provide sufficient details on the training process, dataset specifics, or evaluation metrics used, which are crucial for other researchers to replicate the results effectively.
One limitation is the lack of extensive evaluation on diverse datasets to validate the model's performance across various music genres and styles. Additionally, the reliance on intrinsic reinforcement learning may limit the model's adaptability to more complex user preferences that external reward models could capture. The paper also does not address potential ethical considerations regarding music generation and copyright issues.
The potential applications of ACE-Step v1.5 are vast, ranging from aiding music artists in their creative processes to providing tools for content creators. Its ability to generate high-quality music quickly and with low resource requirements could democratize music production, making it accessible to a broader audience. However, the implications of AI-generated music on the music industry and artist livelihoods should be carefully considered.
The main contribution of this paper is the introduction of ACE-Step v1.5, an efficient open-source music generation model that combines advanced methodologies to achieve high-quality outputs on consumer hardware. This work represents a significant step forward in making sophisticated music generation tools available and usable for a broader audience, while also pushing the boundaries of current methodologies in the field.
The paper introduces ACE-Step v1.5, which employs a hybrid architecture combining a Language Model (LM) for planning and a Diffusion Transformer (DiT) for synthesis. The use of intrinsic reinforcement learning to align the LM and DiT is particularly innovative, as it circumvents biases from external reward models. The lightweight personalization feature, allowing users to train a LoRA with minimal data, is a significant advancement for user-centric music generation.
The evaluation metrics presented in the paper indicate that ACE-Step v1.5 outperforms many commercial models in terms of quality while maintaining efficiency. The benchmarks on various hardware configurations (A100 and RTX 3090) demonstrate its practical applicability. However, the specific datasets and detailed experimental setups are not elaborated; providing them would strengthen the credibility of the results.
The paper provides a URL for accessing the code and model weights, which is a positive aspect for reproducibility. However, the lack of detailed descriptions of the training process and datasets used may hinder full reproducibility by other researchers.
The paper does not address potential limitations such as the model's performance on diverse musical genres or its ability to handle complex user prompts. Additionally, the reliance on intrinsic reinforcement learning may limit the model's adaptability to user preferences that are not well-represented in the training data.
ACE-Step v1.5 has the potential to democratize music generation, making high-quality tools accessible to a wider audience, including amateur musicians and content creators. Its capabilities for stylistic control and editing could significantly enhance creative workflows in music production.