Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
Primary: Tsinghua University
All Institutions: Tsinghua University
The main contribution of this paper is the introduction of PACE, a novel framework for pretrained audio continual learning that effectively addresses the unique challenges posed by audio data distributions. This work significantly advances the field by providing a systematic approach to continual learning in audio, demonstrating state-of-the-art performance across multiple benchmarks while offering a foundation for future research in this area.
The methodology presented in this paper is robust and innovative, addressing the unique challenges of continual learning (CL) in audio contexts, particularly the upstream-downstream misalignment that has hindered previous approaches. The introduction of PACE, which combines improved first-session adaptation (FSA) with multi-session adaptation (MSA) and boundary-aware regularization, is a significant advancement. The paper meticulously details the design choices behind each component, demonstrating a clear understanding of the audio domain's intricacies. The use of analytic classifiers and adaptive subspace-orthogonal PEFT is particularly noteworthy, as it showcases a tailored approach to audio CL that diverges from traditional vision-based methods.
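The analytic-classifier component referenced here is, in its generic form, a closed-form ridge-regression head fit on frozen backbone features and updated session by session without revisiting old data. The sketch below illustrates only that generic recipe; the class name, regularization value, and update rule are assumptions and not PACE's exact formulation (which adds further regularization and subspace-orthogonal PEFT).

```python
import numpy as np

class AnalyticClassifier:
    """Ridge-regularized least-squares head fit in closed form on frozen
    backbone features, updated session by session (illustrative sketch)."""

    def __init__(self, feat_dim: int, num_classes: int, lam: float = 1e-2):
        self.G = lam * np.eye(feat_dim)              # regularized Gram matrix
        self.C = np.zeros((feat_dim, num_classes))   # feature-label cross term
        self.W = np.zeros((feat_dim, num_classes))   # classifier weights

    def update(self, feats: np.ndarray, labels: np.ndarray) -> None:
        """Accumulate statistics for one session and re-solve the head.

        feats:  (n, feat_dim) frozen backbone embeddings
        labels: (n,) integer class labels
        """
        Y = np.eye(self.C.shape[1])[labels]          # one-hot targets
        self.G += feats.T @ feats
        self.C += feats.T @ Y
        self.W = np.linalg.solve(self.G, self.C)     # closed-form ridge solution

    def predict(self, feats: np.ndarray) -> np.ndarray:
        return (feats @ self.W).argmax(axis=1)


# Toy usage: two sequential "sessions" over random features.
rng = np.random.default_rng(0)
clf = AnalyticClassifier(feat_dim=128, num_classes=10)
for _ in range(2):
    x = rng.normal(size=(256, 128))
    y = rng.integers(0, 10, size=256)
    clf.update(x, y)
print(clf.predict(x[:5]))
```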
The experimental evaluation is thorough, employing six diverse audio CL benchmarks that effectively highlight the strengths and weaknesses of the proposed method. The results consistently demonstrate that PACE outperforms state-of-the-art methods, providing strong empirical evidence for its effectiveness. The ablation studies further reinforce the validity of the proposed components, illustrating how each contributes to the overall performance. However, the paper could benefit from additional comparisons with more recent methods in the audio domain, if available.
The authors commit to releasing their code and benchmarks, which is a positive aspect for reproducibility. The detailed descriptions of the experimental setup, including hyperparameters and dataset configurations, enhance the likelihood that other researchers can replicate the results. However, the absence of a demo or interactive component limits immediate accessibility for broader audiences.
One limitation is the potential for overfitting in fine-grained tasks, as indicated by the authors. The paper also acknowledges that while PACE narrows the gap to joint training, it does not completely eliminate it, suggesting that further improvements could be made. Additionally, the reliance on specific pretrained models may limit the generalizability of the findings across different audio tasks.
The implications of this work are significant, particularly for applications in speech recognition, audio event detection, and environmental sound understanding. By addressing the challenges of continual learning in audio, the proposed methods could enhance the robustness and adaptability of audio models in real-world scenarios, leading to more effective and reliable systems.
Recent advances in speech synthesis and editing have made speech spoofing increasingly difficult to detect. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semantic effects. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti-spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic-level analysis, we introduce DailyTalkEdit, a new anti-spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in-context learning further improves out-of-domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.
Primary: Shanghai Artificial Intelligence Laboratory
All Institutions: Shanghai Artificial Intelligence Laboratory, Nanjing University
The paper presents HoliAntiSpoof, a pioneering framework that integrates holistic speech spoofing analysis with ALLMs, significantly advancing the field of audio anti-spoofing. The innovative approach and comprehensive evaluation demonstrate its potential to enhance speech security and understanding of spoofing behaviors.
The paper introduces HoliAntiSpoof, a novel framework that reformulates speech anti-spoofing as a unified text generation task using an audio large language model (ALLM). This approach allows for holistic analysis of spoofing techniques, integrating authenticity classification, spoofing method identification, and semantic influence analysis. The methodology is innovative as it combines traditional signal-level detection with semantic reasoning, addressing a gap in existing research that primarily focuses on binary classification. The introduction of the DailyTalkEdit dataset to support semantic analysis is a significant contribution, allowing for more realistic evaluations of spoofing impacts in conversational contexts.
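To make the "unified text generation" reformulation concrete, the snippet below shows one plausible instruction/target pairing for such an ALLM judge. The field names and wording are hypothetical illustrations, not taken from HoliAntiSpoof or DailyTalkEdit.

```python
# Hypothetical instruction/target pair illustrating spoofing analysis as
# unified text generation (field names are illustrative, not the paper's).
instruction = (
    "Listen to the audio clip and report: (1) whether it is bona fide or "
    "spoofed, (2) the likely spoofing method, (3) the affected speech "
    "attributes, and (4) the semantic impact of the manipulation."
)

target = (
    "Verdict: spoofed. "
    "Method: neural speech editing (word-level replacement). "
    "Affected attributes: lexical content, local prosody. "
    "Semantic impact: the edited segment reverses the speaker's stated "
    "intent in the conversation."
)

# During training, (audio, instruction) -> target would be optimized with a
# standard next-token cross-entropy loss, as in other ALLM instruction tuning.
print(instruction, "\n---\n", target)
```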
The experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across various settings, including in-domain and out-of-domain evaluations. The authors provide extensive results that validate the effectiveness of their model, particularly in terms of robustness to domain shifts. The use of multiple datasets, including their newly proposed ones, strengthens the experimental design. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The authors have made their data and code publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training procedures, which could hinder full reproducibility for other researchers.
One limitation is the reliance on the quality of the datasets, particularly the DailyTalkEdit, which may not cover all possible spoofing scenarios. Additionally, while the model shows promise in generalization, the performance on truly unseen spoofing methods and languages remains to be fully validated. The paper also does not address potential adversarial uses of the methodology, which could be a concern given the nature of the research.
The research has significant implications for speech security, particularly in combating the rising threats posed by speech deepfakes. By providing a more nuanced understanding of spoofing techniques and their semantic impacts, the framework could enhance the development of more robust detection systems. However, there is a risk that the methodologies developed could also be exploited by malicious actors to improve spoofing techniques.
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation (KD) is effective for LALM compression, distillation of the cross-modal projection module (Projector) remains underexplored, and existing methods often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment (AwCKA), a novel mechanism that highlights important time steps and addresses dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from the audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher into a compact 1.1B-parameter student that consistently outperforms the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
Primary: Harbin Institute of Technology
All Institutions: Harbin Institute of Technology, Ping An Technology (Shenzhen) Co
The paper presents a novel knowledge distillation framework, PL-Distill, that effectively compresses large audio-language models for speech emotion recognition while maintaining high performance. The innovative methodologies and comprehensive experimental evaluations contribute significantly to the advancement of knowledge distillation techniques in multimodal machine learning.
The proposed PL-Distill framework introduces a dual-level knowledge distillation approach that effectively addresses the challenges of distilling large audio-language models for speech emotion recognition. The incorporation of Attention-weighted Centered Kernel Alignment (AwCKA) is particularly innovative, as it dynamically prioritizes important audio tokens based on attention scores, thereby enhancing the alignment of audio embeddings despite dimensional mismatches. This methodological advancement is well-justified in the context of previous work and represents a significant contribution to the field of knowledge distillation in multimodal models.
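For readers unfamiliar with CKA-style objectives, the sketch below shows one plausible way to weight time steps by attention scores before computing linear CKA between teacher and student embeddings of different dimensions. It is an assumption-laden illustration of the general idea; the paper's AwCKA may differ in how the weights are derived and applied.

```python
import torch

def attention_weighted_cka(teacher_feats, student_feats, attn_weights, eps=1e-8):
    """Linear CKA between teacher and student frame embeddings, with each
    time step scaled by an attention-derived weight (illustrative sketch;
    the paper's AwCKA formulation may differ in detail).

    teacher_feats: (T, d_t) teacher audio embeddings per time step
    student_feats: (T, d_s) student audio embeddings per time step
    attn_weights:  (T,)     non-negative importance scores (e.g. attention)
    """
    w = attn_weights / (attn_weights.sum() + eps)       # normalize weights
    X = teacher_feats * w.sqrt().unsqueeze(-1)          # reweight time steps
    Y = student_feats * w.sqrt().unsqueeze(-1)
    X = X - X.mean(dim=0, keepdim=True)                 # center features
    Y = Y - Y.mean(dim=0, keepdim=True)

    # Linear CKA works on cross-covariance norms, so it tolerates d_t != d_s.
    xty = (X.T @ Y).pow(2).sum()
    xtx = (X.T @ X).pow(2).sum().sqrt()
    yty = (Y.T @ Y).pow(2).sum().sqrt()
    cka = xty / (xtx * yty + eps)
    return 1.0 - cka                                    # similarity turned into a loss


# Toy check with mismatched dimensions (e.g. a large teacher vs. a small student).
t = torch.randn(50, 1024)
s = torch.randn(50, 256)
a = torch.rand(50)
print(attention_weighted_cka(t, s, a))
```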
The experimental evaluation is robust, utilizing three widely recognized datasets (IEMOCAP, RAVDESS, and SAVEE) to validate the effectiveness of the proposed method. The results demonstrate that PL-Distill not only compresses the teacher model significantly but also outperforms both the teacher and state-of-the-art models across all metrics. The ablation studies further substantiate the contributions of each component of the framework, providing a clear understanding of the impact of the proposed methods.
The paper provides detailed descriptions of the model architecture, training strategies, and evaluation metrics, which are essential for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully generalize to other SER tasks or datasets. Additionally, while the method shows promise, the computational efficiency of the distillation process itself could be further explored to ensure practical applicability in real-world scenarios.
The implications of this research extend beyond speech emotion recognition, as the PL-Distill framework could be adapted for various audio-language tasks, potentially improving the efficiency of deploying large models in resource-constrained environments. The focus on effective knowledge transfer in multimodal contexts may also inspire future research in related areas.
We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre-trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A-ControlNet and T2A-Adapter, and show that the T2A-Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on the AudioSet-Strong benchmark in both event-level and segment-level F1 scores. We further extend this framework to audio editing, proposing T2A-Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The paper presents the Audio ControlNet framework, which enhances text-to-audio generation and editing capabilities through lightweight auxiliary networks, achieving state-of-the-art performance with efficient parameter usage. The methodology and results indicate a meaningful contribution to the field of audio generation, with significant implications for creative industries.
The paper introduces the Audio ControlNet framework, which innovatively builds on pre-trained text-to-audio (T2A) models by integrating lightweight auxiliary networks for fine-grained control over audio attributes such as loudness, pitch, and sound events. The two proposed architectures, T2A-ControlNet and T2A-Adapter, are well-structured, with T2A-Adapter demonstrating efficiency through fewer parameters while maintaining high performance. The methodology is sound, leveraging established techniques from the ControlNet paradigm and adapting them to the audio domain, thus showcasing a thoughtful approach to enhancing existing models without extensive retraining.
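The adapter idea described here can be pictured as a small, zero-initialized residual branch attached to a frozen backbone block. The sketch below is a generic ControlNet/adapter-style module; the dimensions, activation choices, and injection point are assumptions rather than T2A-Adapter's published architecture.

```python
import torch
import torch.nn as nn

class ControlAdapter(nn.Module):
    """Lightweight adapter that injects a frame-level control signal
    (e.g. a loudness or pitch contour) into a frozen backbone block via a
    zero-initialized residual projection (illustrative sketch, not the
    paper's exact T2A-Adapter architecture)."""

    def __init__(self, control_dim: int, hidden_dim: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(control_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Zero-init the output projection so training starts from the
        # unmodified pretrained behavior, as in ControlNet-style designs.
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, hidden: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # hidden:  (B, T, hidden_dim) activations from a frozen backbone block
        # control: (B, T, control_dim) time-aligned control features
        return hidden + self.proj(self.encode(control))


# Toy usage: only the adapter's parameters would be trained.
adapter = ControlAdapter(control_dim=1, hidden_dim=512)
h = torch.randn(2, 200, 512)          # frozen backbone activations
loudness = torch.rand(2, 200, 1)      # per-frame loudness curve
print(adapter(h, loudness).shape)     # torch.Size([2, 200, 512])
```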
The experiments are comprehensive, utilizing the AudioSet-Strong dataset for both training and evaluation, which is appropriate given the task. The results indicate that T2A-Adapter achieves state-of-the-art performance in sound event detection metrics, outperforming existing models while using significantly fewer parameters. The paper includes both objective metrics (F1 scores) and subjective evaluations (MOS), providing a well-rounded assessment of model performance. However, the paper could benefit from more detailed comparisons with a broader range of baseline models to further validate its claims.
The authors mention plans to release models, code, dataset pipelines, and benchmarks, which is a positive step towards reproducibility. However, specific implementation details, such as hyperparameter settings and training configurations, could be more explicitly stated to enhance clarity and facilitate replication by other researchers.
The paper acknowledges limitations, such as the computational constraints that prevented exhaustive hyperparameter searches and the focus on a limited set of control conditions. Additionally, the reliance on generalization for multi-condition control at inference time may not be robust across all scenarios. Future work is suggested to explore richer control signals and more comprehensive multi-condition training.
The framework has significant potential applications in sound design, music creation, and video production, where precise audio generation and editing are crucial. The ability to manipulate audio attributes with fine granularity can enhance creative workflows and enable new forms of audio content generation. However, ethical considerations regarding the misuse of generated audio, such as impersonation or disinformation, must be addressed to ensure responsible deployment.
Transformer-based models have shown strong performance in speech deepfake detection, largely due to the effectiveness of the multi-head self-attention (MHSA) mechanism. MHSA provides frame-level attention scores, which are particularly valuable because deepfake artifacts often occur in small, localized regions along the temporal dimension of speech. This makes fine-grained frame modeling essential for accurately detecting subtle spoofing cues. In this work, we propose fine-grained frame modeling (FGFM) for MHSA-based speech deepfake detection, where the most informative frames are first selected through a multi-head voting (MHV) module. These selected frames are then refined via a cross-layer refinement (CLR) module to enhance the model's ability to learn subtle spoofing cues. Experimental results demonstrate that our method outperforms the baseline model and achieves Equal Error Rates (EERs) of 0.90%, 1.88%, and 6.64% on the LA21, DF21, and ITW datasets, respectively. These consistent improvements across multiple benchmarks highlight the effectiveness of our fine-grained modeling for robust speech deepfake detection.
Primary: Hanoi University of Science and Technology
All Institutions: Hanoi University of Science and Technology, Nanyang Technological University
The paper presents a novel approach to speech deepfake detection through fine-grained frame modeling, significantly improving the ability to capture subtle artifacts. This work is a meaningful contribution to the field of audio processing and machine learning, addressing critical challenges in the detection of synthetic speech.
The proposed methodology introduces a novel fine-grained frame modeling (FGFM) approach that effectively enhances the multi-head self-attention (MHSA) mechanism for speech deepfake detection. The integration of the multi-head voting (MHV) module to select salient frames and the cross-layer refinement (CLR) module to aggregate information across layers is innovative. This dual approach addresses the limitations of conventional MHSA by focusing on localized artifacts, which are critical for detecting subtle spoofing cues. The methodology is well-structured and builds upon existing transformer architectures, demonstrating a clear understanding of the challenges in deepfake detection.
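One simple way to realize "multi-head voting" over MHSA attention maps is to let each head nominate its most-attended frames and keep the frames with the most votes, as sketched below. The function name, aggregation over queries, and tie-breaking are assumptions and may not match the paper's MHV module exactly.

```python
import torch

def multi_head_voting(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Select the k most-attended frames by letting each attention head
    'vote' for its salient frames (illustrative sketch of the MHV idea;
    the paper's voting rule may differ).

    attn: (H, T, T) attention weights from one MHSA layer
    Returns indices of the k selected frames.
    """
    H, T, _ = attn.shape
    # Total attention each key frame receives under each head, then top-k per head.
    per_head_topk = attn.sum(dim=1).topk(k, dim=-1).indices   # (H, k)
    votes = torch.zeros(T)
    for head in range(H):
        votes[per_head_topk[head]] += 1                       # tally votes across heads
    return votes.topk(k).indices                              # frames with the most votes


# Toy usage on random attention maps (8 heads, 100 frames).
attn = torch.softmax(torch.randn(8, 100, 100), dim=-1)
print(multi_head_voting(attn, k=10))
```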
The experimental evaluation is robust, utilizing multiple datasets (ASVspoof 2021 LA, DF, and ITW) to validate the effectiveness of the proposed method. The reported Equal Error Rates (EER) indicate significant improvements over baseline models, showcasing the method's effectiveness across diverse conditions. The inclusion of ablation studies further strengthens the evaluation, providing insights into the contributions of individual components of the proposed framework.
The paper provides sufficient detail regarding the experimental setup, including model configurations and training procedures, which supports reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings. Future work should consider making the implementation accessible to enhance reproducibility.
While the proposed method shows promising results, it may still be sensitive to variations in the quality of the input audio, such as background noise or recording conditions. Additionally, the reliance on specific datasets may limit the generalizability of the findings to real-world applications. The paper could benefit from a discussion on how the model performs under such conditions.
The implications of this research are significant, particularly in the context of biometric security and misinformation. As deepfake technology becomes more sophisticated, effective detection methods are crucial for safeguarding against potential abuses in various sectors, including finance and communication. The proposed FGFM approach could contribute to the development of more reliable detection systems, thereby enhancing trust in voice-based interactions.
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 synthetic unsafe spoken dialogues in English, each consisting of 3-10 turns, where a single dialogue turn contains content from one of 8 harmful categories (e.g., violence) at one of 5 severity grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, and MERaLiON) as zero-shot judges that output a scalar safety score in [0,1] from audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure each judge's sensitivity in detecting unsafe content, its specificity in ordering severity levels, and the stability of its score across dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large transcriptions can significantly reduce sensitivity in transcription-only modes while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.
Primary: Technion – Israel Institute of Technology
All Institutions: Technion – Israel Institute of Technology, Carnegie Mellon University
The main contribution of this paper is the introduction of a controlled benchmark and systematic study of large audio-language models (LALMs) as automated safety judges for multi-turn spoken dialogues. This work addresses a critical gap in the evaluation of spoken dialogue systems, highlighting the importance of audio-specific cues and transcription fidelity in assessing socially harmful content. The comprehensive analysis of model performance across various configurations provides valuable insights for practitioners in the field.
The methodology presented in this paper is robust and innovative, focusing on the generation of unsafe spoken dialogues and the evaluation of large audio-language models (LALMs) as safety judges. The controlled generation of unsafe dialogue variants, along with the systematic benchmarking of LALMs across different modalities, is a significant contribution to the field. The use of human raters to validate the generated unsafe dialogues and the severity scale adds credibility to the findings. The paper also effectively addresses the challenges of audio-specific cues and transcription errors, which are often overlooked in text-centric assessments.
The experimental evaluation is thorough, with a well-defined dataset of 24,000 dialogues and a clear methodology for assessing the performance of the LALMs. The results reveal important trade-offs between sensitivity, specificity, and stability across different models and modalities. The use of various prompting strategies to optimize performance further demonstrates a comprehensive approach to evaluating the models. However, the paper could benefit from more detailed statistical analysis and comparisons with existing benchmarks in the field.
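For concreteness, the toy code below shows how the three reported properties (sensitivity to unsafe content, severity ordering, and turn-level stability of the score) could be computed from scalar judge outputs. The metric definitions, the 0.5 threshold, and the use of Spearman correlation are assumptions rather than the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def judge_metrics(scores, is_unsafe, severity, scores_by_turn, thresh=0.5):
    """Toy versions of the three properties reported for LALM judges
    (sensitivity, severity ordering, turn-level stability); the paper's
    exact definitions and thresholds may differ.

    scores:         (N,) scalar safety scores in [0, 1] (higher = more unsafe)
    is_unsafe:      (N,) ground-truth binary labels
    severity:       (N,) ground-truth severity grades (1 = very mild ... 5 = severe)
    scores_by_turn: (N, T) the judge's score after each dialogue turn
    """
    scores, is_unsafe, severity = map(np.asarray, (scores, is_unsafe, severity))
    scores_by_turn = np.asarray(scores_by_turn)

    # Sensitivity: fraction of unsafe dialogues flagged above the threshold.
    sensitivity = (scores[is_unsafe == 1] >= thresh).mean()

    # Severity ordering: rank correlation between scores and severity grades.
    ordering, _ = spearmanr(scores[is_unsafe == 1], severity[is_unsafe == 1])

    # Stability: mean per-dialogue standard deviation of the score across turns.
    instability = scores_by_turn.std(axis=1).mean()

    return {"sensitivity": sensitivity, "severity_ordering": ordering,
            "turn_instability": instability}


# Toy usage with random data (100 dialogues, up to 10 turns each).
rng = np.random.default_rng(0)
print(judge_metrics(rng.random(100), rng.integers(0, 2, 100),
                    rng.integers(1, 6, 100), rng.random((100, 10))))
```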
The paper mentions plans to release the dataset and code, which is crucial for reproducibility. However, specific implementation details, such as the exact configurations used for the LALMs and the human raters' instructions, should be more explicitly stated to facilitate replication of the study. The inclusion of supplementary materials or appendices would enhance reproducibility.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world dialogues. Additionally, the potential for bias in the generated unsafe dialogues and the subjective nature of human ratings could impact the validity of the findings. The paper also acknowledges the risk of misuse of the benchmark data, which is an important ethical consideration.
The findings of this research have significant implications for the development of safer spoken dialogue systems and voice agents. By providing a systematic approach to evaluating harmful content in multi-turn dialogues, the work aims to improve the safety and reliability of voice interfaces. However, the potential for misuse of the generated data and the reliance on automated judges without human oversight could lead to unintended consequences in real-world applications.
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Independent Researcher
The main contribution of this paper is the introduction of UniAudio 2.0, a unified audio language model that leverages a novel tokenization strategy and specialized architecture to achieve strong performance in both understanding and generation tasks. This work represents a meaningful advancement in the field of audio language modeling, addressing key challenges and setting the stage for future research in audio processing and generation.
The paper proposes a novel audio tokenizer, ReasoningCodec, which effectively separates audio representations into reasoning and reconstruction tokens. This dual-token approach allows for higher-level abstractions while maintaining fidelity in audio reconstruction. The architecture's functional layer specialization is a significant methodological advancement, optimizing the processing of audio and text tokens across different transformer layers, which is a departure from the traditional uniform approach. The introduction of auditory sentences as a means to unify task construction is innovative and enhances the model's ability to handle complex audio tasks.
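As a loose mental model of "factorizing audio into two discrete streams", the sketch below splits one frame-level representation into a coarse token stream plus a residual detail stream via nearest-neighbor codebook lookups. This is purely illustrative: ReasoningCodec's reasoning tokens are text-aligned and trained with objectives not reproduced here, and all names and sizes below are assumptions.

```python
import torch
import torch.nn as nn

class DualStreamTokenizer(nn.Module):
    """Minimal sketch of splitting an audio representation into two discrete
    streams: a coarse 'reasoning-like' stream and a finer 'reconstruction-like'
    residual stream. Purely illustrative; ReasoningCodec's encoders, codebook
    sizes, and training losses are not reproduced here."""

    def __init__(self, dim: int = 256, n_reason: int = 256, n_recon: int = 1024):
        super().__init__()
        self.reason_codebook = nn.Parameter(torch.randn(n_reason, dim))
        self.recon_codebook = nn.Parameter(torch.randn(n_recon, dim))

    @staticmethod
    def _quantize(x: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
        # Nearest codebook entry per frame via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
        # (the ||x||^2 term is constant per frame, so it can be dropped).
        scores = x @ codebook.t()                                    # (B, T, N)
        return (codebook.pow(2).sum(-1) - 2 * scores).argmin(dim=-1)  # (B, T) token ids

    def forward(self, feats: torch.Tensor):
        reason_ids = self._quantize(feats, self.reason_codebook)    # high-level stream
        residual = feats - self.reason_codebook[reason_ids]         # what the first stream misses
        recon_ids = self._quantize(residual, self.recon_codebook)   # acoustic detail stream
        return reason_ids, recon_ids


# Toy usage on frame-level encoder output.
feats = torch.randn(2, 100, 256)
tok = DualStreamTokenizer()
r_ids, c_ids = tok(feats)
print(r_ids.shape, c_ids.shape)       # torch.Size([2, 100]) twice
```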
The authors conducted extensive experiments across various speech, sound, and music tasks, demonstrating competitive performance on in-domain evaluations. The model's ability to generalize to unseen tasks in few-shot and zero-shot settings is particularly noteworthy, showcasing its robustness and versatility. However, the paper could benefit from more detailed quantitative results and comparisons with state-of-the-art models to better contextualize its performance.
The authors commit to providing demo, code, and checkpoints, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics and hyperparameter settings that would facilitate full reproducibility by other researchers.
The paper acknowledges potential risks associated with misuse of the technology, such as impersonation and copyright issues. However, it does not delve deeply into the technical limitations of the model itself, such as potential biases in the training data or the scalability of the approach to more complex audio tasks.
The proposed model has significant implications for applications in creative assistance, human-computer interaction, and audio generation. However, the authors rightly caution against potential misuse, emphasizing the need for responsible deployment practices to mitigate risks associated with audio generation technologies.
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
Primary: National Taiwan University
All Institutions: National Taiwan University
The main contribution of this paper is the introduction of URSA-GAN, a unified framework for robust speech adaptation that effectively addresses domain mismatches in ASR and SE through innovative use of dual-embedding architectures and GANs. This work significantly advances the state of the art in speech processing, providing a scalable solution for real-world applications.
The proposed URSA-GAN framework presents a novel approach to address the challenges of domain adaptation in ASR and SE by leveraging a dual-embedding architecture that captures noise and channel characteristics. This method is innovative in its use of generative adversarial networks (GANs) combined with dynamic stochastic perturbation for enhanced robustness. The architecture is well-structured, with a clear delineation of roles for the noise encoder, channel encoder, and generator, which collectively facilitate effective domain adaptation. The introduction of instance-level embeddings and the use of feature-wise linear modulation (FiLM) for conditioning the generator on noise and channel characteristics are particularly noteworthy. However, the complexity of the model may pose challenges in practical applications.
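Since the review highlights FiLM conditioning, the following minimal FiLM layer shows how a concatenated noise/channel embedding can scale and shift generator feature maps. The dimensions, the (1 + gamma) parameterization, and the fusion point are assumptions, not URSA-GAN's published configuration.

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Feature-wise linear modulation of generator activations by a
    concatenated noise/channel embedding (illustrative sketch; URSA-GAN's
    actual dimensions and fusion points are not specified here)."""

    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, num_channels)  # per-channel scale
        self.to_beta = nn.Linear(cond_dim, num_channels)   # per-channel shift

    def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, T) generator feature maps; cond: (B, cond_dim)
        gamma = self.to_gamma(cond).unsqueeze(-1)          # (B, C, 1)
        beta = self.to_beta(cond).unsqueeze(-1)
        return (1 + gamma) * feats + beta                  # modulate each channel


# Toy usage: condition on concatenated noise and channel embeddings.
noise_emb, channel_emb = torch.randn(4, 64), torch.randn(4, 64)
film = FiLMConditioning(cond_dim=128, num_channels=256)
feats = torch.randn(4, 256, 400)
print(film(feats, torch.cat([noise_emb, channel_emb], dim=-1)).shape)
```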
The experiments conducted are extensive and cover a variety of datasets and scenarios, demonstrating the effectiveness of URSA-GAN in improving ASR and SE performance under mismatched conditions. The results show significant improvements in character error rates and perceptual metrics, validating the framework's robustness. The evaluation metrics used are appropriate, and the comparative analysis against baseline models and previous works strengthens the claims made by the authors. However, the paper could benefit from more detailed ablation studies to further clarify the contributions of individual components.
The paper provides a comprehensive description of the methodology, including the architecture, training process, and evaluation metrics, which facilitates reproducibility. However, the lack of a publicly available code repository or demo limits the ability of other researchers to replicate the experiments fully. Clearer documentation of hyperparameters and training configurations would enhance reproducibility.
One limitation is the reliance on pre-trained models for the noise and channel encoders, which may not generalize well to all domains. Additionally, the model's complexity could hinder its deployment in real-time applications, especially on resource-constrained devices. The performance gap between URSA-GAN and models trained on labeled target-domain data suggests that while the framework is effective, it may still require some labeled data for optimal performance.
The proposed framework has significant implications for real-world applications of ASR and SE, particularly in environments with varying noise and channel conditions. By improving the robustness of these systems, URSA-GAN could enhance user experiences in various domains, including telecommunications, voice assistants, and hearing aids. The approach also opens avenues for further research in domain adaptation techniques across different audio processing tasks.
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving 23% lower WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at https://braskai.github.io/pfluxtts/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PFluxTTS, a hybrid TTS system that effectively combines duration-guided and alignment-free models to improve naturalness and stability in speech synthesis. This work represents a meaningful step forward in addressing key challenges in the field of text-to-speech technology, particularly in cross-lingual applications.
The proposed methodology of PFluxTTS is innovative, combining a dual-decoder architecture that integrates both duration-guided and alignment-free models through inference-time vector-field fusion. This hybrid approach effectively addresses the stability-naturalness trade-off prevalent in existing TTS systems. The use of FLUX-based speech-prompt embeddings for robust cross-lingual voice cloning is a significant advancement, allowing the model to maintain speaker identity across languages without relying on prompt transcripts. Additionally, the integration of a modified PeriodWave vocoder with super-resolution capabilities to synthesize high-quality audio at 48 kHz from low-rate mel features is a noteworthy enhancement.
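The inference-time vector-field fusion can be pictured as blending the two decoders' predicted fields at every ODE step, as in the hedged sketch below. The Euler integrator, the fixed fusion weight alpha, and the callable signatures are assumptions rather than PFluxTTS's actual sampler.

```python
import torch

def fused_flow_sampling(v_duration, v_alignfree, x0, alpha=0.5, steps=32):
    """Euler sampling of a flow-matching model whose vector field is an
    inference-time blend of a duration-guided and an alignment-free decoder
    (illustrative sketch; PFluxTTS's fusion rule and schedules may differ).

    v_duration, v_alignfree: callables (x, t) -> predicted vector field
    x0: initial noise tensor, e.g. (B, T, mel_dim)
    """
    x = x0
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        v = alpha * v_duration(x, t) + (1 - alpha) * v_alignfree(x, t)  # fuse fields
        x = x + dt * v                                                  # Euler step
    return x


# Toy usage with stand-in "decoders" (real ones would be neural networks).
dur = lambda x, t: -x                    # toy duration-guided field
free = lambda x, t: torch.ones_like(x)   # toy alignment-free field
print(fused_flow_sampling(dur, free, torch.randn(1, 100, 80)).shape)
```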
The experimental evaluation is comprehensive, utilizing a variety of datasets that reflect real-world challenges in TTS, particularly in cross-lingual scenarios. The authors provide both subjective and objective metrics to assess performance, demonstrating that PFluxTTS outperforms several state-of-the-art systems in terms of naturalness and speaker similarity. The use of statistical significance tests to validate the results adds rigor to the findings. However, the reliance on a limited number of baselines may restrict the generalizability of the conclusions.
The paper includes detailed descriptions of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to replicate the results fully. The authors could improve reproducibility by providing access to their training data and model checkpoints.
One limitation of the study is the potential overfitting to the specific datasets used for training and evaluation, which may not represent the full diversity of real-world speech. Additionally, while the system shows robustness in challenging conditions, the performance on extremely noisy or low-quality inputs is not thoroughly explored. The authors also note that the model's performance may vary with different languages, which could limit its applicability in multilingual contexts.
The advancements presented in PFluxTTS have significant implications for applications in AI dubbing, virtual assistants, and accessibility technologies. By improving cross-lingual voice cloning and audio quality, the system can enhance user experience in multilingual environments, making technology more inclusive. Furthermore, the research contributes to the ongoing development of high-fidelity TTS systems, which can benefit various industries, including entertainment, education, and customer service.
We propose a data-driven sparse recovery framework for hybrid spherical-linear microphone arrays using singular value decomposition (SVD) of the transfer operator. The SVD yields orthogonal microphone and field modes, reducing to spherical harmonics (SH) in the case of a spherical microphone array (SMA) alone, while incorporating linear microphone arrays (LMAs) introduces complementary modes beyond SH. Modal analysis reveals consistent divergence from SH across frequency, confirming the improved spatial selectivity. Experiments in reverberant conditions show reduced energy-map mismatch and angular error across frequency, distance, and source count, outperforming SMA-only and direct-concatenation baselines. The results demonstrate that SVD-modal processing provides a principled and unified treatment of hybrid arrays for robust sparse sound-field reconstruction.
Primary: The University of Sydney
All Institutions: The University of Sydney
The main contribution of this paper is the introduction of a unified SVD-modal framework for sparse sound field reconstruction using hybrid microphone arrays, which significantly improves spatial selectivity and robustness in reverberant environments. This work provides a principled approach that advances the state of the art in audio processing and sound field analysis, addressing key limitations of existing methods.
The proposed methodology leverages singular value decomposition (SVD) to derive a unified modal solution for sound field reconstruction using hybrid spherical-linear microphone arrays. This approach is innovative as it generalizes existing spherical harmonic (SH) processing while introducing complementary modes from linear microphone arrays (LMAs). The paper effectively integrates theoretical foundations with practical applications, demonstrating a clear understanding of the challenges posed by reverberant environments and the limitations of previous methods. The modal analysis and the use of a well-conditioned dictionary for sparse recovery are particularly noteworthy, as they provide a robust framework for addressing the underdetermined nature of the problem.
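Below is a minimal numpy sketch of the overall recipe under assumed shapes and a plain matching-pursuit solver rather than the paper's dictionary design: decompose the hybrid-array transfer operator with an SVD, project the measurements onto the retained microphone modes, and solve a sparse recovery problem in that modal domain.

```python
import numpy as np

def svd_modal_sparse_recovery(H, p, rank, n_sources):
    """Sketch of SVD-modal processing for a hybrid array: decompose the
    transfer operator, project measurements onto the retained microphone
    modes, and run a simple matching-pursuit recovery in the modal domain.
    Illustrative only; the paper's dictionary design and solver may differ.

    H: (num_mics, num_grid) transfer matrix from candidate source positions
    p: (num_mics,) measured sound pressure vector
    """
    U, s, Vh = np.linalg.svd(H, full_matrices=False)
    U_r, s_r, Vh_r = U[:, :rank], s[:rank], Vh[:rank]   # retained modes
    y = U_r.conj().T @ p                                # modal-domain measurements
    D = np.diag(s_r) @ Vh_r                             # modal dictionary

    # Plain orthogonal matching pursuit over the modal dictionary.
    support, residual = [], y.copy()
    for _ in range(n_sources):
        corr = np.abs(D.conj().T @ residual)
        corr[support] = 0
        support.append(int(corr.argmax()))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(H.shape[1], dtype=complex)
    x[support] = coef
    return x                                            # sparse source amplitudes


# Toy check: 36 microphones, 200 candidate directions, 2 active sources.
rng = np.random.default_rng(1)
H = rng.normal(size=(36, 200)) + 1j * rng.normal(size=(36, 200))
truth = np.zeros(200, dtype=complex); truth[[20, 150]] = [1.0, 0.7]
est = svd_modal_sparse_recovery(H, H @ truth, rank=30, n_sources=2)
print(np.flatnonzero(np.abs(est) > 1e-6))               # indices of recovered sources
```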
The experimental evaluation is comprehensive, utilizing simulations in reverberant conditions to assess the performance of the proposed method against baseline techniques such as SMA-only and residue refinement. The metrics employed, including energy map mismatch and angular error, are appropriate for the task and provide a clear indication of the method's effectiveness. The results consistently demonstrate the advantages of the SVD-modal framework, particularly in terms of spatial accuracy and robustness under varying conditions, which strengthens the paper's claims.
The paper lacks specific implementation details that would facilitate reproducibility, such as access to the datasets used for training and testing, or the code for the proposed algorithm. While the methodology is well described, the absence of a project URL or demo limits the ability of other researchers to replicate the findings. Clearer documentation and sharing of resources would enhance reproducibility.
One limitation of the study is the reliance on simulated environments, which may not fully capture the complexities of real-world acoustic conditions. Additionally, the trade-off between energy-map fidelity and localization accuracy when varying the number of modes could be further explored. The paper suggests future work on optimal mode selection, indicating that the current approach may not be universally applicable across all scenarios.
The proposed framework has significant implications for audio processing applications, particularly in environments where accurate sound field reconstruction is critical, such as in virtual reality, augmented reality, and advanced audio capture technologies. By improving the spatial resolution and robustness of sound field reconstruction, this work could enhance user experiences in immersive audio applications and contribute to advancements in spatial audio technologies.
Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.
Primary: The University of Melbourne
All Institutions: The University of Melbourne, Wuhan University, The Hong Kong University of Science and Technology (Guangzhou), The University of Auckland
The paper presents a pioneering approach to emotional TTS through activation steering, significantly advancing the field by enabling composable emotional expression and challenging existing paradigms in TTS architecture. The methodology is innovative, and while the experimental results are promising, further validation and implementation details would strengthen the contributions to the field.
The paper introduces a novel framework for emotional TTS that leverages activation steering via latent direction vectors. This approach is significant as it allows for composable and controllable emotional expression, addressing the limitations of existing TTS systems that typically enforce a single emotion per utterance. The methodology is well-structured, systematically analyzing the linear steerability of emotion representations and proposing a quantitative steering framework. The introduction of multi-rater evaluation protocols is particularly noteworthy, as it enhances the assessment of emotional synthesis quality.
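To make the steering mechanism concrete, the sketch below shows one generic way to add a scaled emotion direction to a chosen layer's hidden states at inference time via a PyTorch forward hook; the injection layer, scaling factor, and the hypothetical `tts.lm.layers[12]` path are illustrative assumptions, not the paper's reported configuration.

```python
import torch

def register_emotion_steering(layer: torch.nn.Module,
                              direction: torch.Tensor,
                              alpha: float = 1.0):
    """Add a scaled emotion direction to a layer's hidden states at inference.

    `direction` is a vector in the layer's hidden dimension, e.g. the mean
    difference between activations of 'happy' and 'neutral' prompts.
    """
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * unit.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Usage (hypothetical module path inside a hybrid TTS language module):
# handle = register_emotion_steering(tts.lm.layers[12], happy_minus_neutral, alpha=4.0)
# ... synthesize ...
# handle.remove()
```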
The experiments conducted are robust, demonstrating the effectiveness of the proposed method in generating mixed-emotion synthesis and addressing text-emotion mismatches. The results indicate that emotional prosody is primarily synthesized by the TTS language module, which is a significant finding that challenges previous assumptions about TTS architecture. However, the paper could benefit from more extensive datasets and comparisons with state-of-the-art systems to further validate the claims.
The paper lacks detailed implementation information that would facilitate reproducibility. While the methodology is described, the absence of specific parameters, datasets, and code availability limits the ability of other researchers to replicate the results. Including a supplementary material section with these details would enhance the paper's reproducibility.
One limitation of the study is the potential overfitting to the datasets used for training and evaluation, which may not generalize well to all types of emotional speech. Additionally, the paper does not thoroughly address the computational efficiency of the proposed method, which is crucial for real-time applications.
The implications of this research are significant for various applications, including virtual assistants, gaming, and mental health support systems, where nuanced emotional expression can enhance user experience. The ability to generate human-like emotional speech can lead to more engaging and relatable interactions in AI systems. The paper presents a pioneering approach to emotional TTS through activation steering, significantly advancing the field by enabling composable emotional expression and challenging existing paradigms in TTS architecture. The methodology is innovative, and while the experimental results are promising, further validation and implementation details would strengthen the contributions to the field.
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
Primary: Meta
All Institutions: Meta, Institut Polytechnique de Paris
The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative approaches. This work significantly advances the field by providing a more coherent and integrated method for audio-visual alignment, with promising applications across multiple domains.
The proposed Conditional Flow Matching (CFM) framework represents a significant methodological advancement by reframing visually-guided acoustic highlighting as a generative problem rather than a discriminative one. This shift allows for a more nuanced approach to audio remixing, addressing the inherent ambiguities present in the task. The introduction of a rollout loss to mitigate prediction errors during iterative flow-based generation is a clever solution to the problem of trajectory drift, enhancing the stability of the model. The conditioning module that integrates audio and visual cues is also a noteworthy innovation that enables more effective cross-modal source selection.
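The rollout idea can be illustrated as an auxiliary term added to the standard CFM regression objective: Euler-integrate the learned vector field from an intermediate time to t = 1 and penalize the endpoint's drift from the clean target. The sketch below is one plausible reading under linear interpolation paths and a fixed number of Euler steps, not the paper's exact formulation.

```python
import torch

def cfm_with_rollout_loss(v_theta, x0, x1, cond, n_rollout_steps=4, lam=0.1):
    """Standard CFM regression plus a rollout penalty at the final step.

    v_theta(x, t, cond) predicts the velocity field. The rollout term Euler-
    integrates the learned field from an intermediate time to t=1 and
    penalizes drift of the endpoint from the clean target x1.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).view(b, *([1] * (x0.dim() - 1)))
    x_t = (1 - t) * x0 + t * x1                 # linear interpolation path
    target_v = x1 - x0
    cfm_loss = ((v_theta(x_t, t, cond) - target_v) ** 2).mean()

    # Roll the learned flow forward from (x_t, t) to t=1 with Euler steps.
    x, cur_t = x_t, t
    dt = (1 - t) / n_rollout_steps
    for _ in range(n_rollout_steps):
        x = x + dt * v_theta(x, cur_t, cond)
        cur_t = cur_t + dt
    rollout_loss = ((x - x1) ** 2).mean()        # drift at the final step

    return cfm_loss + lam * rollout_loss
```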
The paper provides extensive quantitative and qualitative evaluations, demonstrating that the CFM framework consistently outperforms existing state-of-the-art methods. The experimental design appears robust, utilizing a variety of datasets to validate the effectiveness of the proposed approach. However, specific details regarding the datasets used and the metrics for evaluation could be elaborated upon to strengthen the findings.
The paper lacks detailed implementation specifics that would facilitate reproducibility. While the methodology is described, there are no links to code repositories or supplementary materials that would allow other researchers to replicate the experiments. Providing such resources would significantly enhance the paper's impact and utility in the research community.
One limitation is the potential for the model to overfit to the training data, especially given the complexity of the generative task. Additionally, the paper does not address the computational efficiency of the proposed method, which could be a concern for real-time applications. The reliance on visual cues may also limit the model's applicability in scenarios where visual information is not available or is of low quality.
The implications of this research are substantial, particularly in fields such as multimedia content creation, virtual reality, and assistive technologies for the hearing impaired. By improving the alignment of audio and visual elements, the proposed framework could enhance user experiences in various applications, making it a valuable contribution to the intersection of audio processing and machine learning. The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative approaches. This work significantly advances the field by providing a more coherent and integrated method for audio-visual alignment, with promising applications across multiple domains.
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
Primary: Institut Polytechnique de Paris
All Institutions: Meta, Institut Polytechnique de Paris
The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative models. The innovative methodology, combined with promising experimental results, positions this work as a significant advancement in the intersection of audio and visual machine learning.
The proposed Conditional Flow Matching (CFM) framework represents a significant methodological shift from traditional discriminative models to a generative approach for visually-guided acoustic highlighting. The introduction of a rollout loss to mitigate error propagation in iterative flow-based generation is an innovative solution to a common problem in generative modeling. Additionally, the conditioning module that integrates audio and visual cues before vector field regression is a thoughtful enhancement that allows for explicit cross-modal source selection, which is crucial for the task at hand.
The authors conducted extensive quantitative and qualitative evaluations, demonstrating that their method consistently outperforms the previous state-of-the-art discriminative approach. However, the paper would benefit from a more detailed description of the datasets used, including their size, diversity, and relevance to the task. The evaluation metrics employed should also be clearly defined to allow for reproducibility and comparison with future work.
The paper lacks sufficient implementation details that would allow other researchers to reproduce the results. While the methodology is described, specifics regarding hyperparameters, training procedures, and the computational resources used are not provided. Including a supplementary material section with this information or a link to a code repository would significantly enhance reproducibility.
One limitation of the proposed method is its reliance on the quality of the visual input, which may not always be reliable in real-world scenarios. Additionally, the complexity of the model may lead to longer inference times, which could be a drawback for real-time applications. The authors should also address potential overfitting issues, especially given the generative nature of the approach.
The implications of this research extend beyond audio-visual alignment, potentially influencing fields such as multimedia content creation, augmented reality, and assistive technologies for the hearing impaired. By improving the coherence between audio and visual stimuli, this work could enhance user experiences in various applications, making it a valuable contribution to the field. The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative models. The innovative methodology, combined with promising experimental results, positions this work as a significant advancement in the intersection of audio and visual machine learning.
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of physiological sounds caused by the occlusion effect; however, existing approaches often fail under real-world noise or rely on computationally expensive models. We present EarResp-ANS, the first system enabling fully on-device, real-time RR estimation on commercial earphones. The system employs LMS-based adaptive noise suppression (ANS) to attenuate ambient noise while preserving respiration-related acoustic components, without requiring neural networks or audio streaming, thereby explicitly addressing the energy and privacy constraints of wearable devices. We evaluate EarResp-ANS in a study with 18 participants under realistic acoustic conditions, including music, cafeteria noise, and white noise up to 80 dB SPL. EarResp-ANS achieves robust performance with a global MAE of 0.84 CPM, reduced to 0.47 CPM via automatic outlier rejection, while operating with less than 2% processor load directly on the earphone.
Primary: Karlsruhe Institute of Technology
All Institutions: Karlsruhe Institute of Technology
The main contribution of this paper is the development of EarResp-ANS, a novel system for real-time respiration rate estimation using in-ear audio sensing, which effectively addresses noise interference and energy constraints in wearable devices. This work represents a meaningful advancement in the field of unobtrusive health monitoring technologies, combining innovative signal processing techniques with practical applications in everyday life.
The methodology presented in EarResp-ANS is innovative, leveraging LMS-based adaptive noise suppression to enhance the accuracy of respiration rate estimation from in-ear audio signals. The decision to avoid neural networks and audio streaming is commendable, as it addresses energy efficiency and privacy concerns, which are critical in wearable technology. The paper provides a clear description of the signal processing techniques used, although further details on the implementation specifics would enhance understanding.
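As a point of reference, a classic LMS adaptive noise canceller subtracts, sample by sample, the portion of the in-ear signal that is linearly predictable from an ambient-noise reference. The sketch below is a generic textbook version; the filter length, step size, and choice of reference microphone are assumptions rather than EarResp-ANS's actual configuration.

```python
import numpy as np

def lms_noise_suppression(primary, reference, n_taps=64, mu=1e-3):
    """Classic LMS adaptive noise canceller.

    primary   : in-ear signal = respiration + leaked ambient noise
    reference : ambient-noise reference (e.g. an outer microphone)
    Returns the error signal, i.e. primary with the predictable noise removed.
    """
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        noise_est = w @ buf
        e = primary[n] - noise_est          # residual ≈ respiration component
        w += 2 * mu * e * buf               # LMS weight update
        out[n] = e
    return out
```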
The experimental setup is robust, involving 18 participants and testing under various realistic acoustic conditions. The reported results, including a global MAE of 0.84 CPM and improved performance with outlier rejection, demonstrate the system's effectiveness. However, the sample size could be considered limited for broader generalizability, and additional metrics could provide a more comprehensive performance evaluation.
The paper lacks sufficient detail regarding the implementation of the system, which could hinder reproducibility. While the methodology is described, specific parameters, configurations, and the dataset used for training and validation are not thoroughly detailed, making it challenging for other researchers to replicate the study.
One limitation is the relatively small participant pool, which may not capture the variability in respiration rates across different demographics. Additionally, the performance under extreme noise conditions could be further explored, as the current evaluation focuses on a limited range of acoustic environments.
The potential applications of this technology are significant, particularly in health monitoring and wellness, as it allows for unobtrusive and continuous monitoring of a vital sign that is often overlooked. The system's design prioritizes user privacy and energy efficiency, making it suitable for widespread adoption in consumer devices. The main contribution of this paper is the development of EarResp-ANS, a novel system for real-time respiration rate estimation using in-ear audio sensing, which effectively addresses noise interference and energy constraints in wearable devices. This work represents a meaningful advancement in the field of unobtrusive health monitoring technologies, combining innovative signal processing techniques with practical applications in everyday life.
Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluated GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as recordings of real-world environments and release these two complementary benchmark task suites: NatHEAR and RealSELD. Our results demonstrate that GRAM outperforms all state-of-the-art self-supervised audio foundation models on NatHEAR and the clean, single-channel version HEAR, while using only a fraction of the training data. GRAM also shows state-of-the-art localization performance in simulated environments and generalizes efficiently to real-world recordings in RealSELD. Taken together, GRAM presents a significant advance toward robust spatial audio foundation models for real-world environments.
Primary: Donders Institute, Radboud University
All Institutions: Donders Institute, Radboud University, Mortimer B Zuckerman Institute, Columbia University
The paper presents GRAM, a significant advancement in spatial audio representation, demonstrating state-of-the-art performance in real-world environments while addressing the limitations of existing audio foundation models. The comprehensive methodology and rigorous evaluation contribute to its potential impact on the field of machine learning and audio processing.
The paper presents GRAM, a multi-channel masked autoencoder designed to learn spatial audio representations. The methodology is well-structured, employing a novel training pipeline that utilizes high-quality simulations of real-world sound environments. The use of a masked autoencoder to reconstruct spatial audio features is innovative, particularly in the context of audio foundation models, which typically overlook spatial dimensions. The introduction of two benchmark suites, NatHEAR and RealSELD, adds significant value by providing standardized evaluation metrics for audio models in complex environments.
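The masking stage of such a multi-channel masked autoencoder can be illustrated by randomly hiding time-frequency patches of a multi-channel spectrogram before encoding; the patch size and mask ratio in the sketch below are placeholders, not GRAM's reported settings.

```python
import torch

def random_patch_mask(spec, patch=(16, 16), mask_ratio=0.75):
    """Randomly mask time-frequency patches of a multi-channel spectrogram.

    spec: (channels, freq, time) tensor whose freq/time dims are divisible by
    the patch size. Returns the masked spectrogram and the boolean patch mask.
    """
    c, f, t = spec.shape
    pf, pt = patch
    patches = spec.reshape(c, f // pf, pf, t // pt, pt)
    keep = torch.rand(f // pf, t // pt) > mask_ratio       # True = visible patch
    masked = patches * keep[None, :, None, :, None]
    return masked.reshape(c, f, t), keep

# Example: 4-channel mel-like input, 128 bins x 256 frames
masked, mask = random_patch_mask(torch.randn(4, 128, 256))
```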
The experiments are comprehensive, comparing GRAM against state-of-the-art models across various tasks in both simulated and real-world environments. The results demonstrate GRAM's superior performance in sound localization and general-purpose audio representation tasks, achieving state-of-the-art results while requiring less training data. The inclusion of ablation studies further strengthens the evaluation by providing insights into the impact of different model components and training strategies.
The paper provides sufficient details regarding the training process, model architecture, and evaluation metrics, which enhances reproducibility. The authors have made their code and datasets available, which is a positive aspect for the community. However, some specific hyperparameter settings and configurations could be more explicitly detailed to facilitate easier replication of results.
One limitation noted is the inadequate resolution of mel-spectrograms for binaural inputs, which may have impacted localization performance. Additionally, while the model shows promise in real-world applications, its performance in highly complex acoustic environments with significant noise interference remains to be fully explored.
The advancements made by GRAM could significantly impact various applications, including audio-visual scene understanding, robotics, and ambient intelligence systems. By improving the robustness of audio models in real-world environments, this work could enhance user experiences in smart environments and contribute to the development of more sophisticated auditory perception systems. The paper presents GRAM, a significant advancement in spatial audio representation, demonstrating state-of-the-art performance in real-world environments while addressing the limitations of existing audio foundation models. The comprehensive methodology and rigorous evaluation contribute to its potential impact on the field of machine learning and audio processing.
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism as time evolves to jointly model these dynamics, enabling it to identify transient acoustic events via salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale Mr.HiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this work is the introduction of a dual-pathway audio encoder that effectively captures both semantic and dynamic audio features for improved video highlight detection. This innovative approach not only sets a new benchmark in performance but also addresses critical limitations in existing methodologies, paving the way for future research in audio-visual learning.
The proposed methodology, DAViHD, introduces a dual-pathway audio encoder that effectively disentangles audio signals into semantic and dynamic components. This innovative approach allows for a more nuanced understanding of audio features, addressing a significant gap in existing models that often overlook the dynamic characteristics of sound. The use of frequency-adaptive mechanisms and the integration of self-attention in the audio feature fusion process are notable advancements that enhance the model's ability to capture salient moments in videos.
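One simple proxy for the transient, spectro-temporal dynamics the dynamic pathway is described as capturing is per-band positive spectral flux; the band split and flux definition in the sketch below are illustrative assumptions, not the paper's actual mechanism.

```python
import numpy as np
import librosa

def band_spectral_flux(y, n_fft=1024, hop=256, n_bands=8):
    """Per-band positive spectral flux: rapid energy increases within coarse
    frequency bands, a rough stand-in for transient acoustic events.
    """
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))        # (freq, time)
    bands = np.array_split(S, n_bands, axis=0)                      # coarse bands
    flux = [np.maximum(np.diff(b.sum(axis=0), prepend=0.0), 0.0) for b in bands]
    return np.stack(flux)                                           # (n_bands, time)
```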
The experimental setup is robust, utilizing large-scale datasets (TVSum and Mr.HiSum) to validate the proposed model. The results demonstrate significant improvements over baseline models, achieving state-of-the-art performance metrics. The thorough comparison against various existing methods, including both audio-visual and visual-only models, strengthens the credibility of the findings. Additionally, the ablation studies provide clear insights into the contributions of different components of the model.
The paper provides detailed implementation details, including the architecture of the model, training parameters, and the datasets used. This level of transparency is crucial for reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results.
While the paper presents a compelling case for the dual-pathway approach, it does not extensively discuss potential limitations or scenarios where the model may underperform. Additionally, the reliance on pre-trained models for feature extraction could introduce biases from those models, which should be acknowledged.
The advancements in audio-visual highlight detection have significant implications for various applications, including content summarization, video retrieval, and recommendation systems. By improving the understanding of audio dynamics, this research could enhance user experiences in multimedia applications, making it a valuable contribution to the field. The main contribution of this work is the introduction of a dual-pathway audio encoder that effectively captures both semantic and dynamic audio features for improved video highlight detection. This innovative approach not only sets a new benchmark in performance but also addresses critical limitations in existing methodologies, paving the way for future research in audio-visual learning.
Designing front-ends for speech deepfake detectors primarily focuses on two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), integrating wavelets with nonlinearities analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high-frequency and directional resolutions ($Q, L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland, Université PSL, Université de Paris, University of Chinese Academy of Sciences, University of Toronto
The WST-X series presents a novel and effective approach to speech deepfake detection by leveraging wavelet scattering transforms and self-supervised learning features. This work significantly advances the field by addressing the critical need for interpretable and robust detection methods in audio forensics.
The paper introduces the WST-X series, a novel approach that effectively combines wavelet scattering transforms with self-supervised learning features for speech deepfake detection. The methodology is well-structured, detailing the theoretical foundations of the wavelet scattering transform and its integration with SSL features. The dual-branch architecture (WST-X1 and WST-X2) is innovative, allowing for both parallel and cascaded processing of features, which enhances the model's ability to capture subtle acoustic artifacts. The careful selection of parameters (J, Q, M for 1D and J, L, M for 2D) demonstrates a thorough understanding of the underlying signal characteristics and their relevance to deepfake detection.
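Since the front-end builds on the wavelet scattering transform, a minimal 1-D scattering feature extractor can be sketched with Kymatio as below; the parameter values are placeholders chosen only to show where J and Q enter, not the settings reported in the paper.

```python
import numpy as np
from kymatio.numpy import Scattering1D

# 1-D wavelet scattering front-end. J sets the averaging scale (2**J samples)
# and Q the number of wavelets per octave; the paper's analysis suggests small
# J with high Q matters, but these exact numbers are illustrative.
T = 16000                                   # one second at 16 kHz
x = np.random.randn(T).astype(np.float32)   # stand-in for a speech clip

scattering = Scattering1D(J=6, shape=T, Q=8)
Sx = scattering(x)                          # scattering coefficient matrix
print(Sx.shape)
```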
The experimental setup is robust, utilizing the challenging Deepfake-Eval-2024 dataset, which is representative of real-world scenarios. The performance metrics chosen (minDCF, EER, F1-score, AUC) are appropriate for evaluating the effectiveness of the proposed methods. The results indicate significant performance improvements over traditional feature extraction methods, showcasing the advantages of the WST-X series in capturing fine-grained spectral anomalies. However, the paper could benefit from more extensive comparisons with other state-of-the-art methods beyond the baseline features mentioned.
The paper provides sufficient detail on the implementation of the WST-X series, including the choice of libraries (Kymatio, Librosa) and model configurations. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider making the code accessible to facilitate further research and validation.
One limitation is the reliance on the Deepfake-Eval-2024 dataset, which may not encompass all potential variations in deepfake generation techniques. Additionally, while the paper emphasizes interpretability, the complexity of the model may still pose challenges in fully understanding the decision-making process of the classifier. The paper does not address potential overfitting issues that may arise from the high-dimensional feature space.
The proposed WST-X series has significant implications for audio forensics and the detection of deepfake technologies, which are increasingly relevant in today's digital landscape. By improving the interpretability and robustness of speech deepfake detection systems, this work contributes to the ongoing efforts to combat misinformation and ensure the integrity of audio content. The WST-X series presents a novel and effective approach to speech deepfake detection by leveraging wavelet scattering transforms and self-supervised learning features. This work significantly advances the field by addressing the critical need for interpretable and robust detection methods in audio forensics.
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods largely leave the cross-modal projection module (Projector) undistilled and often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, a novel approach that highlights important time steps and addresses dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher to a compact 1.1B-parameter student, consistently outperforming the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
Primary: Harbin Institute of Technology
All Institutions: Harbin Institute of Technology, Ping An Technology (Shenzhen) Co
The paper presents a novel knowledge distillation framework, PL-Distill, that effectively compresses large audio-language models for speech emotion recognition while maintaining high performance. The innovative methodologies and comprehensive experimental evaluations contribute significantly to the advancement of knowledge distillation techniques in multimodal machine learning.
The proposed PL-Distill framework introduces a dual-level knowledge distillation approach that effectively addresses the challenges of distilling large audio-language models for speech emotion recognition. The incorporation of Attention-weighted Centered Kernel Alignment (AwCKA) is particularly innovative, as it dynamically prioritizes important audio tokens based on attention scores, thereby enhancing the alignment of audio embeddings despite dimensional mismatches. This methodological advancement is well-justified in the context of previous work and represents a significant contribution to the field of knowledge distillation in multimodal models.
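AwCKA is the authors' construct and its exact definition is not reproduced here; one plausible reading is plain linear CKA with each time step's contribution weighted by an attention score, which also sidesteps the teacher-student dimension mismatch because CKA compares Gram structure rather than raw coordinates. The sketch below follows that reading.

```python
import torch

def attention_weighted_linear_cka(X, Y, attn):
    """Linear CKA between teacher and student token sequences, with each time
    step weighted by an attention score.

    X: (T, d_teacher), Y: (T, d_student), attn: (T,) nonnegative scores.
    Works for d_teacher != d_student since only Gram structure is compared.
    """
    w = (attn / attn.sum()).sqrt().unsqueeze(1)          # (T, 1) sqrt-weights
    Xw = w * (X - X.mean(dim=0, keepdim=True))           # center, then weight rows
    Yw = w * (Y - Y.mean(dim=0, keepdim=True))
    cross = torch.linalg.matrix_norm(Yw.T @ Xw) ** 2     # Frobenius norms
    norm_x = torch.linalg.matrix_norm(Xw.T @ Xw)
    norm_y = torch.linalg.matrix_norm(Yw.T @ Yw)
    return cross / (norm_x * norm_y + 1e-8)

# A distillation term could then be 1 - attention_weighted_linear_cka(t_emb, s_emb, attn)
```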
The experimental evaluation is robust, utilizing three widely recognized datasets (IEMOCAP, RAVDESS, and SAVEE) to validate the effectiveness of the proposed method. The results demonstrate that PL-Distill not only compresses the teacher model significantly but also outperforms both the teacher and state-of-the-art models across all metrics. The ablation studies further substantiate the contributions of each component of the framework, providing a clear understanding of the impact of the proposed methods.
The paper provides detailed descriptions of the model architecture, training strategies, and evaluation metrics, which are essential for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully generalize to other SER tasks or datasets. Additionally, while the method shows promise, the computational efficiency of the distillation process itself could be further explored to ensure practical applicability in real-world scenarios.
The implications of this research extend beyond speech emotion recognition, as the PL-Distill framework could be adapted for various audio-language tasks, potentially improving the efficiency of deploying large models in resource-constrained environments. The focus on effective knowledge transfer in multimodal contexts may also inspire future research in related areas. The paper presents a novel knowledge distillation framework, PL-Distill, that effectively compresses large audio-language models for speech emotion recognition while maintaining high performance. The innovative methodologies and comprehensive experimental evaluations contribute significantly to the advancement of knowledge distillation techniques in multimodal machine learning.
We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetic evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and strong zero-shot transfer to 95 unseen languages. HuPER is also the first framework to enable adaptive, multi-path phonetic perception under diverse acoustic conditions. All training data, models, and code are open-sourced. Code and demo available at https://github.com/HuPER29/HuPER.
Primary: University of California, Berkeley
All Institutions: Zhejiang University, University of California, Berkeley
HuPER presents a novel framework for phonetic perception that integrates adaptive inference with acoustic and linguistic knowledge, achieving state-of-the-art performance with limited training data. The methodology is robust, and the implications for practical applications in speech technology are substantial, marking a significant advancement in the field.
The methodology proposed in HuPER is innovative as it frames phonetic perception as adaptive inference, integrating both acoustic-phonetic evidence and linguistic knowledge. The four-stage training pipeline is well-structured, starting from a small annotated corpus and leveraging a larger transcript-only corpus for pseudo-label generation. The use of a Corrector model to learn edit operations is particularly noteworthy, as it enhances the robustness of the phonetic recognizer. This adaptive approach allows for multi-path phonetic perception under varying acoustic conditions, which is a significant advancement in the field.
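The edit-operation supervision for such a corrector can be illustrated with a standard minimal-edit alignment between a reference and a hypothesized phone sequence; the sketch below shows the generic algorithm and labelling, while HuPER's actual scheme may differ.

```python
def edit_operations(ref, hyp):
    """Minimal-edit alignment between two phone sequences, returning
    (operation, ref_phone, hyp_phone) steps in order."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(("match" if ref[i - 1] == hyp[j - 1] else "substitute",
                        ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, hyp[j - 1]))
            j -= 1
    return list(reversed(ops))

print(edit_operations(["k", "ae", "t"], ["k", "aa", "t", "s"]))
```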
The experiments conducted are comprehensive, with the framework achieving state-of-the-art phonetic error rates across five English benchmarks and demonstrating strong zero-shot transfer capabilities to 95 unseen languages. The choice of datasets and benchmarks appears appropriate for validating the performance claims. However, more detailed comparisons with existing state-of-the-art methods would strengthen the evaluation.
The authors have made all training data, models, and code open-sourced, which is commendable and enhances reproducibility. The provided GitHub repository allows other researchers to replicate the experiments and build upon the work. However, additional documentation on the training process and hyperparameter settings would further facilitate reproducibility.
One limitation of the study is the reliance on the initial small human-annotated corpus, which may not capture the full diversity of phonetic variations across different languages. Additionally, while the zero-shot transfer to 95 languages is impressive, the paper does not provide extensive analysis on the performance across these languages, which could vary significantly in phonetic structure.
The potential applications of HuPER are vast, particularly in assistive technologies for education, healthcare, and accessibility. By improving the reliability of phonetic representations, the framework could lead to more effective communication tools for diverse populations. The work also lays a foundation for future developments in speech generation systems, making it a significant contribution to the field of speech and language technologies. HuPER presents a novel framework for phonetic perception that integrates adaptive inference with acoustic and linguistic knowledge, achieving state-of-the-art performance with limited training data. The methodology is robust, and the implications for practical applications in speech technology are substantial, marking a significant advancement in the field.
Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, providing valuable applications in situations where audio signals are unavailable or degraded. While recent diffusion-based models such as LipVoicer have demonstrated impressive performance in reconstructing linguistic content, they often lack prosodic consistency. In this work, we propose LipSody, a lip-to-speech framework enhanced for prosody consistency. LipSody introduces a prosody-guiding strategy that leverages three complementary cues: speaker identity extracted from facial images, linguistic content derived from lip movements, and emotional context inferred from face video. Experimental results demonstrate that LipSody substantially improves prosody-related metrics, including global and local pitch deviations, energy consistency, and speaker similarity, compared to prior approaches.
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this work is the introduction of LipSody, a novel lip-to-speech synthesis framework that enhances prosody consistency through a multi-faceted approach to visual input. This paper represents a meaningful advancement in the field of audio synthesis, providing a robust methodology and comprehensive evaluation that could influence future research and applications in multimodal speech synthesis.
The methodology presented in LipSody is innovative, leveraging a diffusion-based framework to enhance prosody consistency in lip-to-speech synthesis. The authors introduce a novel prosody-guiding strategy that integrates speaker identity, linguistic content, and emotional context, which is a significant advancement over previous models that primarily focused on intelligibility. The use of complementary cues for prosody estimation is a thoughtful approach that enhances the model's ability to generate more natural and expressive speech. The architecture is well-structured, utilizing established deep learning techniques while introducing new components like the Emotion Encoder to refine prosody prediction.
The experimental evaluation is thorough, utilizing a large dataset (LRS3) and employing both objective and subjective metrics to assess performance. The results demonstrate significant improvements in prosody-related metrics compared to prior models, while maintaining intelligibility. The use of statistical tests to validate the significance of improvements adds rigor to the findings. However, the paper could benefit from additional comparisons with more recent models beyond LipVoicer to contextualize its contributions further.
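As one concrete way to quantify the global pitch deviation reported above, the sketch below measures the mean absolute log-F0 difference over commonly voiced frames using librosa's pYIN tracker; the F0 range and the exact metric definition are assumptions, since the paper's formulas are not reproduced here.

```python
import numpy as np
import librosa

def global_pitch_deviation(y_gen, y_ref, sr=16000):
    """Mean absolute log-F0 difference over frames voiced in both signals."""
    f0_g, v_g, _ = librosa.pyin(y_gen, fmin=65, fmax=400, sr=sr)
    f0_r, v_r, _ = librosa.pyin(y_ref, fmin=65, fmax=400, sr=sr)
    n = min(len(f0_g), len(f0_r))
    voiced = v_g[:n] & v_r[:n]                       # frames voiced in both
    if not voiced.any():
        return float("nan")
    return float(np.mean(np.abs(np.log(f0_g[:n][voiced]) - np.log(f0_r[:n][voiced]))))
```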
The paper provides detailed implementation specifics, including model architecture, training protocols, and evaluation metrics, which support reproducibility. The authors mention using publicly available codebases for components like the Emotion Encoder and vocoder, which enhances the potential for others to replicate their work. However, the lack of a publicly available code repository for the entire LipSody framework limits full reproducibility.
One limitation is the reliance on the LRS3 dataset, which may not encompass the full diversity of lip movements and emotional expressions found in real-world scenarios. Additionally, while the model shows improvements in prosody consistency, the subjective evaluations indicate that the differences in naturalness are not statistically significant, suggesting that further enhancements could be explored. The model's performance in diverse acoustic environments or with different speaker demographics remains untested.
LipSody has significant potential applications in areas such as assistive technologies for the hearing impaired, silent communication tools, and enhancing multimedia content accessibility. The ability to generate expressive and personalized speech from visual input could also benefit virtual avatars and gaming industries, where realistic character interactions are crucial. The advancements in prosody consistency could lead to more engaging and relatable AI-generated speech, fostering better human-computer interactions. The main contribution of this work is the introduction of LipSody, a novel lip-to-speech synthesis framework that enhances prosody consistency through a multi-faceted approach to visual input. This paper represents a meaningful advancement in the field of audio synthesis, providing a robust methodology and comprehensive evaluation that could influence future research and applications in multimodal speech synthesis.
Supervised speech enhancement methods have been very successful. However, in practical scenarios, there is a lack of clean speech, and self-supervised learning-based (SSL) speech enhancement methods that offer comparable enhancement performance and can be applied to other speech-related downstream applications are desired. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like $log1p$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
Primary: University of Illinois, Urbana-Champaign
All Institutions: University of Illinois, Urbana-Champaign, AWS AI Labs
The main contribution of this paper is the development of a masked autoencoder framework for universal speech enhancement that effectively handles multiple distortions through self-supervised learning. This work presents a novel approach that not only advances the state of the art in speech enhancement but also opens avenues for further research in self-supervised learning applications in audio processing.
The paper introduces a masked autoencoder framework for speech enhancement that is both self-supervised and capable of handling various distortions. The methodology is well-structured, leveraging an augmentation stack to introduce additional noise, which is a clever approach to pre-training. The dual focus on denoising and dereverberation tasks demonstrates versatility. However, the paper could benefit from a more thorough comparison with existing methods beyond the baseline, as well as a clearer explanation of the specific architecture choices made in the masked autoencoder.
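Two of the moving parts discussed above lend themselves to a short sketch: the log1p-compressed noisy input representation and the augmentation stack that layers an extra distortion on top of the already-noisy input for the model to undo. STFT sizes, the SNR of the added distortion, and the single-noise mixing scheme below are illustrative assumptions.

```python
import numpy as np
import librosa

def log1p_spectrogram(y, n_fft=512, hop=128, alpha=1.0):
    """log1p-compressed magnitude spectrogram as the noisy input feature."""
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    return np.log1p(alpha * mag)

def add_augmentation(y, noise, snr_db=5.0):
    """Mix an extra distortion onto the already-noisy input at a target SNR,
    mimicking one step of the augmentation stack (the real stack also covers
    reverberation and other distortions)."""
    noise = noise[: len(y)]
    scale = np.sqrt((y ** 2).mean() / ((noise ** 2).mean() * 10 ** (snr_db / 10)))
    return y + scale * noise
```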
The experiments are comprehensive, evaluating the model on both in-domain and out-of-domain datasets, which is crucial for assessing generalizability. The results indicate that the proposed method achieves state-of-the-art performance, which is a significant contribution. However, the paper lacks detailed statistical analysis of the results, such as confidence intervals or significance testing, which would strengthen the claims made.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. While the methodology is described, the absence of a clear protocol for reproducing the results limits the ability of other researchers to validate the findings.
One limitation is the reliance on a small amount of paired data for fine-tuning, which may not be feasible in all practical scenarios. Additionally, the paper does not address potential biases in the datasets used for evaluation, which could affect the generalizability of the results.
The proposed method has significant implications for real-world applications in speech enhancement, particularly in environments with varying types of noise. The ability to enhance speech across different distortions makes it a valuable tool for improving communication in challenging acoustic settings, such as in teleconferencing or assistive technologies for the hearing impaired. The main contribution of this paper is the development of a masked autoencoder framework for universal speech enhancement that effectively handles multiple distortions through self-supervised learning. This work presents a novel approach that not only advances the state of the art in speech enhancement but also opens avenues for further research in self-supervised learning applications in audio processing.
Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech enhancement (ParaGSE) framework that leverages a group vector quantization (GVQ)-based neural speech codec. The GVQ-based codec adopts separate VQs to produce mutually independent tokens, enabling efficient parallel token prediction in ParaGSE. Specifically, ParaGSE leverages the GVQ-based codec to encode degraded speech into distinct tokens, predicts the corresponding clean tokens through parallel branches conditioned on degraded spectral features, and ultimately reconstructs clean speech via the codec decoder. Experimental results demonstrate that ParaGSE consistently produces superior enhanced speech compared to both discriminative and generative baselines, under a wide range of distortions including noise, reverberation, band-limiting, and their mixtures. Furthermore, empowered by parallel computation in token prediction, ParaGSE attains about a 1.5-fold improvement in generation efficiency on CPU compared with serial generative speech enhancement approaches.
Primary: University of Science and Technology of China
All Institutions: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China
The paper presents ParaGSE, a novel framework for parallel generative speech enhancement that leverages a GVQ-based neural speech codec to achieve significant improvements in speech quality and processing efficiency. The technical contributions are substantial, addressing key challenges in the field and demonstrating the potential for practical applications in real-world scenarios.
The proposed methodology, ParaGSE, introduces a novel framework for generative speech enhancement that utilizes a group vector quantization (GVQ)-based neural speech codec. This approach is innovative in its use of separate VQs for independent token generation, which facilitates efficient parallel computation. The architecture is well-structured, employing a combination of convolutional layers, BiLSTM, and Conformer blocks to extract features and predict clean tokens. The methodology is sound, with a clear explanation of the components and their interactions, although it could benefit from more detailed comparisons with existing methods in terms of computational complexity.
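The core GVQ idea, as described, is to split the latent into groups and quantize each group against its own codebook so that the resulting token streams can be predicted in parallel; the sketch below shows that nearest-codeword lookup with placeholder group counts and codebook sizes.

```python
import torch

def group_vector_quantize(z, codebooks):
    """Split a latent vector into groups and quantize each group with its own
    codebook, yielding independent token streams.

    z: (batch, dim) latent; codebooks: list of (codebook_size, dim / groups).
    Returns per-group token indices of shape (batch, groups).
    """
    groups = len(codebooks)
    chunks = z.chunk(groups, dim=-1)                       # split the feature dim
    tokens = []
    for chunk, cb in zip(chunks, codebooks):
        dists = torch.cdist(chunk, cb)                     # (batch, codebook_size)
        tokens.append(dists.argmin(dim=-1))                # nearest codeword id
    return torch.stack(tokens, dim=-1)

# Usage: 4 groups, 256 codewords each, 128-dim latent
codebooks = [torch.randn(256, 32) for _ in range(4)]
print(group_vector_quantize(torch.randn(8, 128), codebooks).shape)  # (8, 4)
```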
The experimental evaluation is robust, featuring a comprehensive set of experiments that assess the performance of ParaGSE against various baseline models across multiple distortion types. The paper includes both objective and subjective metrics, providing a well-rounded view of the model's effectiveness. The dataset construction is thorough, utilizing real-world noise and reverberation conditions, which enhances the relevance of the findings. However, the paper could improve by including more detailed statistical analyses of the results and discussing the significance of the findings more explicitly.
The paper provides sufficient implementation details, including the architecture, training criteria, and experimental setup, which aids in reproducibility. The availability of codes and speech samples on the provided URL is a positive aspect, although the lack of a direct GitHub repository may limit accessibility for some researchers.
One limitation is the potential complexity of the model, which may hinder deployment in real-time applications. Additionally, while the paper claims efficiency improvements, it does not provide a detailed comparison of the computational costs associated with the proposed method versus the baselines, which could be crucial for practical applications. There is also a noted performance gap in intrusive metrics like LSD compared to discriminative models, which could be a concern for certain applications.
The proposed ParaGSE framework has significant potential for real-world applications in speech enhancement, particularly in environments with various distortions. Its efficiency and ability to produce high-quality speech restoration could benefit communication technologies, assistive listening devices, and speech recognition systems. The advancements in generative models for speech enhancement also contribute to the broader field of audio processing and machine learning. The paper presents ParaGSE, a novel framework for parallel generative speech enhancement that leverages a GVQ-based neural speech codec to achieve significant improvements in speech quality and processing efficiency. The technical contributions are substantial, addressing key challenges in the field and demonstrating the potential for practical applications in real-world scenarios.
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
Primary: Addis Ababa University
All Institutions: Addis Ababa University, Makerere University, University of Ghana, Digital Umuganda, Media Trust
The WAXAL dataset represents a significant advancement in addressing the digital divide for Sub-Saharan African languages. The comprehensive methodology and ethical considerations underscore its potential to foster inclusive technological development and support linguistic diversity in speech technology.
The methodology for data collection is robust, involving partnerships with local institutions to ensure cultural relevance and linguistic accuracy. The use of image-prompted speech for ASR data collection is innovative, as it encourages more natural speech patterns compared to scripted readings. The detailed steps in the transcription and quality control processes further enhance the dataset's reliability. The TTS data collection is also well-structured, focusing on high-quality recordings in a controlled environment. The collaborative approach with local experts is commendable and addresses ethical considerations effectively.
The paper provides a comprehensive overview of the dataset, including the amount of data collected and the diversity of languages represented. However, it lacks specific experimental results demonstrating the performance of models trained on this dataset, which would have strengthened the technical impact. The statistical analysis of the dataset is thorough, providing valuable insights into its composition, but the absence of comparative evaluations with existing datasets limits the assessment of its relative quality.
The paper outlines a clear methodology for data collection and processing, which aids reproducibility. However, it does not provide implementation details or code for the data collection process, which could hinder others from replicating the study. The dataset is openly accessible, which is a positive aspect for reproducibility in research.
The paper acknowledges several limitations, including transcription coverage and dialectal representation, which are significant in the context of the linguistic diversity of the region. The potential for unintended content in the ASR dataset is also a concern, despite quality control measures. Additionally, the dataset's focus on specific languages may not fully capture the linguistic richness of Sub-Saharan Africa.
The WAXAL dataset has the potential to significantly impact the development of speech technologies for underrepresented languages, promoting inclusivity in digital communication. By providing a large-scale resource, it can catalyze research and development in ASR and TTS systems, ultimately benefiting millions of speakers of these languages. The ethical considerations addressed in the paper also highlight the importance of responsible data usage, which is crucial in the context of AI and machine learning. The WAXAL dataset represents a significant advancement in addressing the digital divide for Sub-Saharan African languages. The comprehensive methodology and ethical considerations underscore its potential to foster inclusive technological development and support linguistic diversity in speech technology.
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at https://huggingface.co/datasets/google/WaxalNLP under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
Primary: Addis Ababa University
All Institutions: Addis Ababa University, Makerere University, University of Ghana, Digital Umuganda, Media Trust
The WAXAL dataset represents a significant advancement in addressing the scarcity of speech resources for Sub-Saharan African languages. Its comprehensive methodology and potential for fostering inclusive technologies underscore its importance in the field of machine learning and speech technology.
The methodology for data collection is robust, involving partnerships with local institutions to ensure cultural relevance and linguistic accuracy. The use of image-prompted speech for ASR data collection is innovative, as it encourages more natural speech compared to traditional scripted methods. The TTS data collection also follows a well-structured approach with phonetically balanced scripts and professional recording environments, which enhances the quality of the dataset.
The paper provides a comprehensive overview of the dataset, including the amount of data collected for each language and the diversity of speakers. However, it lacks detailed experimental results demonstrating the performance of models trained on the WAXAL dataset. While the dataset itself is a significant contribution, the absence of empirical evaluations limits the assessment of its effectiveness in real-world applications.
The paper outlines the data collection process and quality control measures, which are essential for reproducibility. However, it would benefit from providing more detailed information on the specific tools and techniques used for transcription and annotation, as well as any baseline models tested on the dataset.
The authors acknowledge several limitations, including the potential for dialectal representation issues and the risk of unintended content in the ASR dataset. Additionally, the ASR dataset is not well-suited for training high-quality single-speaker TTS models, which may restrict its applicability in certain contexts.
The WAXAL dataset has the potential to significantly impact the development of speech technologies for underrepresented languages in Sub-Saharan Africa. By providing a large-scale, openly accessible resource, it can catalyze research and development efforts aimed at bridging the digital divide for speakers of these languages. The ethical considerations addressed in the paper also highlight the importance of responsible data handling and community involvement.
The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from "garbage music". Curiously, we observe that the standard cross-entropy loss -- a core training metric -- often decreases when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce a noise injection experiment, in which controlled noise signals of varying lengths are injected into musical contexts. We hypothesize that a model's loss reacting positively to these perturbations, specifically a sharp increase ("Peak" area) for short injections, can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that Music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve -- rather than its absolute value -- encodes critical information about the quality of the generated content (i.e., model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality -- opening the door to more principled training objectives and sharper benchmarks.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a new evaluation framework for music LLMs based on the dynamics of loss curves, which reveals critical insights into model behavior and challenges existing evaluation paradigms. The findings underscore the need for a shift in how we assess musical quality in generative models, emphasizing the importance of understanding local dynamics over absolute loss values.
The paper introduces a novel methodology through the noise injection experiment, which challenges the conventional understanding of likelihood-based evaluation in music LLMs. The approach is well-structured, employing controlled noise perturbations to investigate the loss dynamics of models. The identification of the "Context Amnesia Effect" is a significant conceptual contribution, providing a new lens through which to understand model behavior in the presence of noise. The methodology is rigorous, with clear definitions and a systematic approach to analyzing loss dynamics.
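To make the loss-dynamics idea concrete, the following is a toy sketch of the noise-injection probe. It does not use MusicGen or codec tokens; a tiny LSTM language model and random token streams stand in so that the per-position cross-entropy profile, whose local shape the paper argues is the informative quantity, can be computed and compared before and after corruption.

```python
# Toy sketch of the noise-injection probe on a generic autoregressive token model.
# The tiny LSTM LM and random token streams are stand-ins, not MusicGen.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, T, inject_at, inject_len = 64, 256, 128, 16

class TinyLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, 32)
        self.rnn = torch.nn.LSTM(32, 64, batch_first=True)
        self.out = torch.nn.Linear(64, vocab)
    def forward(self, x):
        h, _ = self.rnn(self.emb(x))
        return self.out(h)

def loss_profile(model, tokens):
    """Per-position next-token cross-entropy (the 'loss curve' whose shape matters)."""
    logits = model(tokens[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1), reduction="none"
    ).reshape(tokens.shape[0], -1)

model = TinyLM().eval()
clean = torch.randint(0, vocab, (1, T))
corrupted = clean.clone()
corrupted[:, inject_at:inject_at + inject_len] = torch.randint(0, vocab, (1, inject_len))

with torch.no_grad():
    delta = loss_profile(model, corrupted) - loss_profile(model, clean)
# A sharp local "peak" around inject_at, rather than a higher mean loss, is the
# signature the paper proposes as a proxy for sensitivity to musical integrity.
print(delta[0, inject_at - 2:inject_at + inject_len + 2])
```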
The experiments are comprehensive, utilizing multiple datasets and various types of noise injections to validate the findings. The statistical analyses, including Pearson and Spearman correlation tests, lend credibility to the results, demonstrating significant trends across different models and datasets. However, the reliance on specific datasets, such as the ShutterStock corpus, may limit the generalizability of the findings to broader music contexts.
The paper provides sufficient detail regarding the experimental setup, including the parameters used for noise injection and the models evaluated. However, the lack of explicit information on the datasets and the absence of a publicly available code repository may hinder full reproducibility. The demo page offers some interactive elements, but a complete code release would enhance reproducibility.
One limitation is the focus on specific types of perturbations (noise and order shuffling), which may not encompass all forms of musical corruption. Additionally, the findings may not fully address the complexities of human judgment in music evaluation, as the study primarily relies on model behavior rather than direct comparisons with human assessments.
The implications of this work are significant for the field of music generation and evaluation. By highlighting the limitations of likelihood-based metrics, the paper paves the way for developing more robust evaluation frameworks that align better with human perceptions of musical quality. This could lead to advancements in training objectives for music LLMs and improve the overall quality of generated music.
Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator's task more challenging and improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. This framework effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in anomalous time-frequency localization.
Primary: Tsinghua University
All Institutions: Tsinghua University, Dalian Maritime University, Shenzhen International Graduate School
The main contribution of this paper is the introduction of TLDiffGAN, a novel framework that integrates latent diffusion models with GANs for improved anomalous sound detection. This work significantly advances the state of the art in the field by addressing key limitations of existing generative models and demonstrating superior performance through rigorous experimental validation.
The proposed TLDiffGAN framework innovatively combines latent diffusion models with GANs to enhance the quality of generated spectrograms for anomalous sound detection. The dual-branch architecture effectively integrates features from both raw audio and Mel spectrograms, addressing the limitations of traditional single-modality approaches. The introduction of the TMixup technique to augment temporal features is a significant methodological advancement, enhancing the model's sensitivity to subtle anomalies. However, the complexity of the model may pose challenges in terms of interpretability and practical deployment.
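The review names TMixup but does not spell out its formulation, so the following is only a plausible sketch of a temporal spectrogram mixup in that spirit: two normal-sound mel spectrograms are blended over a randomly chosen span of time frames rather than across the whole example, which is one way to emphasise subtle, localized temporal patterns.

```python
# Hedged sketch of a temporal spectrogram mixup in the spirit of the paper's TMixup.
# The exact formulation is an assumption; only a random temporal span is mixed.
import numpy as np

def temporal_mixup(spec_a, spec_b, alpha=0.4, rng=np.random.default_rng(0)):
    """spec_*: (n_mels, n_frames) arrays; mixes a random span of frames only."""
    assert spec_a.shape == spec_b.shape
    n_frames = spec_a.shape[1]
    lam = rng.beta(alpha, alpha)                       # mixup coefficient
    span = rng.integers(n_frames // 8, n_frames // 2)  # length of the mixed region
    start = rng.integers(0, n_frames - span)
    mixed = spec_a.copy()
    mixed[:, start:start + span] = (
        lam * spec_a[:, start:start + span] + (1.0 - lam) * spec_b[:, start:start + span]
    )
    return mixed, lam, (start, start + span)

a = np.random.randn(128, 313).astype(np.float32)  # placeholder mel spectrograms
b = np.random.randn(128, 313).astype(np.float32)
mixed, lam, region = temporal_mixup(a, b)
print(mixed.shape, round(float(lam), 3), region)
```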
The experiments conducted on the DCASE 2020 Challenge Task 2 dataset are extensive and demonstrate a clear improvement over existing methods in terms of AUC and pAUC metrics. The comparative analysis with other state-of-the-art methods provides strong evidence for the effectiveness of TLDiffGAN. The ablation studies further validate the contributions of each component, reinforcing the robustness of the proposed framework.
The paper provides detailed implementation details, including network configurations, training protocols, and evaluation metrics, which support reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which others can replicate the results.
One limitation is the reliance on a specific dataset (DCASE 2020) for evaluation, which may not fully capture the generalizability of the model across different domains or types of anomalous sounds. Additionally, the model's complexity could lead to challenges in real-time applications, particularly in resource-constrained environments.
The framework has significant implications for industrial applications, particularly in predictive maintenance and monitoring of machinery, where timely detection of anomalies can prevent failures and reduce downtime. The ability to localize anomalies in the time-frequency domain enhances interpretability, which is crucial for practitioners in the field.
We introduce and define a novel task, Scene-Aware Visually-Driven Speech Synthesis, aimed at addressing the limitations of existing speech generation models in creating immersive auditory experiences that align with the real physical world. To tackle the two core challenges of data scarcity and modality decoupling, we propose VividVoice, a unified generative framework. First, we constructed a large-scale, high-quality hybrid multimodal dataset, Vivid-210K, which, through an innovative programmatic pipeline, establishes a strong correlation between visual scenes, speaker identity, and audio for the first time. Second, we designed a core alignment module, D-MSVA, which leverages a decoupled memory bank architecture and a cross-modal hybrid supervision strategy to achieve fine-grained alignment from visual scenes to timbre and environmental acoustic features. Both subjective and objective experimental results provide strong evidence that VividVoice significantly outperforms existing baseline models in terms of audio fidelity, content clarity, and multimodal consistency. Our demo is available at https://chengyuann.github.io/VividVoice/.
Primary: Tsinghua University
All Institutions: Tsinghua University, Ant Group, Shenzhen International Graduate School
The main contribution of this paper is the introduction of a novel framework for scene-aware visually-driven speech synthesis, which significantly advances the field by addressing key challenges in multimodal alignment and data scarcity. The technical contributions, particularly the innovative dataset and alignment module, position this work as a meaningful advancement in audio synthesis research, although further detail on implementation and broader applicability is needed.
The proposed methodology, VividVoice, introduces a unified generative framework that addresses the challenges of data scarcity and modality decoupling in speech synthesis. The construction of the Vivid-210K dataset is a significant contribution, as it establishes a novel correlation between visual scenes, speaker identity, and audio. The D-MSVA alignment module is innovative, utilizing a decoupled memory bank architecture and hybrid supervision strategy, which enhances the model's ability to align visual and auditory modalities effectively. However, the paper could benefit from a more detailed description of the implementation and the specific algorithms used within the D-MSVA module.
The experimental evaluation is robust, featuring both subjective and objective assessments that demonstrate the superiority of VividVoice over existing baseline models. The results indicate improvements in audio fidelity, content clarity, and multimodal consistency, which are critical metrics in speech synthesis. However, the paper lacks a comprehensive comparison with a wider range of baseline models and does not provide enough detail on the experimental setup, such as the number of participants in subjective tests or the specific metrics used for objective evaluation.
The paper does not provide sufficient implementation details that would facilitate reproducibility. While it mentions the construction of the Vivid-210K dataset and the D-MSVA module, it lacks code availability or a clear description of the training process, hyperparameters, and evaluation protocols. This limits the ability of other researchers to replicate the results and build upon this work.
One limitation is the reliance on a single dataset (Vivid-210K), which may not generalize across different contexts or speaker demographics. Additionally, the paper does not address potential biases in the dataset or the implications of using a specific set of visual scenes. The complexity of the D-MSVA module may also pose challenges for real-time applications, which are critical in practical speech synthesis scenarios.
The implications of Scene-Aware Visually-Driven Speech Synthesis are significant, particularly in applications such as virtual reality, gaming, and assistive technologies. By creating more immersive auditory experiences that align with visual contexts, this research can enhance user engagement and accessibility. However, ethical considerations regarding the use of such technology, particularly in terms of deepfakes or misinformation, should be addressed.
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.
Primary: University of Melbourne
All Institutions: University of Melbourne
The main contribution of this paper is the introduction of HierCon, a hierarchical contrastive attention framework that significantly improves audio deepfake detection by effectively modeling temporal and inter-layer dependencies, thereby achieving state-of-the-art performance on benchmark datasets. This work represents a meaningful advancement in the field, addressing critical challenges in distinguishing between real and synthetic audio.
The paper introduces HierCon, a novel hierarchical layer attention framework that effectively captures temporal and inter-layer dependencies in audio deepfake detection. The methodology is well-structured, employing a three-stage attention mechanism that enhances the model's ability to discern subtle differences between real and synthetic audio. The integration of margin-based contrastive learning is particularly noteworthy, as it encourages the model to develop domain-invariant embeddings, thereby improving generalization across various deepfake generation techniques. The detailed explanation of the attention mechanism and the loss functions used provides a solid foundation for understanding the proposed approach.
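To illustrate the three-stage idea, the sketch below pools attention weights over frames, then over neighbouring layers within a group, then over layer groups. The dimensions, group boundaries, and the simple learned-score attention form are assumptions for illustration, not the authors' exact architecture, and the contrastive objective is omitted.

```python
# Simplified sketch of hierarchical attention pooling over SSL-layer features:
# frames -> layers within a group -> layer groups. Shapes and groupings are assumed.
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Learned-score attention pooling over the second-to-last dimension."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)
    def forward(self, x):                      # x: (..., N, dim)
        w = torch.softmax(self.score(x), dim=-2)
        return (w * x).sum(dim=-2)             # (..., dim)

class HierarchicalPool(nn.Module):
    def __init__(self, dim=768, groups=((0, 8), (8, 16), (16, 24))):
        super().__init__()
        self.groups = groups
        self.frame_pool = AttnPool(dim)
        self.layer_pool = AttnPool(dim)
        self.group_pool = AttnPool(dim)
    def forward(self, feats):                  # feats: (B, L, T, D) layer-wise features
        layer_emb = self.frame_pool(feats)     # (B, L, D): pool frames per layer
        group_embs = [self.layer_pool(layer_emb[:, s:e]) for s, e in self.groups]
        group_emb = torch.stack(group_embs, dim=1)   # (B, G, D)
        return self.group_pool(group_emb)      # (B, D) utterance embedding

feats = torch.randn(2, 24, 50, 768)            # e.g. 24 transformer layers, 50 frames
print(HierarchicalPool()(feats).shape)         # torch.Size([2, 768])
```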
The authors conduct thorough experiments on multiple datasets, including ASVspoof 2021 DF and In-the-Wild, demonstrating significant improvements over existing methods. The reported results, including Equal Error Rates (EER), clearly indicate the effectiveness of HierCon, achieving state-of-the-art performance. The inclusion of ablation studies further strengthens the findings, allowing for a clear understanding of the contributions of hierarchical attention and contrastive learning to the overall performance.
While the paper provides a comprehensive description of the methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a demo or project URL also limits the ability for others to validate the findings independently.
One limitation of the study is the reliance on specific datasets for evaluation, which may not fully capture the diversity of real-world audio deepfakes. Additionally, while the hierarchical attention mechanism is promising, the complexity of the model may pose challenges in terms of computational efficiency and scalability for real-time applications.
The implications of this research are significant, particularly in the context of security and online trust, as audio deepfakes pose increasing risks in various domains, including voice authentication and digital forensics. The proposed method has the potential to enhance the robustness of detection systems, contributing to the development of more secure communication technologies.
This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in word error rate (WER) across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Demo page: https://ssnapsicml.github.io/ssnapsicml2026/
Primary: Bar-Ilan University
All Institutions: Bar-Ilan University, OriginAI
The paper presents a novel unsupervised generative method for audio-visual speech separation that significantly advances the field. The technical contributions, particularly in leveraging diffusion models and visual cues, offer promising directions for future research and practical applications in speech processing.
The methodology proposed in SSNAPS is innovative, leveraging generative inverse sampling with diffusion models to separate speech from background noise in an unsupervised manner. The paper reformulates existing inverse sampling techniques to accommodate multiple independent signals and integrates visual cues from lip movements to enhance separation accuracy. The approach's flexibility in handling varying numbers of speakers and the introduction of a novel loss function for off-screen speaker separation are significant advancements. However, the reliance on visual data may limit applicability in scenarios without such cues.
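The core "combine independent priors through the mixture constraint" idea can be illustrated with a toy 1-D example. Here analytic Gaussian scores stand in for the learned speech and noise diffusion priors, and a few Langevin steps sample the posterior over the speech component under m = s + n. This is not the authors' sampler; it only shows how eliminating n via the constraint couples the two priors.

```python
# Toy illustration of generative inverse sampling with two independent priors and a
# mixture constraint m = s + n, using 1-D Gaussian priors as stand-ins for diffusion models.
import numpy as np

rng = np.random.default_rng(0)

mu_s, sig_s = 2.0, 1.0       # "speech" prior (stand-in)
mu_n, sig_n = -1.0, 0.5      # "ambient noise" prior (stand-in)
score_s = lambda x: -(x - mu_s) / sig_s**2
score_n = lambda x: -(x - mu_n) / sig_n**2

m = 0.7                      # observed mixture
s = 0.0                      # initial speech estimate
step = 1e-2
for _ in range(5000):
    # Posterior score over s once n is eliminated via n = m - s:
    #   d/ds [log p_s(s) + log p_n(m - s)] = score_s(s) - score_n(m - s)
    grad = score_s(s) - score_n(m - s)
    s = s + step * grad + np.sqrt(2 * step) * rng.standard_normal()

# Closed-form posterior mean for this Gaussian toy, for comparison.
w = sig_s**2 / (sig_s**2 + sig_n**2)
print(f"sampled s ~ {s:.2f}, analytic posterior mean = {mu_s + w * (m - mu_s - mu_n):.2f}")
```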
The experimental evaluation is robust, with comprehensive testing on mixtures of 1, 2, and 3 speakers across different noise conditions. The results demonstrate that SSNAPS consistently outperforms leading supervised baselines in terms of word error rate (WER), showcasing the effectiveness of the unsupervised approach. The paper provides detailed metrics and comparisons, enhancing the credibility of the findings. However, the paper could benefit from additional qualitative assessments or user studies to further validate the audio quality of the separated signals.
The paper includes sufficient implementation details, including datasets, model architectures, and hyperparameters, which facilitate reproducibility. The authors provide a demo page, but the absence of a public code repository limits the ability for others to reproduce the results independently. The detailed explanation of the experimental setup and evaluation metrics is commendable, yet sharing the actual code would enhance transparency.
One key limitation is the dependency on visual data for performance, which may not be available in all real-world applications. Additionally, while the method shows promise in separating speech from noise, the computational efficiency could be improved, as indicated by the longer runtime compared to supervised methods. The paper also does not address potential challenges in scenarios with more complex noise environments or overlapping speech characteristics.
The advancements presented in this paper have significant implications for various applications, including telecommunication, assistive technologies for the hearing impaired, and audio-visual media processing. By improving speech separation in noisy environments, the method could enhance user experiences in real-world settings, making communication clearer and more effective. The unsupervised nature of the approach also suggests potential for broader adoption in diverse applications without the need for extensive labeled datasets.
Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensive and prone to overfitting, while parameter-efficient methods like LoRA apply adaptation uniformly across layers, overlooking internal representations thus compromising effectiveness and efficiency. We analyze multilingual ASR models and reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Building on this observation, we propose DAMA, a Depth-Aware Model Adaptation framework that allocates adaptation capacity according to each layer's role. DAMA also introduces Singular Value Decomposition (SVD)-based initialization to constrain adaptation and preserve the U-shaped pattern, as well as a frozen middle-layer basis for further efficiency. Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves a 29% error reduction under extreme data scarcity, and significantly improves memory, training time, and computational efficiency over baselines. These results highlight the benefits of structure-aware adaptation for efficient, scalable multilingual ASR.
Primary: unknown
All Institutions: unknown
The paper presents a novel adaptation framework for multilingual speech recognition that leverages a structured analysis of layer-wise adaptability, significantly improving efficiency and performance in low-resource language settings. The comprehensive evaluation of the proposed methodology and its implications for the field highlight its potential to advance speech technology accessibility.
The proposed Depth-Aware Model Adaptation (DAMA) framework introduces a novel approach to multilingual ASR by analyzing layer-wise adaptability and implementing a U-shaped adaptability pattern. This structured adaptation strategy effectively allocates training resources, enhancing efficiency and performance in low-resource language scenarios. The integration of SVD-based initialization and Basis-Protected Projection further solidifies the method's robustness, allowing for effective adaptation while preserving essential language-agnostic representations.
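A small sketch can make the depth-aware allocation concrete: ranks follow a U-shaped profile over depth, with a frozen middle band. The quadratic profile, rank range, and frozen-band width here are illustrative assumptions rather than the paper's exact schedule.

```python
# Sketch of U-shaped, depth-aware LoRA rank allocation: large ranks at early and late
# layers, small or zero (frozen) ranks in the middle. Numbers are illustrative.
import numpy as np

def u_shaped_ranks(n_layers=24, r_min=0, r_max=16, frozen_band=(10, 14)):
    depth = np.linspace(-1.0, 1.0, n_layers)        # -1 = first layer, +1 = last layer
    profile = depth**2                               # U-shape: high at both ends
    ranks = np.rint(r_min + (r_max - r_min) * profile).astype(int)
    ranks[frozen_band[0]:frozen_band[1]] = 0         # keep a frozen middle-layer basis
    return ranks

for i, r in enumerate(u_shaped_ranks()):
    print(f"layer {i:2d}: " + ("frozen" if r == 0 else f"LoRA rank {r}"))
```

The SVD-based initialization described in the abstract would additionally seed each adapter from the dominant singular directions of the corresponding pretrained weight; that step is omitted here for brevity.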
The experiments conducted on 18 low-resource languages using two benchmark datasets (Common Voice and FLEURS) demonstrate the effectiveness of DAMA. The results indicate that DAMA not only matches or surpasses state-of-the-art performance but also significantly reduces the number of trainable parameters and computational costs. The thorough evaluation across different languages and settings adds credibility to the findings, showcasing the framework's adaptability and efficiency.
The paper provides detailed implementation details, including the datasets used, experimental setup, and hyperparameter settings, which facilitate reproducibility. However, the lack of a publicly available code repository limits the ease of replication for external researchers.
While the study reveals significant findings, it is limited to 18 languages, and the generalizability of the U-shaped adaptability pattern across even more diverse languages remains to be tested. Additionally, the method is optimized for low-resource settings, which may not translate to high-resource scenarios without further adjustments.
The findings have the potential to significantly enhance multilingual speech recognition technologies, particularly for low-resource languages, thereby promoting inclusivity in speech technology applications. This could lead to broader accessibility and usability of speech recognition systems in diverse linguistic contexts.
Emotion recognition from human speech is a critical enabler for socially aware conversational AI. However, while most prior work frames emotion recognition as a categorical classification problem, real-world affective states are often ambiguous, overlapping, and context-dependent, posing significant challenges for both annotation and automatic modeling. Recent large-scale audio language models (ALMs) offer new opportunities for nuanced affective reasoning without explicit emotion supervision, but their capacity to handle ambiguous emotions remains underexplored. At the same time, advances in inference-time techniques such as test-time scaling (TTS) have shown promise for improving generalization and adaptability in hard NLP tasks, but their relevance to affective computing is still largely unknown. In this work, we introduce the first benchmark for ambiguous emotion recognition in speech with ALMs under test-time scaling. Our evaluation systematically compares eight state-of-the-art ALMs and five TTS strategies across three prominent speech emotion datasets. We further provide an in-depth analysis of the interaction between model capacity, TTS, and affective ambiguity, offering new insights into the computational and representational challenges of ambiguous emotion understanding. Our benchmark establishes a foundation for developing more robust, context-aware, and emotionally intelligent speech-based AI systems, and highlights key future directions for bridging the gap between model assumptions and the complexity of real-world human emotion.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a benchmark for ambiguous emotion recognition in speech using audio language models under test-time scaling. This work addresses a critical gap in the field of affective computing by exploring the complexities of real-world emotions, thereby paving the way for more nuanced and context-aware AI systems.
The methodology introduces a novel benchmark for ambiguous emotion recognition using audio language models (ALMs) and test-time scaling (TTS). The systematic comparison of eight state-of-the-art ALMs and five TTS strategies across three datasets is a significant methodological contribution, as it addresses the complexity of real-world emotional states that are often not captured in traditional categorical frameworks. The paper effectively combines existing techniques in a new context, but lacks detailed descriptions of the TTS strategies and their implementation specifics, which could enhance reproducibility.
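Because the review does not detail the five TTS strategies, the following sketch shows only one generic test-time scaling pattern, self-consistency voting: sample several categorical predictions from an ALM, aggregate them into a soft emotion distribution, and score it against an ambiguous (multi-annotator) reference distribution. The ALM call and reference distribution are hypothetical placeholders.

```python
# Illustrative sketch of a generic test-time scaling strategy (self-consistency voting)
# for ambiguous emotion recognition. The ALM call is a hypothetical placeholder.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]
rng = np.random.default_rng(0)

def sample_alm_prediction(audio_id):
    """Placeholder for one stochastic ALM decoding run (e.g. temperature sampling)."""
    return rng.choice(EMOTIONS, p=[0.1, 0.45, 0.35, 0.1])

def self_consistency_distribution(audio_id, n_samples=16):
    counts = {e: 0 for e in EMOTIONS}
    for _ in range(n_samples):
        counts[sample_alm_prediction(audio_id)] += 1
    return np.array([counts[e] / n_samples for e in EMOTIONS])

def jensen_shannon(p, q, eps=1e-12):
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pred = self_consistency_distribution("utt_001")
reference = np.array([0.05, 0.50, 0.30, 0.15])   # e.g. annotator vote proportions (assumed)
print(pred, round(float(jensen_shannon(pred, reference)), 4))
```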
The experiments are well-structured, utilizing multiple datasets to validate the proposed benchmark. The evaluation metrics, while not explicitly detailed in the abstract, are likely comprehensive given the context. However, the paper could benefit from more extensive quantitative results and visualizations to better illustrate the performance differences between models and TTS strategies. The inclusion of qualitative analyses or case studies could also provide deeper insights into the model's handling of ambiguous emotions.
The paper does not provide sufficient implementation details or access to code and data, which raises concerns about reproducibility. While it mentions the use of existing datasets, without clear guidelines or links to the datasets and the specific configurations used in experiments, it may be challenging for other researchers to replicate the findings.
The paper acknowledges the complexity of real-world emotions but does not fully address the limitations of the proposed methods. For instance, the reliance on existing ALMs may limit the generalizability of the findings. Furthermore, the interaction between TTS and model capacity could be explored more rigorously, as the current analysis may not capture all nuances of the performance variations.
The research has significant implications for the development of emotionally intelligent AI systems, particularly in conversational agents and social robotics. By providing a framework for understanding ambiguous emotions, this work could enhance user interactions in various applications, from customer service to mental health support. The establishment of a benchmark for ambiguous emotion recognition also opens avenues for future research in affective computing.
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
Primary: Inria, LIRMM, Université de Montpellier
All Institutions: Inria, LIRMM, Université de Montpellier, Earth Species Project, University of Kassel
The main contribution of this paper is the introduction of a novel contrastive distillation method for audio-to-image retrieval that effectively utilizes text as a semantic intermediary, significantly advancing the field of bioacoustic species recognition. The technical contributions are substantial, providing a practical solution to a challenging problem in a data-scarce environment, and the methodology is both innovative and well-executed, with promising experimental results.
The methodology presented in this paper is innovative as it proposes a contrastive distillation approach to bridge audio and image modalities without requiring paired data. By leveraging a pretrained image-text model (BioCLIP-2) to enhance the audio-text model (BioLingual), the authors effectively create a semantic intermediary that facilitates meaningful audio-to-image retrieval. The use of a contrastive objective for fine-tuning the audio encoder is well-justified and demonstrates a clear understanding of the underlying challenges in cross-modal representation learning. The simplicity of the approach, which avoids complex multi-objective training and direct image supervision, is a significant strength.
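A minimal sketch of the text-mediated distillation, under the assumption that the objective is the standard symmetric InfoNCE form: the audio encoder is fine-tuned so its embeddings align with frozen text embeddings of the species names from the image-text model. Small random tensors and a single linear layer stand in for BioLingual's audio encoder and BioCLIP-2's text embeddings.

```python
# Minimal sketch of contrastive distillation into a frozen text embedding space.
# The encoders are placeholders; the symmetric InfoNCE loss is an assumed reading
# of the paper's contrastive objective.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, d_audio, d_text = 8, 256, 512

audio_encoder = torch.nn.Linear(d_audio, d_text)             # stand-in, trainable
audio_feats = torch.randn(B, d_audio)                        # one recording per species
text_targets = F.normalize(torch.randn(B, d_text), dim=-1)   # frozen text embeddings

opt = torch.optim.AdamW(audio_encoder.parameters(), lr=1e-3)
temperature = 0.07

for _ in range(200):
    a = F.normalize(audio_encoder(audio_feats), dim=-1)
    logits = a @ text_targets.T / temperature   # audio-to-text similarity matrix
    targets = torch.arange(B)
    # Symmetric InfoNCE: matching audio/text pairs sit on the diagonal.
    loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final distillation loss: {loss.item():.3f}")
```

Because the image tower of BioCLIP-2 shares its embedding space with the text tower, aligning audio to text in this way is what induces the emergent audio-image retrieval without any paired audio-image data.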
The experiments are robust, utilizing multiple bioacoustic benchmarks to validate the effectiveness of the proposed method. The results indicate that the distilled audio encoder not only improves audio-to-image retrieval performance but also preserves the discriminative capabilities of the audio model. The comparisons against various baselines, including zero-shot and text-embedding mapping strategies, provide a comprehensive evaluation of the method's effectiveness. The use of independent datasets for validation strengthens the credibility of the findings.
The paper mentions that the code will be publicly available after review, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics, such as hyperparameter settings, training duration, and computational resources, which are essential for other researchers to replicate the experiments fully.
One limitation of the study is the reliance on the quality and representativeness of the textual descriptions used for training the audio encoder. If the textual descriptions are not sufficiently diverse or comprehensive, it may impact the generalization of the model. Additionally, while the approach demonstrates strong performance on the evaluated datasets, its applicability to other domains or species not represented in the training data remains uncertain.
The implications of this research are significant for biodiversity monitoring and conservation efforts, particularly in scenarios where paired audio-image data is scarce. By enabling effective audio-to-image retrieval, the proposed method can assist researchers and conservationists in identifying species based on audio recordings, thus enhancing ecological studies and wildlife conservation strategies.
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs compared to the SOTA lightweight model PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: https://anonymous.4open.science/r/dvpd_demo-E630
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology, Tsinghua University, Sun Yat-sen University
The paper presents a significant contribution to the field of speech enhancement by introducing a novel dual-view approach that balances high-fidelity speech quality with computational efficiency. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on future research and applications in audio processing.
The proposed Dual-View Predictive Diffusion (DVPD) model introduces a novel approach to speech enhancement by leveraging the dual nature of spectrograms as both visual textures and physical frequency-domain representations. The methodology includes the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which effectively preserves critical low-frequency harmonics while reducing high-frequency redundancies, and the Lightweight Image-based Spectro-Awareness (LISA) module, which captures features from a visual perspective. The Training-free Lossless Boost (TLB) strategy further enhances the model's performance during inference without additional training, showcasing a well-thought-out integration of predictive and generative paradigms.
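The non-uniform compression idea can be sketched simply: low-frequency rows of the spectrogram, where harmonics concentrate, are kept at full resolution, while high-frequency rows are average-pooled more aggressively. The band split and pooling factors below are illustrative assumptions, not the FANC encoder's exact design.

```python
# Sketch of frequency-adaptive non-uniform compression in the spirit of FANC.
# The 3-band split and pooling factors are illustrative assumptions.
import numpy as np

def fanc_compress(spec, bands=((0, 64, 1), (64, 160, 2), (160, 257, 4))):
    """spec: (n_freq, n_frames). Each band is (start_bin, end_bin, pool_factor)."""
    out = []
    for lo, hi, pool in bands:
        band = spec[lo:hi]
        n_bins = band.shape[0] - band.shape[0] % pool           # trim to a multiple of pool
        pooled = band[:n_bins].reshape(-1, pool, band.shape[1]).mean(axis=1)
        out.append(pooled)
    return np.concatenate(out, axis=0)

spec = np.abs(np.random.randn(257, 400)).astype(np.float32)     # e.g. |STFT|, 257 bins
compressed = fanc_compress(spec)
print(spec.shape, "->", compressed.shape)   # (257, 400) -> (136, 400)
```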
The experiments are extensive, covering various benchmarks including WSJ0-UNI and VBDMD, demonstrating the model's effectiveness across different distortion scenarios. The results indicate that DVPD achieves state-of-the-art performance while significantly reducing computational complexity compared to existing models. The comprehensive evaluation metrics used, such as PESQ and ESTOI, provide a robust assessment of the model's capabilities.
The paper includes detailed implementation details, including training configurations, loss functions, and evaluation metrics, which are essential for reproducibility. However, the absence of a public code repository limits the ease of reproduction for other researchers.
While the model demonstrates impressive performance, it may still struggle with certain types of distortions not covered in the training datasets. Additionally, the reliance on specific hyperparameters for the TLB strategy may introduce variability in performance across different applications.
The advancements presented in this paper have significant implications for real-world applications in speech enhancement, particularly in noisy environments. The lightweight nature of the model makes it suitable for deployment in resource-constrained settings, potentially benefiting various industries, including telecommunications and assistive technologies.
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic -- complemented by strict intelligibility and duration constraints -- we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
Primary: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences
All Institutions: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, Department of Automation, Tsinghua University, Beijing National Research Center for Information Science and Technology, Tsinghua University
The paper presents a novel framework for imperceptible text-based speech editing that effectively separates content modification from acoustic reconstruction. This approach significantly advances the state of the art, addressing key challenges in speech editing and offering promising applications across multiple domains.
The proposed methodology introduces a novel framework for text-based speech editing that effectively decouples semantic content from acoustic features, addressing the limitations of existing methods that often lead to artifacts and instability. The use of a Flow Matching decoder for acoustic reconstruction and a Self-Consistency Rewards mechanism for perceptual alignment is innovative and well-justified, leveraging a pre-trained TTS model as an implicit critic. This dual-stage approach enhances both intelligibility and naturalness, making significant strides in the field of speech editing.
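The group-relative part of the optimization can be shown in isolation: several candidate edited sequences are sampled for the same edit request, each is scored by a reward, and advantages are the rewards standardized within that group. The reward function below is a placeholder standing in for the paper's TTS-critic self-consistency reward plus intelligibility and duration constraints; the policy update itself is not reproduced.

```python
# Sketch of GRPO-style group-relative advantage computation. The reward function is
# a hypothetical placeholder; the actual policy update and reward terms are omitted.
import numpy as np

rng = np.random.default_rng(0)

def reward(candidate_tokens):
    """Hypothetical scalar reward for one candidate edit (higher = more consistent)."""
    return float(rng.normal())

def group_relative_advantages(group_rewards, eps=1e-8):
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)   # zero-mean, unit-variance within the group

group = [f"candidate_{i}" for i in range(8)]  # 8 sampled edits for one request
advantages = group_relative_advantages([reward(c) for c in group])
print(np.round(advantages, 2), round(float(advantages.mean()), 6))
```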
The experiments are comprehensive, utilizing a large-scale dataset (Libriheavy) and rigorous benchmarks for evaluation. The authors provide detailed comparisons against state-of-the-art models, demonstrating clear improvements in metrics such as WER, speaker similarity, and perceptual quality. The use of both objective and subjective metrics strengthens the evaluation, although further details on the statistical significance of results would enhance the robustness of the findings.
The paper includes sufficient implementation details, including training configurations and the architecture of the models used. However, the absence of a publicly available code repository limits full reproducibility. Providing access to the code and trained models would significantly enhance the paper's impact and allow for independent verification of results.
While the proposed method shows strong performance, the paper does not address potential limitations in terms of computational efficiency or the scalability of the approach to diverse languages or dialects. Additionally, the reliance on a pre-trained TTS model may introduce biases based on the training data used for that model.
The implications of this research are significant for various applications, including media production, accessibility technologies, and real-time speech editing in communication tools. The ability to edit speech seamlessly could enhance user experience and efficiency in numerous fields, from entertainment to education.
High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform reconstruction. As a result, when operating at ultra-low bitrates, these methods degrade rapidly and often fail to preserve essential information, leading to severe acoustic artifacts and pronounced semantic distortion. To overcome these limitations, we introduce Generative Audio Compression (GAC), a novel paradigm shift from signal fidelity to task-oriented effectiveness. Implemented within the AI Flow framework, GAC is theoretically grounded in the Law of Information Capacity. These foundations posit that abundant computational power can be leveraged at the receiver to offset extreme communication bottlenecks--exemplifying the More Computation, Less Bandwidth philosophy. By integrating semantic understanding at the transmitter with scalable generative synthesis at the receiver, GAC offloads the information burden to powerful model priors. Our 1.8B-parameter model achieves high-fidelity reconstruction of 32kHz general audio at an unprecedented bitrate of 0.275kbps. Even at 0.175kbps, it still preserves a strong intelligible audio transmission capability, which represents an about 3000x compression ratio, significantly outperforming current state-of-the-art neural codecs in maintaining both perceptual quality and semantic consistency.
Primary: Institute of Artificial Intelligence, China Telecom
All Institutions: Institute of Artificial Intelligence, China Telecom
The paper introduces a novel paradigm for audio compression that prioritizes semantic understanding and generative synthesis, achieving unprecedented performance at ultra-low bitrates. This work not only advances the state-of-the-art in audio compression but also opens new avenues for research in generative models and communication theory.
The proposed Generative Audio Compression (GAC) method represents a significant shift from traditional audio compression techniques by focusing on task-oriented effectiveness rather than pure signal fidelity. The integration of semantic understanding at the transmitter and generative synthesis at the receiver is a novel approach that leverages the Law of Information Capacity to optimize the trade-off between computation and bandwidth. The methodology is well-grounded in theoretical frameworks and employs advanced techniques such as latent-variable modeling and variational objectives, showcasing a comprehensive understanding of both audio processing and machine learning principles.
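A quick back-of-the-envelope check supports the bitrate claims quoted in the abstract, assuming the uncompressed reference is 32 kHz, 16-bit, mono PCM (the review does not state the exact reference the authors use):

```python
# Back-of-the-envelope check of the compression-ratio claims, assuming 32 kHz,
# 16-bit, mono PCM (512 kbps) as the uncompressed reference.
sample_rate_hz = 32_000
bit_depth = 16
pcm_kbps = sample_rate_hz * bit_depth / 1000          # 512 kbps

for gac_kbps in (0.275, 0.175):
    print(f"{gac_kbps} kbps -> {pcm_kbps / gac_kbps:.0f}x compression vs {pcm_kbps:.0f} kbps PCM")
# 0.175 kbps gives roughly 2926x, consistent with the paper's "about 3000x" claim.
```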
The experiments are robust, covering multiple audio domains (speech, general sound, and music) and employing both objective and subjective evaluation metrics. The results demonstrate GAC's superior performance in maintaining perceptual quality and semantic consistency at extremely low bitrates, significantly outperforming existing state-of-the-art methods. The use of diverse datasets and thorough evaluation metrics strengthens the credibility of the findings.
While the paper provides a detailed description of the methodology and experimental setup, it lacks explicit implementation details or links to code repositories, which could hinder reproducibility. The absence of a demo or project URL further limits the ability for others to replicate the results.
One notable limitation is the trade-off between perceptual quality and speaker identity preservation at lower bitrates, which could affect applications requiring high fidelity in speaker recognition. Additionally, the reliance on large model sizes may limit practical deployment in resource-constrained environments.
The implications of GAC are significant for applications in low-bandwidth communication and generative audio-language modeling, potentially transforming how audio is transmitted and processed in various contexts. The approach could lead to advancements in telecommunication, streaming services, and assistive technologies, making high-quality audio accessible even in challenging bandwidth scenarios.
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter noisy reference audios, imperfect text prompts, and diverse downstream processing, which can significantly hurt robustness. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline, including input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models. Our evaluation uncovers substantial robustness gaps in VC: performance can deteriorate sharply under common input shifts and post-processing; long-context and cross-lingual scenarios further expose stability limitations; and both passive noise and proactive perturbation influence generation robustness. Collectively, these findings provide a unified picture of how current VC models fail in practice and introduce a standardized, open-source testbed to support the development of more robust and deployable VC models. We open-source our project at https://github.com/Nanboy-Ronan/RVCBench.
Primary: The University of British Columbia
All Institutions: The University of British Columbia, Vector Institute
The main contribution of this paper is the introduction of RVCBench, a comprehensive benchmark for evaluating the robustness of voice cloning models under realistic conditions. This work significantly advances the understanding of the limitations of current voice cloning technologies and provides a valuable resource for future research aimed at improving their robustness and applicability.
The paper introduces RVCBench, a benchmark designed to evaluate the robustness of voice cloning models across various challenges. The methodology is comprehensive, covering a wide range of robustness tasks and including a significant dataset of 225 speakers and over 14,000 utterances. The authors systematically assess the performance of 11 modern voice cloning models under different conditions, which is a valuable approach to understanding the limitations of current technology. However, the paper could benefit from a more detailed explanation of how the robustness tasks were selected and the specific metrics used for evaluation.
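The evaluation pattern behind the input-variation tasks can be sketched as follows: corrupt the reference audio with additive noise at a target SNR and compare a quality metric before and after. The `clone_voice` and `speaker_similarity` callables are hypothetical placeholders for a VC model and a speaker-verification scorer, not RVCBench APIs.

```python
# Sketch of an input-variation robustness check in the spirit of RVCBench.
# `clone_voice` and `speaker_similarity` are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

def add_noise_at_snr(ref, snr_db):
    noise = rng.standard_normal(ref.shape)
    ref_power, noise_power = np.mean(ref**2), np.mean(noise**2)
    scale = np.sqrt(ref_power / (noise_power * 10 ** (snr_db / 10)))
    return ref + scale * noise

def clone_voice(reference_audio, text):          # placeholder VC model
    return reference_audio * 0.9

def speaker_similarity(generated, reference):    # placeholder cosine-style scorer
    g, r = generated.ravel(), reference.ravel()
    return float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r) + 1e-9))

reference = rng.standard_normal(16000 * 3)       # 3 s of 16 kHz reference audio
for snr_db in (None, 20, 10, 0):
    label = "clean" if snr_db is None else f"{snr_db} dB SNR"
    ref_in = reference if snr_db is None else add_noise_at_snr(reference, snr_db)
    out = clone_voice(ref_in, "hello world")
    print(f"{label}: similarity to clean reference = {speaker_similarity(out, reference):.3f}")
```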
The experiments are well-structured, with a clear focus on identifying performance gaps in voice cloning models under realistic conditions. The inclusion of various input variations and adversarial perturbations is a strong point, as it reflects real-world challenges. The results highlight significant robustness issues, which are crucial for advancing the field. However, the paper lacks a comparative analysis with existing benchmarks, which would strengthen its contributions.
The paper mentions that the project is open-sourced, which is a positive aspect for reproducibility. However, it lacks detailed implementation instructions or specific configurations used during experiments, which could hinder other researchers from replicating the results effectively.
One limitation is the potential bias in the selection of speakers and utterances, which may not represent the full diversity of voice characteristics in the real world. Additionally, while the benchmark covers various robustness tasks, it may not encompass all possible deployment scenarios that could affect voice cloning performance.
The findings of this paper have significant implications for the development of more robust voice cloning technologies, which could enhance applications in personalized speech interfaces and dubbing. By identifying and addressing robustness gaps, the research can contribute to safer and more reliable deployment of voice cloning systems in real-world applications.
We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints -- scaling from short loops to 10-minute compositions -- while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities -- such as cover generation, repainting, and vocal-to-BGM conversion -- while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of ACE-Step v1.5, an efficient open-source music generation model that combines novel architectural elements with user-friendly personalization features. This work significantly advances the state of music generation technology, particularly for consumer hardware, while raising important questions regarding reproducibility and ethical implications in the field.
The methodology introduces a hybrid architecture that combines a Language Model (LM) with a Diffusion Transformer (DiT) to generate music. The use of intrinsic reinforcement learning to align the LM's planning capabilities with the DiT's synthesis process is a notable innovation. The model's ability to generate music based on simple user queries and to personalize outputs with minimal input data is a significant advancement in the field of music generation. However, the paper could benefit from a more detailed explanation of the reinforcement learning mechanism and how it mitigates biases.
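The reported few-song LoRA personalization can be illustrated with a minimal, self-contained sketch of a LoRA-style adapter wrapped around a frozen linear projection. The module, rank, and dimensions are placeholders, since the DiT's internals are not specified here; this is not ACE-Step's own adapter code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """LoRA-style adapter: a frozen pretrained projection plus a low-rank
    trainable update (hypothetical sketch, not ACE-Step's implementation)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: adapt one attention projection of a (placeholder) DiT block.
proj = nn.Linear(1024, 1024)
adapted = LoRALinear(proj, rank=16)
print(sum(p.numel() for p in adapted.parameters() if p.requires_grad))  # ~32k trainable weights
```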
The paper claims that ACE-Step v1.5 achieves superior performance on commonly used evaluation metrics compared to existing commercial models. The reported generation times are impressive, especially for consumer hardware, and the ability to run on low VRAM is a practical advantage. However, the paper lacks detailed experimental results, including quantitative comparisons with baseline models, which would strengthen the claims made about performance and efficiency.
The availability of code, model weights, and a demo is a positive aspect, promoting reproducibility. However, the paper does not provide sufficient details on the training process, dataset specifics, or evaluation metrics used, which are crucial for other researchers to replicate the results effectively.
One limitation is the lack of extensive evaluation on diverse datasets to validate the model's performance across various music genres and styles. Additionally, the reliance on intrinsic reinforcement learning may limit the model's adaptability to more complex user preferences that external reward models could capture. The paper also does not address potential ethical considerations regarding music generation and copyright issues.
The potential applications of ACE-Step v1.5 are vast, ranging from aiding music artists in their creative processes to providing tools for content creators. Its ability to generate high-quality music quickly and with low resource requirements could democratize music production, making it accessible to a broader audience. However, the implications of AI-generated music on the music industry and artist livelihoods should be carefully considered.
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio, which was trained on a dataset roughly 500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibit remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing purity of supervised signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://shandaai.github.io/Hive.
Primary: Tsinghua University
All Institutions: Tsinghua University, Shanda AI Research, Johns Hopkins University, Chinese Institute for Brain Research
The main contribution of this paper is the introduction of Hive, a high-quality synthetic dataset for query-based universal sound separation, which demonstrates that prioritizing data purity can lead to significant improvements in model performance with reduced computational costs. The comprehensive methodology and experimental validation provide a strong foundation for future research in audio separation and related fields.
The paper presents a novel automated pipeline for data cleaning and synthesis, addressing the critical issue of co-occurrence in audio datasets. The authors propose a comprehensive approach that includes ontology reconstruction, semantic-acoustic alignment, and a semantically consistent mixing strategy. This methodology is well-structured and demonstrates a clear understanding of the challenges in query-based universal sound separation (USS). The use of multimodal large models for semantic filtering is particularly innovative, as it enhances the purity of the training data, which is crucial for effective model training.
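The semantically consistent mixing step can be sketched as follows: a high-purity target segment is combined, at a random SNR, with an interferer whose class is allowed to co-occur with the target. The sketch is illustrative only and does not reproduce the Hive pipeline.

```python
import random
import numpy as np

def make_mixture(target: np.ndarray, target_label: str,
                 interferers: dict, forbidden: set,
                 snr_range=(-5.0, 10.0)):
    """Mix a high-purity target segment with one interferer whose class is
    allowed to co-occur with the target (illustrative, not the Hive pipeline)."""
    allowed = [lab for lab in interferers
               if lab != target_label and lab not in forbidden]
    lab = random.choice(allowed)
    noise = interferers[lab]
    if len(noise) < len(target):                      # pad short interferers
        noise = np.pad(noise, (0, len(target) - len(noise)))
    noise = noise[: len(target)]

    snr_db = random.uniform(*snr_range)               # random mixing SNR
    t_pow = np.mean(target ** 2) + 1e-12
    n_pow = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(t_pow / (n_pow * 10.0 ** (snr_db / 10.0)))
    return target + noise, lab

# Example: a "dog_bark" target mixed with anything except classes assumed to
# co-occur with it (the class names here are hypothetical).
target = np.random.randn(16000)
bank = {"siren": np.random.randn(16000), "kennel_ambience": np.random.randn(16000)}
mix, used = make_mixture(target, "dog_bark", bank, forbidden={"kennel_ambience"})
```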
The experimental results are robust, showcasing the effectiveness of the Hive dataset compared to existing large-scale datasets. The authors provide thorough evaluations using multiple models, demonstrating competitive performance in separation accuracy and perceptual quality. The zero-shot generalization capabilities of models trained on Hive further validate the dataset's utility. However, while the results are promising, the paper could benefit from additional comparative analyses with more diverse datasets to strengthen the claims.
The paper includes detailed implementation details and provides access to the dataset and code, which enhances reproducibility. The authors specify the training configurations and evaluation metrics used, allowing other researchers to replicate the experiments. However, the reliance on specific multimodal models for semantic alignment may limit reproducibility if those models are not widely accessible.
One notable limitation is the potential for bias in the automated pipeline, as it relies on model-based decisions that may propagate existing biases in the training data. Additionally, while the Hive dataset is designed to mitigate co-occurrence noise, it may not fully capture the complexities of real-world acoustic environments. The authors also acknowledge the ethical implications of their work, particularly concerning privacy and misuse of the technology.
The proposed methodology and dataset have significant implications for advancing computational auditory scene analysis and making robust auditory models more accessible. The focus on data efficiency could democratize AI applications in areas like immersive audio and assistive listening. However, the potential for misuse of the technology raises ethical concerns that need to be addressed through responsible deployment and usage guidelines.
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Honda Research Institute Japan
The paper presents CALM, a pioneering framework that effectively combines acoustic and linguistic cues for improved multi-speaker ASR performance. This comprehensive analysis highlights the framework's innovative methodology, rigorous experimental validation, and potential impact on the field of speech recognition.
The proposed CALM framework introduces a novel joint Contextual Acoustic-Linguistic Modeling approach for multi-speaker ASR, integrating target-speaker conditioning with dynamic vocabulary expansion. This end-to-end framework leverages speaker embeddings for target-speaker extraction and contextual biasing, addressing both acoustic and linguistic challenges in overlapping speech scenarios. The methodology is well-structured, employing advanced techniques such as Conformer and Transformer architectures, and includes a comprehensive loss function that combines multiple objectives to enhance performance.
The experiments are robust, utilizing multiple datasets (LibriSpeechMix, CSJMix, AMI) to validate the effectiveness of CALM across different languages and conditions. The reported results demonstrate substantial improvements in biased and unbiased word error rates, showcasing the framework's ability to enhance ASR performance in multi-speaker contexts. The use of various biasing list sizes and the detailed analysis of results provide a thorough evaluation of the framework's capabilities.
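For readers unfamiliar with the biased metrics, the sketch below computes a biased WER by counting errors only on words from the biasing list after a standard sequence alignment. This is a generic formulation, not necessarily the exact scoring script used by the authors.

```python
from difflib import SequenceMatcher

def biased_wer(ref: list, hyp: list, biasing: set) -> float:
    """WER restricted to words from the biasing list (generic formulation)."""
    sm = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    errors = 0
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("replace", "delete"):
            errors += sum(w in biasing for w in ref[i1:i2])   # sub/del on biased ref words
        elif op == "insert":
            errors += sum(w in biasing for w in hyp[j1:j2])   # inserted biased words
    denom = sum(w in biasing for w in ref)                    # biased words in the reference
    return errors / max(denom, 1)

# One substitution on the single biased reference word -> B-WER of 1.0.
print(biased_wer("call doctor smith now".split(),
                 "call doctor smyth now".split(),
                 {"smith", "smyth"}))
```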
The paper provides sufficient implementation details, including architecture specifications, training procedures, and evaluation metrics. However, the lack of a public repository or demo URL limits the ease of reproducibility for external researchers. Clearer guidelines or access to the code would enhance the paper's reproducibility.
While CALM shows promising results, the paper acknowledges challenges such as increased insertion errors in conversational datasets like AMI, particularly for short utterances. The reliance on enrollment utterances may also limit practical applications in real-world scenarios where such data may not be readily available. Additionally, the performance degradation observed in certain conditions suggests that further optimization is needed for broader applicability.
The integration of acoustic and linguistic modeling in CALM has significant implications for personalized AI applications, particularly in multi-speaker ASR settings such as meetings and discussions. The advancements made could lead to more accurate transcription services, enhancing accessibility and usability in various domains, including education, business, and healthcare.
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. These segments are interleaved with label sequences during training, enabling seamless integration with the LLM. During inference, the audio stream is buffered until the MoChA module triggers a read signal, at which point the buffered segment together with the previous token is fed into the LLM for the next token prediction. We also introduce a minimal-latency training objective to guide the policy network toward accurate segmentation boundaries. Furthermore, we adopt a joint training strategy in which a non-streaming LLM-ASR model and our streaming model share parameters. Experiments on the AISHELL-1 and AISHELL-2 Mandarin benchmarks demonstrate that our method consistently outperforms recent streaming ASR baselines, achieving character error rates of 5.1% and 5.5%, respectively. The latency optimization results in a 62.5% reduction in average token generation delay with negligible impact on recognition accuracy.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Shaanxi Normal University, iFLYTEK Co, iFLYTEK Research
This paper presents a novel approach to streaming speech recognition that integrates large language models with advanced segmentation techniques, significantly improving both latency and accuracy in ASR systems. The comprehensive methodology and strong experimental results position this work as a meaningful contribution to the field of machine learning and speech recognition.
The proposed methodology leverages a read/write policy network integrated with monotonic chunkwise attention (MoChA) to facilitate real-time streaming ASR. This innovative approach allows for dynamic segmentation of audio inputs, which is a significant advancement over traditional methods that often rely on fixed-size audio chunks. The introduction of a minimal-latency training objective to optimize the segmentation boundaries is particularly noteworthy, as it addresses a critical challenge in streaming ASR systems. The joint training strategy that shares parameters between streaming and non-streaming models is also a clever way to enhance efficiency and performance.
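The buffered inference loop described above can be summarized in a short conceptual sketch; `policy`, `llm_step`, and the token conventions are placeholders standing in for components the paper describes, not the authors' code.

```python
# Conceptual sketch of the buffered streaming loop: frames are buffered until
# the read/write policy fires, then the flushed segment plus the previous
# token drives the next-token prediction. All interfaces are placeholders.
def stream_decode(frames, policy, llm_step, bos_token=0, eos_token=1):
    buffer, tokens = [], [bos_token]
    for frame in frames:                         # frames arrive one at a time
        buffer.append(frame)
        if policy.should_read(buffer):           # MoChA-style read signal
            segment, buffer = buffer, []         # flush the buffered segment
            token = llm_step(segment, tokens[-1])  # next-token prediction
            if token == eos_token:
                break
            tokens.append(token)
    return tokens[1:]
```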
The experiments conducted on the AISHELL-1 and AISHELL-2 Mandarin benchmarks are comprehensive and demonstrate the effectiveness of the proposed method. The reported character error rates (CER) of 5.1% and 5.5% are competitive, and the significant reduction in average token generation delay (62.5%) highlights the practical benefits of the approach. The use of ablation studies to validate the contributions of different components of the model adds rigor to the experimental evaluation.
The paper provides sufficient details regarding the model architecture, training strategy, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results.
One limitation of the study is the focus on Mandarin datasets, which may restrict the generalizability of the findings to other languages or dialects. Additionally, while the model shows promising results, the trade-off between latency and accuracy could be further explored, particularly in more diverse real-world scenarios.
The advancements in streaming ASR have significant implications for applications such as real-time transcription, live captioning, and interactive voice response systems. The ability to reduce latency while maintaining accuracy can enhance user experience in various settings, including education, customer service, and accessibility for individuals with hearing impairments.
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.
Primary: Communication University of China
All Institutions: Ant Group, Communication University of China, Key Laboratory of Media Audio, Ministry of Education, State Key Laboratory of Media Convergence and Communication
The main contribution of this paper is the introduction of SDD-APALLM, a novel framework that enhances speech deepfake detection by explicitly exposing fine-grained acoustic evidence, thereby improving model robustness and interpretability. This work addresses a significant gap in the current methodologies for audio LLMs, providing a promising direction for future research in the field of audio processing and deepfake detection.
The proposed methodology, SDD-APALLM, innovatively enhances the accessibility of fine-grained acoustic evidence by integrating structured time-frequency representations alongside raw audio inputs. This approach effectively shifts the focus from semantic plausibility to acoustically grounded evidence, addressing a critical limitation in existing audio LLM-based speech deepfake detection methods. The use of Constant-Q Transform (CQT) to create visual tokens that highlight spectral structures linked to speech synthesis artifacts is particularly noteworthy, as it provides a clear mechanism for improving model interpretability and robustness.
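As an illustration of the structured spectrogram input, the sketch below renders a Constant-Q spectrogram as a normalized image array suitable for a vision-style adapter; the specific CQT settings (hop length, number of bins) are illustrative rather than the paper's.

```python
import numpy as np
import librosa

def cqt_image(path: str, sr: int = 16000) -> np.ndarray:
    """Render a Constant-Q spectrogram as a normalized 3-channel image array,
    usable as a visual input alongside the raw waveform (illustrative settings)."""
    y, sr = librosa.load(path, sr=sr)
    C = np.abs(librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12))
    C_db = librosa.amplitude_to_db(C, ref=np.max)                  # log-magnitude in dB
    img = (C_db - C_db.min()) / (C_db.max() - C_db.min() + 1e-8)   # scale to [0, 1]
    return np.repeat(img[None, ...], 3, axis=0)                    # replicate as RGB channels
```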
The experiments are comprehensive, involving both in-domain and cross-domain evaluations across multiple datasets (ASVspoof2019 LA and ASVspoof2021 LA). The results demonstrate significant improvements in detection accuracy and robustness when utilizing the proposed framework, particularly under conditions where traditional models struggle. The ablation studies effectively illustrate the contributions of different modalities and reinforce the claim that explicit acoustic evidence enhances performance.
The paper provides detailed implementation information, including model architecture, training objectives, and hyperparameters, which supports reproducibility. However, the absence of a publicly accessible code repository or demo limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world audio deepfakes. Additionally, while the approach improves robustness, it may still be susceptible to novel spoofing techniques that exploit different acoustic characteristics not covered in the training data.
The implications of this research extend to various applications in security and trustworthiness of speech-based systems, such as voice authentication and content verification. By improving the detection of speech deepfakes, this work contributes to safeguarding against misinformation and enhancing the integrity of audio communications.
Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters, less than 1/30 of those updated in full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer's effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of EmoShift, a lightweight activation-steering framework that significantly enhances emotional expressiveness in TTS systems while maintaining naturalness and speaker similarity. This work represents a meaningful advancement in the field of emotion-aware speech synthesis, addressing critical limitations of existing approaches and providing a foundation for future research in emotional control in TTS.
The proposed EmoShift framework introduces a novel EmoSteer layer that learns emotion-specific steering vectors, allowing for precise emotional control in TTS without retraining the base model. The methodology is well-structured, leveraging activation steering to inject emotion-specific offsets in a plug-and-play manner. This approach is innovative as it addresses the limitations of existing emotion-aware TTS systems that rely on fixed emotion embeddings or external guidance. The model's architecture is designed to be model-agnostic, which enhances its applicability across various TTS systems. The integration of objective and subjective evaluations to assess performance is commendable, providing a holistic view of the model's effectiveness.
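A minimal sketch of an activation-steering layer of the kind described, with one learned offset vector per emotion added to the output embeddings and a scalar intensity knob, is given below; the actual EmoSteer layer may differ in detail.

```python
import torch
import torch.nn as nn

class EmoSteerSketch(nn.Module):
    """Minimal activation-steering sketch: one learned offset vector per
    emotion, added to the TTS model's output embeddings. Illustrative only."""
    def __init__(self, num_emotions: int, dim: int):
        super().__init__()
        self.steer = nn.Embedding(num_emotions, dim)
        nn.init.zeros_(self.steer.weight)        # start from the unmodified model

    def forward(self, hidden: torch.Tensor, emotion_id: torch.Tensor,
                intensity: float = 1.0) -> torch.Tensor:
        # hidden: (batch, time, dim); emotion_id: (batch,)
        offset = self.steer(emotion_id).unsqueeze(1)     # (batch, 1, dim)
        return hidden + intensity * offset               # steer every frame

# Usage: steer a batch of hidden states toward emotion index 2 with reduced intensity.
layer = EmoSteerSketch(num_emotions=5, dim=512)
out = layer(torch.randn(4, 100, 512), torch.full((4,), 2), intensity=0.8)
```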
The experimental setup is robust, utilizing a well-defined dataset (ESD) and comparing EmoShift against strong baselines, including a fully fine-tuned model and a model with the EmoSteer layer. The results demonstrate significant improvements in emotional expressiveness while maintaining naturalness and speaker similarity. The use of both objective metrics (WER, SpkSIM, DNSMOS) and subjective metrics (MOS, Emo-MOS) strengthens the evaluation, confirming the model's capabilities across multiple dimensions of TTS performance.
The paper provides sufficient details regarding the experimental setup, including training parameters, dataset partitioning, and evaluation metrics, which aids in reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on a specific dataset (ESD), which may affect the generalizability of the results to other languages or emotional contexts. Additionally, while the EmoSteer layer shows promise for emotional control, the paper does not explore the impact of using more diverse or compound emotions, which could enhance the model's applicability in real-world scenarios.
The EmoShift framework has significant implications for applications in virtual assistants, audiobooks, and human-machine dialogue systems, where emotional expressiveness is crucial for user engagement and interaction quality. By enabling more nuanced emotional control in TTS, this work could enhance user experiences in various domains, including education, entertainment, and accessibility.
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Critically, we leverage the In-Context Learning capability of pre-trained LALMs to formulate MCLP via a continuation log-probability prediction. This metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and Role-Play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Beihang University, StepFun
The paper presents a significant contribution to the field of machine learning by addressing the challenge of stylistic consistency in role-play TTS through the innovative use of MCLP and a hybrid reward mechanism. The methodology is robust, and the experimental results demonstrate its effectiveness, marking a meaningful advancement in the capabilities of TTS systems.
The paper introduces a novel metric, Mean Continuation Log-Probability (MCLP), which quantifies stylistic consistency in TTS systems using the capabilities of pre-trained Large Audio Language Models (LALMs). The methodology is well-structured, combining supervised fine-tuning (SFT) and reinforcement learning (RL) to optimize TTS for role-play scenarios. The integration of MCLP as both an evaluation metric and a reward signal is innovative, providing a more nuanced approach to measuring stylistic adherence in generated speech. The use of a hybrid reward function that balances style and content fidelity is a significant advancement in addressing the challenges of role-play TTS.
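Schematically, MCLP can be computed as the mean log-probability of the ground-truth token sequence when the generated speech is supplied as the in-context prefix. The model interface assumed below (a causal LALM over discrete audio tokens that returns logits) is an assumption, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mclp(model, generated_tokens: torch.Tensor, reference_tokens: torch.Tensor) -> float:
    """Mean Continuation Log-Probability, schematic form: average log-prob of
    the ground-truth tokens given the generated speech as in-context prefix.
    `model` is assumed to be a causal LALM over discrete audio tokens that
    returns logits of shape (1, seq_len, vocab)."""
    seq = torch.cat([generated_tokens, reference_tokens], dim=-1).unsqueeze(0)
    logits = model(seq).logits                            # (1, T, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)     # predictions for positions 1..T-1
    targets = seq[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    start = generated_tokens.numel() - 1                  # keep only the continuation span
    return token_lp[0, start:].mean().item()
```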
The experiments are comprehensive, utilizing a newly constructed RP-TTS dataset with rich annotations that enhance the evaluation of the proposed method. The results demonstrate significant improvements over strong baselines in both objective and subjective metrics, indicating the effectiveness of MCLP in real-world applications. The paper includes rigorous ablation studies that validate the necessity of each component of the proposed method, further strengthening the experimental findings.
While the paper provides detailed descriptions of the methodology and experimental setup, it lacks specific implementation details and code availability, which could hinder reproducibility. The absence of a demo or project URL further complicates efforts to replicate the results.
One limitation is the reliance on subjective evaluations, which can introduce variability based on annotator interpretation. Additionally, the paper does not address potential biases in the dataset construction process, which could affect the generalizability of the findings. The hybrid reward formulation, while innovative, may also lead to complexities in tuning the reward parameters effectively.
The advancements in expressive TTS systems have significant implications for various applications, including gaming, virtual assistants, and interactive storytelling. By improving the ability of TTS systems to maintain stylistic consistency, this work could enhance user engagement and experience in interactive media.
Recent speech enhancement (SE) models increasingly leverage self-supervised learning (SSL) representations for their rich semantic information. Typically, intermediate features are aggregated into a single representation via a lightweight adaptation module. However, most SSL models are not trained for noise robustness, which can lead to corrupted semantic representations. Moreover, the adaptation module is trained jointly with the SE model, potentially prioritizing acoustic details over semantic information, contradicting the original purpose. To address this issue, we first analyze the behavior of SSL models on noisy speech from an information-theoretic perspective. Specifically, we measure the mutual information (MI) between the corrupted SSL representations and the corresponding phoneme labels, focusing on preservation of linguistic contents. Building upon this analysis, we introduce the linguistic aggregation layer, which is pre-trained to maximize MI with phoneme labels (with optional dynamic aggregation) and then frozen during SE training. Experiments show that this decoupled approach improves Word Error Rate (WER) over jointly optimized baselines, demonstrating the benefit of explicitly aligning the adaptation module with linguistic contents.
Primary: unknown
All Institutions: unknown
This paper presents a comprehensive analysis of the aggregation of speech representations in enhancement tasks, proposing a novel linguistic aggregation layer that significantly improves performance in noisy conditions. The integration of information theory into the methodology and the empirical validation of results contribute to advancing the field of speech processing and enhancement.
The paper introduces a novel approach to speech enhancement by leveraging mutual information (MI) to analyze and optimize the aggregation of self-supervised learning (SSL) representations. The proposed linguistic aggregation layer is pre-trained to maximize MI with phoneme labels, which is a significant departure from conventional methods that prioritize acoustic fidelity. This decoupled approach allows for a more effective preservation of linguistic content, particularly in noisy conditions, showcasing a thoughtful integration of information theory with practical model architecture.
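A minimal sketch of such a linguistic aggregation layer is given below: learnable weights over SSL layer outputs, pre-trained with a phoneme-prediction objective as a simple proxy for the MI criterion and then frozen. The paper's actual MI estimator and aggregation scheme may differ.

```python
import torch
import torch.nn as nn

class LinguisticAggregator(nn.Module):
    """Learnable weighted sum over SSL layer outputs, pre-trained to predict
    phoneme labels (a common proxy for maximizing MI with the phonemes) and
    then frozen before SE training. Schematic sketch only."""
    def __init__(self, num_layers: int, dim: int, num_phonemes: int):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.phoneme_head = nn.Linear(dim, num_phonemes)   # used only during pre-training

    def forward(self, layer_feats: torch.Tensor) -> torch.Tensor:
        # layer_feats: (num_layers, batch, time, dim)
        w = torch.softmax(self.layer_logits, dim=0).view(-1, 1, 1, 1)
        return (w * layer_feats).sum(dim=0)                # (batch, time, dim)

    def pretrain_loss(self, layer_feats: torch.Tensor, phonemes: torch.Tensor) -> torch.Tensor:
        agg = self.forward(layer_feats)                    # (batch, time, dim)
        logits = self.phoneme_head(agg)                    # (batch, time, num_phonemes)
        return nn.functional.cross_entropy(logits.flatten(0, 1), phonemes.flatten())
```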
The experiments are well-structured, utilizing established datasets such as VoiceBank-DEMAND and LibriSpeech to validate the proposed methods. The results demonstrate clear improvements in Word Error Rate (WER) compared to jointly optimized baselines, indicating that the linguistic-first approach is effective. The analysis of MI across different layers and conditions provides a solid foundation for the experimental claims, though further exploration of various SSL models could enhance robustness.
The paper provides sufficient details regarding the experimental setup, including the training process and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work could benefit from sharing the implementation and data to facilitate independent validation of results.
One limitation is the potential trade-off between acoustic fidelity and linguistic preservation, as noted in the results. The paper also does not explore the scalability of the proposed methods across different languages or dialects, which could impact generalizability. Additionally, the dynamic aggregation strategy, while promising, may require further refinement to maximize performance gains.
The findings have significant implications for speech enhancement applications, particularly in environments with background noise. By improving the robustness of speech representations, this work could enhance communication technologies, assistive devices, and voice recognition systems, ultimately benefiting users in diverse settings.
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding. Our code is available at https://github.com/NKU-HLT/DIFFA.git.
Primary: Meituan
All Institutions: Meituan
The main contribution of this paper is the introduction of DIFFA-2, a diffusion-based large audio language model that significantly enhances audio understanding capabilities through innovative training methodologies and architectures. This work represents a meaningful step forward in the field of audio processing and understanding, showcasing the potential of diffusion models in a domain traditionally dominated by autoregressive approaches.
The methodology is robust, introducing a four-stage training curriculum that effectively combines semantic and acoustic alignment, large-scale supervised fine-tuning, and preference optimization. The dual-adapter architecture and the use of a frozen Whisper encoder are innovative, allowing for effective audio understanding. The paper also employs variance-reduced preference optimization, which is a notable contribution to the training process of diffusion models.
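A schematic sketch of a dual-adapter front end of the kind described, with separate semantic and acoustic projections of a frozen speech encoder's features into the LLM embedding space, is shown below; the dimensions and fusion scheme are assumptions, not DIFFA-2's actual design.

```python
import torch
import torch.nn as nn

class DualAdapterSketch(nn.Module):
    """Schematic dual-adapter front end: features from a frozen speech encoder
    are projected by separate semantic and acoustic adapters into the LLM
    embedding space. Dimensions and the fusion scheme are illustrative."""
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.semantic = nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(),
                                      nn.Linear(llm_dim, llm_dim))
        self.acoustic = nn.Sequential(nn.Conv1d(enc_dim, llm_dim, kernel_size=3, padding=1),
                                      nn.GELU())

    def forward(self, enc_feats: torch.Tensor) -> torch.Tensor:
        # enc_feats: (batch, time, enc_dim), e.g. from a frozen speech encoder
        sem = self.semantic(enc_feats)                               # (batch, time, llm_dim)
        aco = self.acoustic(enc_feats.transpose(1, 2)).transpose(1, 2)
        return torch.cat([sem, aco], dim=1)                          # concatenate along time
```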
The experiments are comprehensive, utilizing multiple benchmarks (MMSU, MMAU, MMAR) to evaluate the model's performance across various dimensions of audio understanding. The results indicate that DIFFA-2 consistently outperforms its predecessor and competes well with strong autoregressive models, demonstrating the effectiveness of the proposed methods.
The paper provides sufficient details about the training and inference setup, including the datasets used and the training pipeline. However, the reproducibility could be enhanced with more explicit descriptions of hyperparameters and model configurations.
The paper acknowledges limitations in its training focus, particularly regarding conversational and alignment-style supervision, which affects performance on dialogue-centric benchmarks. Additionally, the model's performance on mixed-modality tasks is not as strong, indicating areas for improvement.
The advancements in audio understanding through DIFFA-2 have significant implications for applications in interactive voice assistants, audio analysis, and multimedia content understanding. The open-sourcing of the code and training pipeline also promotes further research in this area.
We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.
Primary: unknown
All Institutions: unknown
This paper presents a significant advancement in the encoding of spatial audio through a novel neural architecture that leverages cross-attention mechanisms and directional ATFs, demonstrating strong performance in challenging acoustic environments. The methodology and results contribute meaningfully to the field of audio processing and spatial audio technologies.
The paper introduces a novel deep neural network architecture that effectively encodes microphone array signals into Ambisonics using directional array transfer functions (ATFs) and cross-attention mechanisms. The separation of encoders for audio and directional responses is a significant methodological advancement, allowing for the generation of array-independent spatial audio representations. The use of cross-attention to combine features from different modalities is well-justified and aligns with contemporary trends in multi-modal learning. However, the paper could benefit from a clearer explanation of the architecture's design choices and the rationale behind specific hyperparameter selections.
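The cross-attention fusion can be illustrated with a short sketch in which audio-encoder features attend to encoded array transfer functions; the dimensions and residual arrangement are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ArrayConditioner(nn.Module):
    """Schematic cross-attention fusion: audio-encoder features (queries)
    attend to encoded directional array transfer functions (keys/values),
    yielding an array-independent representation. Illustrative dimensions."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats: torch.Tensor, atf_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, dim); atf_feats: (batch, num_directions, dim)
        fused, _ = self.attn(query=audio_feats, key=atf_feats, value=atf_feats)
        return self.norm(audio_feats + fused)    # residual connection

cond = ArrayConditioner()
out = cond(torch.randn(2, 100, 256), torch.randn(2, 64, 256))  # -> (2, 100, 256)
```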
The evaluation of the proposed method is thorough, utilizing simulated data across two distinct environments: a mobile phone scenario with body scattering and a free-field condition. The comparative analysis against traditional DSP methods and existing neural solutions is robust, demonstrating clear performance improvements in terms of scale-invariant signal-to-distortion ratio (SI-SDR) and other Ambisonics metrics. The results are well-presented, though additional qualitative assessments, such as listening tests, would strengthen the findings.
The paper provides a detailed description of the experimental setup, including data generation, training procedures, and evaluation metrics. However, the absence of a publicly accessible code repository or demo limits reproducibility. Future work should include sharing the implementation to facilitate validation and further exploration by the community.
One limitation is the reliance on simulated data, which may not fully capture the complexities of real-world scenarios. Additionally, while the model shows promising results, its generalization capabilities to various real-world microphone configurations and environments remain to be thoroughly tested. The paper also mentions that the model's performance could be enhanced by increasing the learning capacity of the encoders and decoder, indicating potential avenues for future research.
The proposed method has significant implications for spatial audio applications, particularly in immersive communication and virtual/extended reality environments. By improving the encoding of microphone array signals, this work could enhance user experiences in various consumer devices, making it relevant for industries focused on audio technology and immersive media. The ability to generalize across different microphone configurations also opens up possibilities for broader adoption in diverse applications.
To advance immersive communication, the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge recently introduced Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). An S5 system takes a multi-channel audio mixture as input and outputs single-channel dry sources along with their corresponding class labels. Although the DCASE 2025 Challenge simplifies the task by constraining class labels in each mixture to be mutually exclusive, real-world mixtures frequently contain multiple sources from the same class. The presence of duplicated labels can significantly degrade the performance of the label-queried source separation (LQSS) model, which is the key component of many existing S5 systems, and can also limit the validity of the official evaluation metric of DCASE 2025 Task 4. To address these issues, we propose a class-aware permutation-invariant loss function that enables the LQSS model to handle queries involving duplicated labels. In addition, we redesign the S5 evaluation metric to eliminate ambiguities caused by these same-class sources. To evaluate the proposed method within the S5 system, we extend the label prediction model to support same-class labels. Experimental results demonstrate the effectiveness of the proposed methods and the robustness of the new metric on mixtures both with and without same-class sources.
Primary: unknown
All Institutions: JST Strategic International Collaborative Research Program (SICORP)
This paper presents a novel approach to handling duplicated labels in sound source separation, significantly improving the performance of systems designed for complex audio environments. The technical contributions are well-articulated, and the proposed methodologies could set a new standard in the field of audio processing and immersive communication.
The paper proposes a class-aware permutation-invariant loss function that effectively addresses the challenges posed by duplicated labels in sound source separation tasks. The methodology is well-structured, introducing modifications to existing models and metrics to enhance performance in real-world scenarios where multiple sources from the same class are present. The approach is innovative in its use of permutation-invariant training tailored to the specific context of audio segmentation, which is a significant advancement over traditional methods that do not account for label duplication.
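A minimal sketch of a class-aware permutation-invariant loss is given below: estimated sources are permuted only within groups sharing the same queried label, and the best assignment per group is kept. MSE stands in here for the paper's actual separation criterion.

```python
from itertools import permutations
import torch

def class_aware_pit_loss(est: torch.Tensor, ref: torch.Tensor, labels: list) -> torch.Tensor:
    """Permutation-invariant loss restricted to groups of sources that share
    the same queried class label. Schematic: MSE replaces the actual criterion."""
    total = est.new_zeros(())
    for lab in set(labels):
        idx = [i for i, l in enumerate(labels) if l == lab]
        best = None
        for perm in permutations(idx):          # permute estimates within the class group
            loss = sum(torch.mean((est[j] - ref[i]) ** 2) for i, j in zip(idx, perm))
            best = loss if best is None else torch.minimum(best, loss)
        total = total + best
    return total / len(labels)

# Two "dog" queries and one "siren": only the two dog estimates are permuted.
est, ref = torch.randn(3, 16000), torch.randn(3, 16000)
print(class_aware_pit_loss(est, ref, ["dog", "dog", "siren"]))
```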
The experiments are comprehensive, utilizing a well-defined dataset that simulates real-world conditions. The authors provide a detailed analysis of the performance of their proposed system compared to existing methods, demonstrating significant improvements in handling same-class sources. However, the paper could benefit from additional comparisons with more diverse models and datasets to further validate the robustness of the proposed approach.
The paper mentions that the source code will be released as part of the baseline system for the DCASE 2026 Challenge, which is a positive step towards reproducibility. However, the lack of specific URLs for the code repository and demo limits the immediate accessibility of the implementation details.
The paper acknowledges that the performance of the audio tagging model is still limited when estimating the number of sources and their labels simultaneously, particularly in the presence of multiple sources from the same class. Additionally, the reliance on oracle labels during training may not fully reflect real-world applications where such labels are not available.
The proposed methods have significant implications for immersive communication technologies and audio processing applications, particularly in environments where multiple sound sources coexist. The advancements in sound source separation could enhance user experiences in virtual and augmented reality applications, as well as improve accessibility in audio-based communication systems.