Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as input, performs localization, and produces robust spatial embeddings. We demonstrate that the Gemma 3n LLM can be fine-tuned to reason over "Spatial Audio Tokens" produced by PhaseCoder. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.
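To make the interface described in this abstract concrete, here is a minimal geometry-agnostic encoder sketch in PyTorch. All shapes, layer sizes, and module names are illustrative assumptions rather than PhaseCoder's actual architecture; the point is only the contract: a variable number of channels plus per-microphone 3-D coordinates go in, and a fixed-length sequence of spatial audio tokens comes out.

```python
import torch
import torch.nn as nn

class SpatialEncoderSketch(nn.Module):
    """Illustrative geometry-agnostic spatial audio encoder (NOT PhaseCoder):
    consumes C-channel audio plus C microphone coordinates, emits N spatial tokens."""

    def __init__(self, n_fft=512, d_model=256, n_tokens=16):
        super().__init__()
        freq_bins = n_fft // 2 + 1
        # Per-channel complex spectrum (real+imag, to keep phase) + 3-D mic position
        self.channel_proj = nn.Linear(2 * freq_bins + 3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.query = nn.Parameter(torch.randn(n_tokens, d_model))
        self.pool = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.n_fft = n_fft

    def forward(self, wav, mic_xyz):
        # wav: (B, C, samples); mic_xyz: (B, C, 3) microphone coordinates in metres
        B, C, _ = wav.shape
        spec = torch.stft(wav.reshape(B * C, -1), self.n_fft,
                          window=torch.hann_window(self.n_fft, device=wav.device),
                          return_complex=True)                    # (B*C, F, T)
        feat = torch.cat([spec.real, spec.imag], dim=1)           # phase preserved
        feat = feat.mean(dim=-1).reshape(B, C, -1)                # crude time pooling
        x = torch.cat([feat, mic_xyz], dim=-1)                    # attach geometry
        x = self.encoder(self.channel_proj(x))                    # attend across channels
        q = self.query.expand(B, -1, -1)
        tokens, _ = self.pool(q, x, x)                            # (B, n_tokens, d_model)
        return tokens                                             # "spatial audio tokens"
```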
Primary: Google DeepMind
All Institutions: Google DeepMind
The paper presents a pioneering approach to spatial audio understanding for multimodal LLMs, significantly advancing the field by enabling robust reasoning over spatial audio tokens. The combination of innovative methodology and thorough experimental evaluation positions this work as a critical contribution to the intersection of audio processing and language models.
The methodology is robust, introducing PhaseCoder as a transformer-only spatial audio encoder that is microphone geometry-agnostic. The authors effectively leverage raw multichannel audio and microphone coordinates to produce spatial embeddings, which is a significant advancement over existing methods that are limited by fixed geometries. The use of a two-stage training strategy and synthetic data generation is well-justified, addressing the lack of real-world datasets. The architecture, including positional embeddings and the integration with the Gemma 3n LLM, is thoughtfully designed to enhance spatial reasoning capabilities.
The experimental evaluation is thorough, with clear benchmarks against state-of-the-art models like GI-DOAEnet. The results demonstrate that PhaseCoder achieves competitive performance on localization tasks, even outperforming existing models in certain scenarios. The evaluation of the fine-tuned LLM on spatial reasoning tasks is particularly noteworthy, showcasing the model's ability to handle complex queries related to spatial audio understanding. However, the reliance on synthetic datasets may raise questions about generalizability.
The paper provides detailed implementation details, including training configurations, data generation processes, and model architecture. While the methodology is well-documented, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing these resources to facilitate further research and validation.
The primary limitations include the assumption of static sources and the focus on single-speaker scenarios, which may not fully capture the complexities of real-world environments. Additionally, the model's performance could be impacted by the lack of explicit modeling of acoustic properties and dynamic sources. Future iterations should address these aspects to enhance robustness and applicability.
This work has significant implications for various applications, including assistive technologies for the hearing-impaired, improving human-robot interaction, and advancing embodied AI systems. By enabling spatial audio understanding across diverse devices, it promotes accessibility and adaptability in AI technologies, potentially transforming how users interact with their environments.
The emergence of Large Audio-Language Models (LALMs) has advanced Speech Emotion Recognition (SER), but their size limits deployment in resource-constrained environments. While Knowledge Distillation is effective for LALM compression, existing methods leave distillation of the cross-modal projection module (Projector) underexplored and often struggle with alignment due to differences in feature dimensions. We propose PL-Distill, a KD framework that combines Projector-Level Distillation (PDist) to align audio embeddings and Logits-Level Distillation (LDist) to align output logits. PDist introduces Attention-weighted Centered Kernel Alignment, a novel mechanism that highlights important time steps and addresses dimension mismatches. Meanwhile, LDist minimizes the Kullback-Leibler divergence between teacher and student logits from audio and text modalities. On IEMOCAP, RAVDESS, and SAVEE, PL-Distill compresses an 8.4B-parameter teacher to a compact 1.1B-parameter student, consistently outperforming the teacher, state-of-the-art pretrained models, and other KD baselines across all metrics.
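The two distillation terms named above can be sketched roughly as follows, as a simplified single-utterance illustration with assumed shapes (not the authors' implementation): a linear-CKA term compares teacher and student audio embeddings despite different feature dimensions, with attention weights emphasizing important time steps, while a temperature-scaled KL term aligns the output logits.

```python
import torch
import torch.nn.functional as F

def attention_weighted_cka(teacher, student, attn):
    """teacher: (T, d_t), student: (T, d_s), attn: (T,) non-negative weights.
    Linear CKA is dimension-agnostic, so d_t != d_s is fine; the weights
    emphasize important time steps before the comparison."""
    w = (attn / attn.sum()).unsqueeze(-1)
    t = teacher * w
    s = student * w
    t = t - t.mean(dim=0, keepdim=True)          # center features
    s = s - s.mean(dim=0, keepdim=True)
    hsic_ts = (t.T @ s).pow(2).sum()
    hsic_tt = (t.T @ t).pow(2).sum()
    hsic_ss = (s.T @ s).pow(2).sum()
    return hsic_ts / (hsic_tt.sqrt() * hsic_ss.sqrt() + 1e-8)

def logits_kl(teacher_logits, student_logits, tau=2.0):
    """Temperature-scaled KL divergence between teacher and student logits."""
    p_t = F.softmax(teacher_logits / tau, dim=-1)
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

def pl_distill_loss(t_emb, s_emb, attn, t_logits, s_logits,
                    lambda_p=1.0, lambda_l=1.0):
    # Loss weights lambda_p / lambda_l are placeholders, not values from the paper.
    return (lambda_p * (1.0 - attention_weighted_cka(t_emb, s_emb, attn))
            + lambda_l * logits_kl(t_logits, s_logits))
```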
Primary: Harbin Institute of Technology
All Institutions: Harbin Institute of Technology, Ping An Technology (Shenzhen) Co
The paper presents a novel knowledge distillation framework, PL-Distill, that effectively compresses large audio-language models for speech emotion recognition while maintaining high performance. The innovative methodologies and comprehensive experimental evaluations contribute significantly to the advancement of knowledge distillation techniques in multimodal machine learning.
The proposed PL-Distill framework introduces a dual-level knowledge distillation approach that effectively addresses the challenges of distilling large audio-language models for speech emotion recognition. The incorporation of Attention-weighted Centered Kernel Alignment (AwCKA) is particularly innovative, as it dynamically prioritizes important audio tokens based on attention scores, thereby enhancing the alignment of audio embeddings despite dimensional mismatches. This methodological advancement is well-justified in the context of previous work and represents a significant contribution to the field of knowledge distillation in multimodal models.
The experimental evaluation is robust, utilizing three widely recognized datasets (IEMOCAP, RAVDESS, and SAVEE) to validate the effectiveness of the proposed method. The results demonstrate that PL-Distill not only compresses the teacher model significantly but also outperforms both the teacher and state-of-the-art models across all metrics. The ablation studies further substantiate the contributions of each component of the framework, providing a clear understanding of the impact of the proposed methods.
The paper provides detailed descriptions of the model architecture, training strategies, and evaluation metrics, which are essential for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully generalize to other SER tasks or datasets. Additionally, while the method shows promise, the computational efficiency of the distillation process itself could be further explored to ensure practical applicability in real-world scenarios.
The implications of this research extend beyond speech emotion recognition, as the PL-Distill framework could be adapted for various audio-language tasks, potentially improving the efficiency of deploying large models in resource-constrained environments. The focus on effective knowledge transfer in multimodal contexts may also inspire future research in related areas.
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
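Since the headline numbers here are biased error rates, a brief sketch of one common definition may help (this follows the convention used in the contextual-biasing literature; CALM's exact scoring script may differ): B-WER counts errors only on reference words that appear in the biasing list, with insertions of biased words also charged to it, while the remaining words contribute to the unbiased rate.

```python
def align(ref, hyp):
    """Levenshtein alignment; returns (op, ref_word, hyp_word) with
    op in {'ok', 'sub', 'del', 'ins'}."""
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    ops, i, j = [], R, H
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ops.append(('ok' if ref[i - 1] == hyp[j - 1] else 'sub', ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(('del', ref[i - 1], None)); i -= 1
        else:
            ops.append(('ins', None, hyp[j - 1])); j -= 1
    return ops

def biased_wer(ref, hyp, bias_list):
    """Errors on reference words from the biasing list, normalized by their count."""
    bias = set(bias_list)
    errs = sum(1 for op, r, h in align(ref, hyp)
               if op != 'ok' and (r in bias or (op == 'ins' and h in bias)))
    return errs / max(sum(1 for w in ref if w in bias), 1)

print(biased_wer("turn on the xylophone".split(),
                 "turn on the telephone".split(),
                 bias_list=["xylophone"]))   # 1.0: the rare biased word was missed
```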
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Honda Research Institute Japan
The paper presents CALM, a pioneering framework that effectively combines acoustic and linguistic cues for improved multi-speaker ASR performance. This comprehensive analysis highlights the framework's innovative methodology, rigorous experimental validation, and potential impact on the field of speech recognition.
The proposed CALM framework introduces a novel joint Contextual Acoustic-Linguistic Modeling approach for multi-speaker ASR, integrating target-speaker conditioning with dynamic vocabulary expansion. This end-to-end framework leverages speaker embeddings for target-speaker extraction and contextual biasing, addressing both acoustic and linguistic challenges in overlapping speech scenarios. The methodology is well-structured, employing advanced techniques such as Conformer and Transformer architectures, and includes a comprehensive loss function that combines multiple objectives to enhance performance.
The experiments are robust, utilizing multiple datasets (LibriSpeechMix, CSJMix, AMI) to validate the effectiveness of CALM across different languages and conditions. The reported results demonstrate substantial improvements in biased and unbiased word error rates, showcasing the framework's ability to enhance ASR performance in multi-speaker contexts. The use of various biasing list sizes and the detailed analysis of results provide a thorough evaluation of the framework's capabilities.
The paper provides sufficient implementation details, including architecture specifications, training procedures, and evaluation metrics. However, the lack of a public repository or demo URL limits the ease of reproducibility for external researchers. Clearer guidelines or access to the code would enhance the paper's reproducibility.
While CALM shows promising results, the paper acknowledges challenges such as increased insertion errors in conversational datasets like AMI, particularly for short utterances. The reliance on enrollment utterances may also limit practical applications in real-world scenarios where such data may not be readily available. Additionally, the performance degradation observed in certain conditions suggests that further optimization is needed for broader applicability.
The integration of acoustic and linguistic modeling in CALM has significant implications for personalized AI applications, particularly in multi-speaker ASR settings such as meetings and discussions. The advancements made could lead to more accurate transcription services, enhancing accessibility and usability in various domains, including education, business, and healthcare.
We propose HuPER, a human-inspired framework that models phonetic perception as adaptive inference over acoustic-phonetic evidence and linguistic knowledge. With only 100 hours of training data, HuPER achieves state-of-the-art phonetic error rates on five English benchmarks and strong zero-shot transfer to 95 unseen languages. HuPER is also the first framework to enable adaptive, multi-path phonetic perception under diverse acoustic conditions. All training data, models, and code are open-sourced. Code and demo are available at https://github.com/HuPER29/HuPER.
Primary: University of California, Berkeley
All Institutions: Zhejiang University, University of California, Berkeley
HuPER presents a novel framework for phonetic perception that integrates adaptive inference with acoustic and linguistic knowledge, achieving state-of-the-art performance with limited training data. The methodology is robust, and the implications for practical applications in speech technology are substantial, marking a significant advancement in the field.
The methodology proposed in HuPER is innovative as it frames phonetic perception as adaptive inference, integrating both acoustic-phonetic evidence and linguistic knowledge. The four-stage training pipeline is well-structured, starting from a small annotated corpus and leveraging a larger transcript-only corpus for pseudo-label generation. The use of a Corrector model to learn edit operations is particularly noteworthy, as it enhances the robustness of the phonetic recognizer. This adaptive approach allows for multi-path phonetic perception under varying acoustic conditions, which is a significant advancement in the field.
The experiments conducted are comprehensive, with the framework achieving state-of-the-art phonetic error rates across five English benchmarks and demonstrating strong zero-shot transfer capabilities to 95 unseen languages. The choice of datasets and benchmarks appears appropriate for validating the performance claims. However, more detailed comparisons with existing state-of-the-art methods would strengthen the evaluation.
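For context, the phonetic error rate reported here is the phone-level analogue of WER; the snippet below is the standard Levenshtein formulation rather than code from the HuPER repository.

```python
def phone_error_rate(ref_phones, hyp_phones):
    """Standard PER: (substitutions + deletions + insertions) / len(ref)."""
    R, H = len(ref_phones), len(hyp_phones)
    prev = list(range(H + 1))
    for i in range(1, R + 1):
        cur = [i] + [0] * H
        for j in range(1, H + 1):
            cost = 0 if ref_phones[i - 1] == hyp_phones[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[H] / max(R, 1)

# e.g. reference /k ae t/ vs hypothesis /k ah t/ -> one substitution, PER = 1/3
print(phone_error_rate(["k", "ae", "t"], ["k", "ah", "t"]))  # 0.333...
```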
The authors have made all training data, models, and code open-sourced, which is commendable and enhances reproducibility. The provided GitHub repository allows other researchers to replicate the experiments and build upon the work. However, additional documentation on the training process and hyperparameter settings would further facilitate reproducibility.
One limitation of the study is the reliance on the initial small human-annotated corpus, which may not capture the full diversity of phonetic variations across different languages. Additionally, while the zero-shot transfer to 95 languages is impressive, the paper does not provide extensive analysis on the performance across these languages, which could vary significantly in phonetic structure.
The potential applications of HuPER are vast, particularly in assistive technologies for education, healthcare, and accessibility. By improving the reliability of phonetic representations, the framework could lead to more effective communication tools for diverse populations. The work also lays a foundation for future developments in speech generation systems, making it a significant contribution to the field of speech and language technologies.
Lip-to-speech synthesis aims to generate speech audio directly from silent facial video by reconstructing linguistic content from lip movements, providing valuable applications in situations where audio signals are unavailable or degraded. While recent diffusion-based models such as LipVoicer have demonstrated impressive performance in reconstructing linguistic content, they often lack prosodic consistency. In this work, we propose LipSody, a lip-to-speech framework enhanced for prosody consistency. LipSody introduces a prosody-guiding strategy that leverages three complementary cues: speaker identity extracted from facial images, linguistic content derived from lip movements, and emotional context inferred from face video. Experimental results demonstrate that LipSody substantially improves prosody-related metrics, including global and local pitch deviations, energy consistency, and speaker similarity, compared to prior approaches.
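The prosody metrics listed above can be sketched as follows, assuming F0 contours have already been extracted (e.g. with librosa.pyin) and that generated and reference utterances are time-aligned; the paper's exact metric definitions may differ.

```python
import numpy as np

def pitch_deviation_metrics(f0_ref, f0_gen):
    """f0_ref, f0_gen: equal-length F0 contours in Hz, 0 or NaN for unvoiced frames.
    Global deviation compares utterance-level pitch statistics; local deviation
    compares the frame-wise contour on mutually voiced frames."""
    f0_ref, f0_gen = np.nan_to_num(f0_ref), np.nan_to_num(f0_gen)
    voiced = (f0_ref > 0) & (f0_gen > 0)
    ref, gen = np.log(f0_ref[voiced]), np.log(f0_gen[voiced])   # log-F0 scale
    global_dev = abs(ref.mean() - gen.mean())                   # overall pitch offset
    local_dev = np.sqrt(np.mean((ref - gen) ** 2))              # contour RMSE
    return global_dev, local_dev

# Energy consistency can be measured analogously on frame-wise RMS energy,
# and speaker similarity via cosine similarity of speaker embeddings.
```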
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this work is the introduction of LipSody, a novel lip-to-speech synthesis framework that enhances prosody consistency through a multi-faceted approach to visual input. This paper represents a meaningful advancement in the field of audio synthesis, providing a robust methodology and comprehensive evaluation that could influence future research and applications in multimodal speech synthesis.
The methodology presented in LipSody is innovative, leveraging a diffusion-based framework to enhance prosody consistency in lip-to-speech synthesis. The authors introduce a novel prosody-guiding strategy that integrates speaker identity, linguistic content, and emotional context, which is a significant advancement over previous models that primarily focused on intelligibility. The use of complementary cues for prosody estimation is a thoughtful approach that enhances the model's ability to generate more natural and expressive speech. The architecture is well-structured, utilizing established deep learning techniques while introducing new components like the Emotion Encoder to refine prosody prediction.
The experimental evaluation is thorough, utilizing a large dataset (LRS3) and employing both objective and subjective metrics to assess performance. The results demonstrate significant improvements in prosody-related metrics compared to prior models, while maintaining intelligibility. The use of statistical tests to validate the significance of improvements adds rigor to the findings. However, the paper could benefit from additional comparisons with more recent models beyond LipVoicer to contextualize its contributions further.
The paper provides detailed implementation specifics, including model architecture, training protocols, and evaluation metrics, which support reproducibility. The authors mention using publicly available codebases for components like the Emotion Encoder and vocoder, which enhances the potential for others to replicate their work. However, the lack of a publicly available code repository for the entire LipSody framework limits full reproducibility.
One limitation is the reliance on the LRS3 dataset, which may not encompass the full diversity of lip movements and emotional expressions found in real-world scenarios. Additionally, while the model shows improvements in prosody consistency, the subjective evaluations indicate that the differences in naturalness are not statistically significant, suggesting that further enhancements could be explored. The model's performance in diverse acoustic environments or with different speaker demographics remains untested.
LipSody has significant potential applications in areas such as assistive technologies for the hearing impaired, silent communication tools, and enhancing multimedia content accessibility. The ability to generate expressive and personalized speech from visual input could also benefit virtual avatars and gaming industries, where realistic character interactions are crucial. The advancements in prosody consistency could lead to more engaging and relatable AI-generated speech, fostering better human-computer interactions.
Supervised speech enhancement methods have been very successful. However, clean speech is often unavailable in practical scenarios, so self-supervised learning-based (SSL) speech enhancement methods are desired that offer comparable enhancement performance and can be applied to other speech-related downstream applications. In this work, we develop a masked autoencoder based universal speech enhancer that is agnostic to the type of distortion affecting speech, can handle multiple distortions simultaneously, and is trained in a self-supervised manner. An augmentation stack adds further distortions to the noisy input data. The masked autoencoder model learns to remove the added distortions along with reconstructing the masked regions of the spectrogram during pre-training. The pre-trained embeddings are then used by fine-tuning models trained on a small amount of paired data for specific downstream tasks. We evaluate the pre-trained features for denoising and dereverberation downstream tasks. We explore different augmentations (like single or multi-speaker) in the pre-training augmentation stack and the effect of different noisy input feature representations (like $log1p$ compression) on pre-trained embeddings and downstream fine-tuning enhancement performance. We show that the proposed method not only outperforms the baseline but also achieves state-of-the-art performance for both in-domain and out-of-domain evaluation datasets.
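A minimal sketch of the pre-training objective described above, with assumed shapes, masking scheme, and loss (illustrative, not the authors' code): the augmented view is log1p-compressed, random frames are masked, and the model must reconstruct the original noisy spectrogram, i.e. both in-paint the masked regions and undo the added distortions.

```python
import torch
import torch.nn.functional as F

def log1p_compress(mag_spec):
    """The log1p compression mentioned in the abstract: log(1 + |X|)."""
    return torch.log1p(mag_spec)

def ssl_pretrain_loss(model, noisy_spec, augmented_spec, mask_ratio=0.5):
    """noisy_spec: (B, F, T) magnitude spectrogram of the unpaired noisy input.
    augmented_spec: the same utterance after the augmentation stack adds further
    distortions. No clean speech is needed: the reconstruction target is the
    original noisy view, so the model learns to in-paint and de-distort."""
    x = log1p_compress(augmented_spec)
    target = log1p_compress(noisy_spec)
    B, Freq, T = x.shape
    # Mask random time frames (time-frequency patch masks would work equally well)
    mask = (torch.rand(B, 1, T, device=x.device) < mask_ratio).expand(B, Freq, T)
    pred = model(x.masked_fill(mask, 0.0))        # any encoder-decoder fits here
    return F.mse_loss(pred, target)
```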
Primary: University of Illinois, Urbana-Champaign
All Institutions: University of Illinois, Urbana-Champaign, AWS AI Labs
The main contribution of this paper is the development of a masked autoencoder framework for universal speech enhancement that effectively handles multiple distortions through self-supervised learning. This work presents a novel approach that not only advances the state of the art in speech enhancement but also opens avenues for further research in self-supervised learning applications in audio processing.
The paper introduces a masked autoencoder framework for speech enhancement that is both self-supervised and capable of handling various distortions. The methodology is well-structured, leveraging an augmentation stack to introduce additional noise, which is a clever approach to pre-training. The dual focus on denoising and dereverberation tasks demonstrates versatility. However, the paper could benefit from a more thorough comparison with existing methods beyond the baseline, as well as a clearer explanation of the specific architecture choices made in the masked autoencoder.
The experiments are comprehensive, evaluating the model on both in-domain and out-of-domain datasets, which is crucial for assessing generalizability. The results indicate that the proposed method achieves state-of-the-art performance, which is a significant contribution. However, the paper lacks detailed statistical analysis of the results, such as confidence intervals or significance testing, which would strengthen the claims made.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. While the methodology is described, the absence of a clear protocol for reproducing the results limits the ability of other researchers to validate the findings.
One limitation is the reliance on a small amount of paired data for fine-tuning, which may not be feasible in all practical scenarios. Additionally, the paper does not address potential biases in the datasets used for evaluation, which could affect the generalizability of the results.
The proposed method has significant implications for real-world applications in speech enhancement, particularly in environments with varying types of noise. The ability to enhance speech across different distortions makes it a valuable tool for improving communication in challenging acoustic settings, such as in teleconferencing or assistive technologies for the hearing impaired.
Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper proposes a novel parallel generative speech enhancement (ParaGSE) framework that leverages a group vector quantization (GVQ)-based neural speech codec. The GVQ-based codec adopts separate VQs to produce mutually independent tokens, enabling efficient parallel token prediction in ParaGSE. Specifically, ParaGSE leverages the GVQ-based codec to encode degraded speech into distinct tokens, predicts the corresponding clean tokens through parallel branches conditioned on degraded spectral features, and ultimately reconstructs clean speech via the codec decoder. Experimental results demonstrate that ParaGSE consistently produces superior enhanced speech compared to both discriminative and generative baselines, under a wide range of distortions including noise, reverberation, band-limiting, and their mixtures. Furthermore, empowered by parallel computation in token prediction, ParaGSE attains about a 1.5-fold improvement in generation efficiency on CPU compared with serial generative speech enhancement approaches.
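The group vector quantization idea can be sketched as below, with invented sizes and module names (this is not the authors' codec): the latent is split into groups, each group is quantized against its own codebook, and the resulting independent token streams are what allows ParaGSE-style parallel per-group prediction.

```python
import torch
import torch.nn as nn

class GroupVQSketch(nn.Module):
    """Illustrative group vector quantizer: split the latent into G groups and
    quantize each against its own codebook, yielding one token stream per group.
    (A straight-through estimator and commitment loss would be needed to train it.)"""

    def __init__(self, dim=256, groups=4, codebook_size=1024):
        super().__init__()
        assert dim % groups == 0
        self.groups, self.gdim = groups, dim // groups
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.gdim) for _ in range(groups))

    def forward(self, z):
        # z: (B, T, dim) latent from the codec encoder
        tokens, quantized = [], []
        for g, cb in enumerate(self.codebooks):
            zg = z[..., g * self.gdim:(g + 1) * self.gdim]          # (B, T, gdim)
            codes = cb.weight.unsqueeze(0).expand(zg.shape[0], -1, -1)
            idx = torch.cdist(zg, codes).argmin(dim=-1)             # nearest code ids
            tokens.append(idx)
            quantized.append(cb(idx))
        # tokens: (B, G, T) ids; quantized: (B, T, dim) reconstruction of z
        return torch.stack(tokens, dim=1), torch.cat(quantized, dim=-1)

# In an enhancement setup like ParaGSE, G parallel branches would each map
# degraded spectral features to the clean token stream of one group, and the
# codec decoder would then reconstruct the waveform from the predicted tokens.
```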
Primary: University of Science and Technology of China
All Institutions: National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China
The paper presents ParaGSE, a novel framework for parallel generative speech enhancement that leverages a GVQ-based neural speech codec to achieve significant improvements in speech quality and processing efficiency. The technical contributions are substantial, addressing key challenges in the field and demonstrating the potential for practical applications in real-world scenarios.
The proposed methodology, ParaGSE, introduces a novel framework for generative speech enhancement that utilizes a group vector quantization (GVQ)-based neural speech codec. This approach is innovative in its use of separate VQs for independent token generation, which facilitates efficient parallel computation. The architecture is well-structured, employing a combination of convolutional layers, BiLSTM, and Conformer blocks to extract features and predict clean tokens. The methodology is sound, with a clear explanation of the components and their interactions, although it could benefit from more detailed comparisons with existing methods in terms of computational complexity.
The experimental evaluation is robust, featuring a comprehensive set of experiments that assess the performance of ParaGSE against various baseline models across multiple distortion types. The paper includes both objective and subjective metrics, providing a well-rounded view of the model's effectiveness. The dataset construction is thorough, utilizing real-world noise and reverberation conditions, which enhances the relevance of the findings. However, the paper could improve by including more detailed statistical analyses of the results and discussing the significance of the findings more explicitly.
The paper provides sufficient implementation details, including the architecture, training criteria, and experimental setup, which aids in reproducibility. The availability of codes and speech samples on the provided URL is a positive aspect, although the lack of a direct GitHub repository may limit accessibility for some researchers.
One limitation is the potential complexity of the model, which may hinder deployment in real-time applications. Additionally, while the paper claims efficiency improvements, it does not provide a detailed comparison of the computational costs associated with the proposed method versus the baselines, which could be crucial for practical applications. There is also a noted performance gap in intrusive metrics like LSD compared to discriminative models, which could be a concern for certain applications.
The proposed ParaGSE framework has significant potential for real-world applications in speech enhancement, particularly in environments with various distortions. Its efficiency and ability to produce high-quality speech restoration could benefit communication technologies, assistive listening devices, and speech recognition systems. The advancements in generative models for speech enhancement also contribute to the broader field of audio processing and machine learning.
Existing generative models for unsupervised anomalous sound detection are limited by their inability to fully capture the complex feature distribution of normal sounds, while the potential of powerful diffusion models in this domain remains largely unexplored. To address this challenge, we propose a novel framework, TLDiffGAN, which consists of two complementary branches. One branch incorporates a latent diffusion model into the GAN generator for adversarial training, thereby making the discriminator's task more challenging and improving the quality of generated samples. The other branch leverages pretrained audio model encoders to extract features directly from raw audio waveforms for auxiliary discrimination. This framework effectively captures feature representations of normal sounds from both raw audio and Mel spectrograms. Moreover, we introduce a TMixup spectrogram augmentation technique to enhance sensitivity to subtle and localized temporal patterns that are often overlooked. Extensive experiments on the DCASE 2020 Challenge Task 2 dataset demonstrate the superior detection performance of TLDiffGAN, as well as its strong capability in anomalous time-frequency localization.
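The abstract does not spell out TMixup; one plausible reading, sketched below purely as an assumption, is a mixup restricted to a random temporal segment of the Mel spectrogram, which would push the model toward the subtle, localized temporal patterns the authors mention.

```python
import torch

def temporal_mixup(spec_a, spec_b, alpha=0.4, seg_frac=0.25):
    """Hypothetical time-localized mixup: blend only a random temporal segment
    of spec_a with the corresponding segment of spec_b (specs: (..., F, T)).
    This is an illustrative guess at 'TMixup', not the paper's definition."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    T = spec_a.shape[-1]
    seg = max(1, int(seg_frac * T))
    start = torch.randint(0, T - seg + 1, (1,)).item()
    mixed = spec_a.clone()
    mixed[..., start:start + seg] = (lam * spec_a[..., start:start + seg]
                                     + (1 - lam) * spec_b[..., start:start + seg])
    return mixed, lam
```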
Primary: Tsinghua University
All Institutions: Tsinghua University, Dalian Maritime University, Shenzhen International Graduate School
The main contribution of this paper is the introduction of TLDiffGAN, a novel framework that integrates latent diffusion models with GANs for improved anomalous sound detection. This work significantly advances the state of the art in the field by addressing key limitations of existing generative models and demonstrating superior performance through rigorous experimental validation.
The proposed TLDiffGAN framework innovatively combines latent diffusion models with GANs to enhance the quality of generated spectrograms for anomalous sound detection. The dual-branch architecture effectively integrates features from both raw audio and Mel spectrograms, addressing the limitations of traditional single-modality approaches. The introduction of the TMixup technique to augment temporal features is a significant methodological advancement, enhancing the model's sensitivity to subtle anomalies. However, the complexity of the model may pose challenges in terms of interpretability and practical deployment.
The experiments conducted on the DCASE 2020 Challenge Task 2 dataset are extensive and demonstrate a clear improvement over existing methods in terms of AUC and pAUC metrics. The comparative analysis with other state-of-the-art methods provides strong evidence for the effectiveness of TLDiffGAN. The ablation studies further validate the contributions of each component, reinforcing the robustness of the proposed framework.
The paper provides detailed implementation details, including network configurations, training protocols, and evaluation metrics, which support reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which others can replicate the results.
One limitation is the reliance on a specific dataset (DCASE 2020) for evaluation, which may not fully capture the generalizability of the model across different domains or types of anomalous sounds. Additionally, the model's complexity could lead to challenges in real-time applications, particularly in resource-constrained environments.
The framework has significant implications for industrial applications, particularly in predictive maintenance and monitoring of machinery, where timely detection of anomalies can prevent failures and reduce downtime. The ability to localize anomalies in the time-frequency domain enhances interpretability, which is crucial for practitioners in the field.
Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors treat layers independently and overlook temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (1.93% and 6.87% EER), improving over independent layer weighting by 36.6% and 22.5% respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation to cross-domain generation techniques and recording conditions.
Primary: University of Melbourne
All Institutions: University of Melbourne
The main contribution of this paper is the introduction of HierCon, a hierarchical contrastive attention framework that significantly improves audio deepfake detection by effectively modeling temporal and inter-layer dependencies, thereby achieving state-of-the-art performance on benchmark datasets. This work represents a meaningful advancement in the field, addressing critical challenges in distinguishing between real and synthetic audio.
The paper introduces HierCon, a novel hierarchical layer attention framework that effectively captures temporal and inter-layer dependencies in audio deepfake detection. The methodology is well-structured, employing a three-stage attention mechanism that enhances the model's ability to discern subtle differences between real and synthetic audio. The integration of margin-based contrastive learning is particularly noteworthy, as it encourages the model to develop domain-invariant embeddings, thereby improving generalization across various deepfake generation techniques. The detailed explanation of the attention mechanism and the loss functions used provides a solid foundation for understanding the proposed approach.
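A rough sketch of what such a three-stage hierarchy over SSL layer outputs can look like (assumed shapes and plain linear attention scores; not the authors' implementation): attention first pools temporal frames within each layer, then layers within a group, then the layer groups themselves.

```python
import torch
import torch.nn as nn

class HierLayerAttentionSketch(nn.Module):
    """Illustrative frames -> layers-within-group -> groups attention pooling
    over the hidden states of a self-supervised speech model."""

    def __init__(self, n_layers=24, dim=1024, n_groups=4):
        super().__init__()
        assert n_layers % n_groups == 0
        self.n_groups, self.per_group = n_groups, n_layers // n_groups
        self.frame_attn = nn.Linear(dim, 1)    # stage 1: temporal frames
        self.layer_attn = nn.Linear(dim, 1)    # stage 2: layers within a group
        self.group_attn = nn.Linear(dim, 1)    # stage 3: layer groups

    def forward(self, layer_feats):
        # layer_feats: (B, L, T, D) hidden states from all SSL layers
        B, L, T, D = layer_feats.shape
        w_t = self.frame_attn(layer_feats).softmax(dim=2)      # weights over frames
        layer_vecs = (w_t * layer_feats).sum(dim=2)            # (B, L, D)
        groups = layer_vecs.view(B, self.n_groups, self.per_group, D)
        w_l = self.layer_attn(groups).softmax(dim=2)           # weights over layers
        group_vecs = (w_l * groups).sum(dim=2)                 # (B, G, D)
        w_g = self.group_attn(group_vecs).softmax(dim=1)       # weights over groups
        return (w_g * group_vecs).sum(dim=1)                   # (B, D) utterance embedding
```

The margin-based contrastive term described in the abstract would then operate on these utterance embeddings, pulling bona fide samples together and pushing spoofed ones beyond a margin.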
The authors conduct thorough experiments on multiple datasets, including ASVspoof 2021 DF and In-the-Wild, demonstrating significant improvements over existing methods. The reported results, including Equal Error Rates (EER), clearly indicate the effectiveness of HierCon, achieving state-of-the-art performance. The inclusion of ablation studies further strengthens the findings, allowing for a clear understanding of the contributions of hierarchical attention and contrastive learning to the overall performance.
While the paper provides a comprehensive description of the methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a demo or project URL also limits the ability for others to validate the findings independently.
One limitation of the study is the reliance on specific datasets for evaluation, which may not fully capture the diversity of real-world audio deepfakes. Additionally, while the hierarchical attention mechanism is promising, the complexity of the model may pose challenges in terms of computational efficiency and scalability for real-time applications.
The implications of this research are significant, particularly in the context of security and online trust, as audio deepfakes pose increasing risks in various domains, including voice authentication and digital forensics. The proposed method has the potential to enhance the robustness of detection systems, contributing to the development of more secure communication technologies.
Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensive and prone to overfitting, while parameter-efficient methods like LoRA apply adaptation uniformly across layers, overlooking internal representations and thus compromising effectiveness and efficiency. We analyze multilingual ASR models and reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Building on this observation, we propose DAMA, a Depth-Aware Model Adaptation framework that allocates adaptation capacity according to each layer's role. DAMA also introduces Singular Value Decomposition (SVD)-based initialization to constrain adaptation and preserve the U-shaped pattern, as well as a frozen middle-layer basis for further efficiency. Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves a 29% error reduction under extreme data scarcity, and significantly improves memory, training time, and computational efficiency over baselines. These results highlight the benefits of structure-aware adaptation for efficient, scalable multilingual ASR.
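The U-shaped allocation and SVD-based initialization can be illustrated roughly as follows; the rank schedule, numbers, and PiSSA-style initialization here are assumptions for exposition, not the schedule or formulas used by DAMA.

```python
import torch

def u_shaped_ranks(n_layers, r_max=32, r_min=4):
    """Give early and late layers more adaptation capacity than middle layers."""
    mid = (n_layers - 1) / 2
    return [int(round(r_min + (r_max - r_min) * abs(i - mid) / mid))
            for i in range(n_layers)]

def svd_init_lora(weight, rank):
    """Initialize LoRA factors from the top singular directions of the frozen
    weight, constraining the adapter to the layer's dominant subspace.
    (PiSSA-style sketch; such schemes usually also subtract the extracted
    component from the frozen weight so the model is unchanged at step 0.)"""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    B = U[:, :rank] * S[:rank].sqrt()              # (out_dim, rank)
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]   # (rank, in_dim)
    return A, B

ranks = u_shaped_ranks(n_layers=24)
print(ranks[0], ranks[12], ranks[-1])   # large, small, large -> the U-shape
```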
Primary: unknown
All Institutions: unknown
The paper presents a novel adaptation framework for multilingual speech recognition that leverages a structured analysis of layer-wise adaptability, significantly improving efficiency and performance in low-resource language settings. The comprehensive evaluation of the proposed methodology and its implications for the field highlight its potential to advance speech technology accessibility.
The proposed Depth-Aware Model Adaptation (DAMA) framework introduces a novel approach to multilingual ASR by analyzing layer-wise adaptability and implementing a U-shaped adaptability pattern. This structured adaptation strategy effectively allocates training resources, enhancing efficiency and performance in low-resource language scenarios. The integration of SVD-based initialization and Basis-Protected Projection further solidifies the method's robustness, allowing for effective adaptation while preserving essential language-agnostic representations.
The experiments conducted on 18 low-resource languages using two benchmark datasets (Common Voice and FLEURS) demonstrate the effectiveness of DAMA. The results indicate that DAMA not only matches or surpasses state-of-the-art performance but also significantly reduces the number of trainable parameters and computational costs. The thorough evaluation across different languages and settings adds credibility to the findings, showcasing the framework's adaptability and efficiency.
The paper provides detailed implementation details, including the datasets used, experimental setup, and hyperparameter settings, which facilitate reproducibility. However, the lack of a publicly available code repository limits the ease of replication for external researchers.
While the study reveals significant findings, it is limited to 18 languages, and the generalizability of the U-shaped adaptability pattern across even more diverse languages remains to be tested. Additionally, the method is optimized for low-resource settings, which may not translate to high-resource scenarios without further adjustments.
The findings have the potential to significantly enhance multilingual speech recognition technologies, particularly for low-resource languages, thereby promoting inclusivity in speech technology applications. This could lead to broader accessibility and usability of speech recognition systems in diverse linguistic contexts.
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
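The distillation objective described here reduces to a fairly standard contrastive alignment against frozen text embeddings; a minimal sketch follows, assuming batched species-matched pairs, a shared (or linearly projected) embedding dimension, and a symmetric InfoNCE form that may differ from the released code.

```python
import torch
import torch.nn.functional as F

def contrastive_distill_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb: (B, D) from the trainable audio encoder (BioLingual side).
    text_emb: (B, D) frozen text embeddings of the matching species from the
    image-text model (BioCLIP-2 side). Symmetric InfoNCE pulls each recording
    toward its species' text embedding; no image is ever used in training."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1).detach()     # teacher text space stays fixed
    logits = a @ t.T / temperature                 # (B, B) similarity matrix
    labels = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))
```

At retrieval time, audio embeddings are compared directly against image embeddings from the image-text model; because both now live near the same text anchors, the audio-image alignment described above emerges without paired supervision.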
Primary: Inria, LIRMM, Université de Montpellier
All Institutions: Inria, LIRMM, Université de Montpellier, Earth Species Project, University of Kassel
The main contribution of this paper is the introduction of a novel contrastive distillation method for audio-to-image retrieval that effectively utilizes text as a semantic intermediary, significantly advancing the field of bioacoustic species recognition. The technical contributions are substantial, providing a practical solution to a challenging problem in a data-scarce environment, and the methodology is both innovative and well-executed, with promising experimental results.
The methodology presented in this paper is innovative as it proposes a contrastive distillation approach to bridge audio and image modalities without requiring paired data. By leveraging a pretrained image-text model (BioCLIP-2) to enhance the audio-text model (BioLingual), the authors effectively create a semantic intermediary that facilitates meaningful audio-to-image retrieval. The use of a contrastive objective for fine-tuning the audio encoder is well-justified and demonstrates a clear understanding of the underlying challenges in cross-modal representation learning. The simplicity of the approach, which avoids complex multi-objective training and direct image supervision, is a significant strength.
The experiments are robust, utilizing multiple bioacoustic benchmarks to validate the effectiveness of the proposed method. The results indicate that the distilled audio encoder not only improves audio-to-image retrieval performance but also preserves the discriminative capabilities of the audio model. The comparisons against various baselines, including zero-shot and text-embedding mapping strategies, provide a comprehensive evaluation of the method's effectiveness. The use of independent datasets for validation strengthens the credibility of the findings.
The paper mentions that the code will be publicly available after review, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics, such as hyperparameter settings, training duration, and computational resources, which are essential for other researchers to replicate the experiments fully.
One limitation of the study is the reliance on the quality and representativeness of the textual descriptions used for training the audio encoder. If the textual descriptions are not sufficiently diverse or comprehensive, it may impact the generalization of the model. Additionally, while the approach demonstrates strong performance on the evaluated datasets, its applicability to other domains or species not represented in the training data remains uncertain.
The implications of this research are significant for biodiversity monitoring and conservation efforts, particularly in scenarios where paired audio-image data is scarce. By enabling effective audio-to-image retrieval, the proposed method can assist researchers and conservationists in identifying species based on audio recordings, thus enhancing ecological studies and wildlife conservation strategies.
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs compared to the SOTA lightweight model PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: https://anonymous.4open.science/r/dvpd_demo-E630
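One way to picture a frequency-adaptive non-uniform compression is sketched below; this fixed pooling scheme is only an illustrative interpretation of FANC (the paper's encoder is learned), kept here to show the keep-low-frequencies, prune-high-frequencies idea.

```python
import torch
import torch.nn.functional as F

def nonuniform_freq_compress(spec, split_frac=0.25, high_pool=4):
    """spec: (B, F, T) magnitude spectrogram. Keep the lowest `split_frac` of
    bins at full resolution (where harmonics concentrate) and average-pool the
    remaining high-frequency bins by a factor of `high_pool`."""
    B, Freq, T = spec.shape
    split = int(split_frac * Freq)
    low = spec[:, :split]                                            # untouched
    high = F.avg_pool2d(spec[:, split:].unsqueeze(1),
                        kernel_size=(high_pool, 1)).squeeze(1)       # pooled
    return torch.cat([low, high], dim=1)                             # fewer bins

# e.g. a 257-bin spectrogram with split_frac=0.25 and high_pool=4 shrinks to
# 64 + 48 = 112 bins while preserving low-frequency harmonic detail.
```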
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology, Tsinghua University, Sun Yat-sen University
The paper presents a significant contribution to the field of speech enhancement by introducing a novel dual-view approach that balances high-fidelity speech quality with computational efficiency. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on future research and applications in audio processing.
The proposed Dual-View Predictive Diffusion (DVPD) model introduces a novel approach to speech enhancement by leveraging the dual nature of spectrograms as both visual textures and physical frequency-domain representations. The methodology includes the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which effectively preserves critical low-frequency harmonics while reducing high-frequency redundancies, and the Lightweight Image-based Spectro-Awareness (LISA) module, which captures features from a visual perspective. The Training-free Lossless Boost (TLB) strategy further enhances the model's performance during inference without additional training, showcasing a well-thought-out integration of predictive and generative paradigms.
The experiments are extensive, covering various benchmarks including WSJ0-UNI and VBDMD, demonstrating the model's effectiveness across different distortion scenarios. The results indicate that DVPD achieves state-of-the-art performance while significantly reducing computational complexity compared to existing models. The comprehensive evaluation metrics used, such as PESQ and ESTOI, provide a robust assessment of the model's capabilities.
The paper includes detailed implementation details, including training configurations, loss functions, and evaluation metrics, which are essential for reproducibility. However, the absence of a public code repository limits the ease of reproduction for other researchers.
While the model demonstrates impressive performance, it may still struggle with certain types of distortions not covered in the training datasets. Additionally, the reliance on specific hyperparameters for the TLB strategy may introduce variability in performance across different applications.
The advancements presented in this paper have significant implications for real-world applications in speech enhancement, particularly in noisy environments. The lightweight nature of the model makes it suitable for deployment in resource-constrained settings, potentially benefiting various industries, including telecommunications and assistive technologies.
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic -- complemented by strict intelligibility and duration constraints -- we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
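The reinforcement-learning component can be sketched generically as follows; the group-relative advantage is the standard GRPO form, while the composite reward is an assumed stand-in for the paper's TTS-as-implicit-critic reward with intelligibility and duration constraints.

```python
import torch

def group_relative_advantages(rewards):
    """GRPO-style advantages: each sampled edit is scored relative to the
    mean/std of its own group of candidates, so no learned value critic is needed."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

def self_consistency_reward(tts_logprob, intelligible, duration_ok, penalty=-10.0):
    """Assumed composite reward: likelihood of the edited segment under a
    pre-trained TTS model (the implicit critic), with a hard penalty whenever
    the intelligibility or duration constraints fail."""
    ok = intelligible & duration_ok
    return torch.where(ok, tts_logprob, torch.full_like(tts_logprob, penalty))

# Example: a group of four candidate edits sampled for one utterance
rewards = self_consistency_reward(
    tts_logprob=torch.tensor([-1.2, -0.8, -2.5, -1.0]),
    intelligible=torch.tensor([True, True, False, True]),
    duration_ok=torch.tensor([True, True, True, True]))
print(group_relative_advantages(rewards))   # the unintelligible candidate is penalized
```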
Primary: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences
All Institutions: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, Department of Automation, Tsinghua University, Beijing National Research Center for Information Science and Technology, Tsinghua University
The paper presents a novel framework for imperceptible text-based speech editing that effectively separates content modification from acoustic reconstruction. This approach significantly advances the state of the art, addressing key challenges in speech editing and offering promising applications across multiple domains.
The proposed methodology introduces a novel framework for text-based speech editing that effectively decouples semantic content from acoustic features, addressing the limitations of existing methods that often lead to artifacts and instability. The use of a Flow Matching decoder for acoustic reconstruction and a Self-Consistency Rewards mechanism for perceptual alignment is innovative and well-justified, leveraging a pre-trained TTS model as an implicit critic. This dual-stage approach enhances both intelligibility and naturalness, making significant strides in the field of speech editing.
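As an illustration of the group-relative reward idea, the sketch below computes GRPO-style advantages over a group of sampled edits; the reward values are placeholders standing in for the TTS-critic likelihood and intelligibility/duration constraints described above, and the paper's exact formulation may differ.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: score each sampled edit relative to its group.

    rewards: array of shape (num_prompts, group_size) holding the scalar
    reward for each candidate edit.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)   # positive -> better than group average

# One prompt, four sampled semantic-token edits scored by a critic (placeholder values).
print(group_relative_advantages([[0.2, 0.8, 0.5, 0.5]]))
```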
The experiments are comprehensive, utilizing a large-scale dataset (Libriheavy) and rigorous benchmarks for evaluation. The authors provide detailed comparisons against state-of-the-art models, demonstrating clear improvements in metrics such as WER, speaker similarity, and perceptual quality. The use of both objective and subjective metrics strengthens the evaluation, although further details on the statistical significance of results would enhance the robustness of the findings.
The paper includes sufficient implementation details, including training configurations and the architecture of the models used. However, the absence of a publicly available code repository limits full reproducibility. Providing access to the code and trained models would significantly enhance the paper's impact and allow for independent verification of results.
While the proposed method shows strong performance, the paper does not address potential limitations in terms of computational efficiency or the scalability of the approach to diverse languages or dialects. Additionally, the reliance on a pre-trained TTS model may introduce biases based on the training data used for that model.
The implications of this research are significant for various applications, including media production, accessibility technologies, and real-time speech editing in communication tools. The ability to edit speech seamlessly could enhance user experience and efficiency in numerous fields, from entertainment to education.
High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform reconstruction. As a result, when operating at ultra-low bitrates, these methods degrade rapidly and often fail to preserve essential information, leading to severe acoustic artifacts and pronounced semantic distortion. To overcome these limitations, we introduce Generative Audio Compression (GAC), a novel paradigm shift from signal fidelity to task-oriented effectiveness. Implemented within the AI Flow framework, GAC is theoretically grounded in the Law of Information Capacity. These foundations posit that abundant computational power can be leveraged at the receiver to offset extreme communication bottlenecks--exemplifying the More Computation, Less Bandwidth philosophy. By integrating semantic understanding at the transmitter with scalable generative synthesis at the receiver, GAC offloads the information burden to powerful model priors. Our 1.8B-parameter model achieves high-fidelity reconstruction of 32kHz general audio at an unprecedented bitrate of 0.275kbps. Even at 0.175kbps, it still preserves a strong intelligible audio transmission capability, which represents an approximately 3000x compression ratio, significantly outperforming current state-of-the-art neural codecs in maintaining both perceptual quality and semantic consistency.
Primary: Institute of Artificial Intelligence, China Telecom
All Institutions: Institute of Artificial Intelligence, China Telecom
The paper introduces a novel paradigm for audio compression that prioritizes semantic understanding and generative synthesis, achieving unprecedented performance at ultra-low bitrates. This work not only advances the state-of-the-art in audio compression but also opens new avenues for research in generative models and communication theory.
The proposed Generative Audio Compression (GAC) method represents a significant shift from traditional audio compression techniques by focusing on task-oriented effectiveness rather than pure signal fidelity. The integration of semantic understanding at the transmitter and generative synthesis at the receiver is a novel approach that leverages the Law of Information Capacity to optimize the trade-off between computation and bandwidth. The methodology is well-grounded in theoretical frameworks and employs advanced techniques such as latent-variable modeling and variational objectives, showcasing a comprehensive understanding of both audio processing and machine learning principles.
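The headline compression ratio can be sanity-checked with simple arithmetic, assuming a 32 kHz, 16-bit mono PCM baseline (an assumption, since the reference bitrate is not restated here):

```python
# Rough compression-ratio arithmetic behind the ~3000x claim (assumes the
# uncompressed reference is 32 kHz, 16-bit mono PCM; the paper does not
# restate this baseline explicitly).
pcm_bitrate_kbps = 32_000 * 16 / 1000      # 512 kbps raw PCM
for gac_kbps in (0.275, 0.175):
    print(f"{gac_kbps} kbps -> {pcm_bitrate_kbps / gac_kbps:.0f}x compression")
# 0.275 kbps -> ~1862x, 0.175 kbps -> ~2926x (about 3000x)
```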
The experiments are robust, covering multiple audio domains (speech, general sound, and music) and employing both objective and subjective evaluation metrics. The results demonstrate GAC's superior performance in maintaining perceptual quality and semantic consistency at extremely low bitrates, significantly outperforming existing state-of-the-art methods. The use of diverse datasets and thorough evaluation metrics strengthens the credibility of the findings.
While the paper provides a detailed description of the methodology and experimental setup, it lacks explicit implementation details or links to code repositories, which could hinder reproducibility. The absence of a demo or project URL further limits the ability for others to replicate the results.
One notable limitation is the trade-off between perceptual quality and speaker identity preservation at lower bitrates, which could affect applications requiring high fidelity in speaker recognition. Additionally, the reliance on large model sizes may limit practical deployment in resource-constrained environments.
The implications of GAC are significant for applications in low-bandwidth communication and generative audio-language modeling, potentially transforming how audio is transmitted and processed in various contexts. The approach could lead to advancements in telecommunication, streaming services, and assistive technologies, making high-quality audio accessible even in challenging bandwidth scenarios.
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter noisy reference audios, imperfect text prompts, and diverse downstream processing, which can significantly hurt robustness. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline, including input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models. Our evaluation uncovers substantial robustness gaps in VC: performance can deteriorate sharply under common input shifts and post-processing; long-context and cross-lingual scenarios further expose stability limitations; and both passive noise and proactive perturbation influence generation robustness. Collectively, these findings provide a unified picture of how current VC models fail in practice and introduce a standardized, open-source testbed to support the development of more robust and deployable VC models. We open-source our project at https://github.com/Nanboy-Ronan/RVCBench.
Primary: The University of British Columbia
All Institutions: The University of British Columbia, Vector Institute
The main contribution of this paper is the introduction of RVCBench, a comprehensive benchmark for evaluating the robustness of voice cloning models under realistic conditions. This work significantly advances the understanding of the limitations of current voice cloning technologies and provides a valuable resource for future research aimed at improving their robustness and applicability.
The paper introduces RVCBench, a benchmark designed to evaluate the robustness of voice cloning models across various challenges. The methodology is comprehensive, covering a wide range of robustness tasks and including a significant dataset of 225 speakers and over 14,000 utterances. The authors systematically assess the performance of 11 modern voice cloning models under different conditions, which is a valuable approach to understanding the limitations of current technology. However, the paper could benefit from a more detailed explanation of how the robustness tasks were selected and the specific metrics used for evaluation.
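To illustrate the kind of input-variation task such a benchmark applies, the sketch below corrupts a reference clip with white noise at a target SNR before it would be handed to a cloning model; RVCBench's actual perturbation suite is far broader, and the signal and SNR sweep here are placeholders.

```python
import numpy as np

def add_noise_at_snr(clean, snr_db, rng=None):
    """Corrupt a reference waveform with white noise at a target SNR.

    One of the simplest input-shift perturbations a robustness benchmark can
    apply before passing the reference audio to a voice-cloning model.
    """
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(len(clean))
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Sweep a few SNR levels; a real harness would re-run cloning and scoring here.
reference = np.random.default_rng(1).standard_normal(16000)  # placeholder 1 s clip
for snr in (20, 10, 5, 0):
    noisy = add_noise_at_snr(reference, snr)
    print(snr, "dB ->", round(float(np.mean(noisy ** 2)), 3))
```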
The experiments are well-structured, with a clear focus on identifying performance gaps in voice cloning models under realistic conditions. The inclusion of various input variations and adversarial perturbations is a strong point, as it reflects real-world challenges. The results highlight significant robustness issues, which are crucial for advancing the field. However, the paper lacks a comparative analysis with existing benchmarks, which would strengthen its contributions.
The paper mentions that the project is open-sourced, which is a positive aspect for reproducibility. However, it lacks detailed implementation instructions or specific configurations used during experiments, which could hinder other researchers from replicating the results effectively.
One limitation is the potential bias in the selection of speakers and utterances, which may not represent the full diversity of voice characteristics in the real world. Additionally, while the benchmark covers various robustness tasks, it may not encompass all possible deployment scenarios that could affect voice cloning performance.
The findings of this paper have significant implications for the development of more robust voice cloning technologies, which could enhance applications in personalized speech interfaces and dubbing. By identifying and addressing robustness gaps, the research can contribute to safer and more reliable deployment of voice cloning systems in real-world applications.
We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints -- scaling from short loops to 10-minute compositions -- while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities -- such as cover generation, repainting, and vocal-to-BGM conversion -- while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of ACE-Step v1.5, an efficient open-source music generation model that combines novel architectural elements with user-friendly personalization features. This work significantly advances the state of music generation technology, particularly for consumer hardware, while raising important questions regarding reproducibility and ethical implications in the field.
The methodology introduces a hybrid architecture that combines a Language Model (LM) with a Diffusion Transformer (DiT) to generate music. The use of intrinsic reinforcement learning to align the LM's planning capabilities with the DiT's synthesis process is a notable innovation. The model's ability to generate music based on simple user queries and to personalize outputs with minimal input data is a significant advancement in the field of music generation. However, the paper could benefit from a more detailed explanation of the reinforcement learning mechanism and how it mitigates biases.
The paper claims that ACE-Step v1.5 achieves superior performance on commonly used evaluation metrics compared to existing commercial models. The reported generation times are impressive, especially for consumer hardware, and the ability to run on low VRAM is a practical advantage. However, the paper lacks detailed experimental results, including quantitative comparisons with baseline models, which would strengthen the claims made about performance and efficiency.
The availability of code, model weights, and a demo is a positive aspect, promoting reproducibility. However, the paper does not provide sufficient details on the training process, dataset specifics, or evaluation metrics used, which are crucial for other researchers to replicate the results effectively.
One limitation is the lack of extensive evaluation on diverse datasets to validate the model's performance across various music genres and styles. Additionally, the reliance on intrinsic reinforcement learning may limit the model's adaptability to more complex user preferences that external reward models could capture. The paper also does not address potential ethical considerations regarding music generation and copyright issues.
The potential applications of ACE-Step v1.5 are vast, ranging from aiding music artists in their creative processes to providing tools for content creators. Its ability to generate high-quality music quickly and with low resource requirements could democratize music production, making it accessible to a broader audience. However, the implications of AI-generated music on the music industry and artist livelihoods should be carefully considered.
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio which was trained on a huge dataset $\sim$500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibited remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing purity of supervised signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://shandaai.github.io/Hive.
Primary: Tsinghua University
All Institutions: Tsinghua University, Shanda AI Research, Johns Hopkins University, Chinese Institute for Brain Research
The main contribution of this paper is the introduction of Hive, a high-quality synthetic dataset for query-based universal sound separation, which demonstrates that prioritizing data purity can lead to significant improvements in model performance with reduced computational costs. The comprehensive methodology and experimental validation provide a strong foundation for future research in audio separation and related fields.
The paper presents a novel automated pipeline for data cleaning and synthesis, addressing the critical issue of co-occurrence in audio datasets. The authors propose a comprehensive approach that includes ontology reconstruction, semantic-acoustic alignment, and a semantically consistent mixing strategy. This methodology is well-structured and demonstrates a clear understanding of the challenges in query-based universal sound separation (USS). The use of multimodal large models for semantic filtering is particularly innovative, as it enhances the purity of the training data, which is crucial for effective model training.
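A toy version of the semantically consistent mixing step might look like the sketch below, which refuses same-class pairs and mixes two single-event clips at a chosen SNR; the real pipeline additionally performs ontology reconstruction and model-based purity filtering, and the labels and signals here are placeholders.

```python
import numpy as np

def mix_events(target, background, target_label, background_label, snr_db):
    """Mix two single-event clips at a chosen SNR, refusing same-class pairs.

    A toy stand-in for a semantically consistent mixing step; Hive's pipeline
    also applies ontology-based filtering and purity checks not shown here.
    """
    if target_label == background_label:
        raise ValueError("co-occurring same-class events are excluded")
    n = min(len(target), len(background))
    target, background = target[:n], background[:n]
    t_pow = np.mean(target ** 2) + 1e-12
    b_pow = np.mean(background ** 2) + 1e-12
    gain = np.sqrt(t_pow / (b_pow * 10 ** (snr_db / 10)))
    return target + gain * background

rng = np.random.default_rng(0)
mix = mix_events(rng.standard_normal(32000), rng.standard_normal(32000),
                 "dog_bark", "siren", snr_db=5)
print(mix.shape)  # (32000,)
```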
The experimental results are robust, showcasing the effectiveness of the Hive dataset compared to existing large-scale datasets. The authors provide thorough evaluations using multiple models, demonstrating competitive performance in separation accuracy and perceptual quality. The zero-shot generalization capabilities of models trained on Hive further validate the dataset's utility. However, while the results are promising, the paper could benefit from additional comparative analyses with more diverse datasets to strengthen the claims.
The paper includes detailed implementation details and provides access to the dataset and code, which enhances reproducibility. The authors specify the training configurations and evaluation metrics used, allowing other researchers to replicate the experiments. However, the reliance on specific multimodal models for semantic alignment may limit reproducibility if those models are not widely accessible.
One notable limitation is the potential for bias in the automated pipeline, as it relies on model-based decisions that may propagate existing biases in the training data. Additionally, while the Hive dataset is designed to mitigate co-occurrence noise, it may not fully capture the complexities of real-world acoustic environments. The authors also acknowledge the ethical implications of their work, particularly concerning privacy and misuse of the technology.
The proposed methodology and dataset have significant implications for advancing computational auditory scene analysis and making robust auditory models more accessible. The focus on data efficiency could democratize AI applications in areas like immersive audio and assistive listening. However, the potential for misuse of the technology raises ethical concerns that need to be addressed through responsible deployment and usage guidelines.
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Honda Research Institute Japan
The paper presents CALM, a pioneering framework that effectively combines acoustic and linguistic cues for improved multi-speaker ASR performance. This comprehensive analysis highlights the framework's innovative methodology, rigorous experimental validation, and potential impact on the field of speech recognition.
The proposed CALM framework introduces a novel joint Contextual Acoustic-Linguistic Modeling approach for multi-speaker ASR, integrating target-speaker conditioning with dynamic vocabulary expansion. This end-to-end framework leverages speaker embeddings for target-speaker extraction and contextual biasing, addressing both acoustic and linguistic challenges in overlapping speech scenarios. The methodology is well-structured, employing advanced techniques such as Conformer and Transformer architectures, and includes a comprehensive loss function that combines multiple objectives to enhance performance.
The experiments are robust, utilizing multiple datasets (LibriSpeechMix, CSJMix, AMI) to validate the effectiveness of CALM across different languages and conditions. The reported results demonstrate substantial improvements in biased and unbiased word error rates, showcasing the framework's ability to enhance ASR performance in multi-speaker contexts. The use of various biasing list sizes and the detailed analysis of results provide a thorough evaluation of the framework's capabilities.
The paper provides sufficient implementation details, including architecture specifications, training procedures, and evaluation metrics. However, the lack of a public repository or demo URL limits the ease of reproducibility for external researchers. Clearer guidelines or access to the code would enhance the paper's reproducibility.
While CALM shows promising results, the paper acknowledges challenges such as increased insertion errors in conversational datasets like AMI, particularly for short utterances. The reliance on enrollment utterances may also limit practical applications in real-world scenarios where such data may not be readily available. Additionally, the performance degradation observed in certain conditions suggests that further optimization is needed for broader applicability.
The integration of acoustic and linguistic modeling in CALM has significant implications for personalized AI applications, particularly in multi-speaker ASR settings such as meetings and discussions. The advancements made could lead to more accurate transcription services, enhancing accessibility and usability in various domains, including education, business, and healthcare.
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. These segments are interleaved with label sequences during training, enabling seamless integration with the LLM. During inference, the audio stream is buffered until the MoChA module triggers a read signal, at which point the buffered segment together with the previous token is fed into the LLM for the next token prediction. We also introduce a minimal-latency training objective to guide the policy network toward accurate segmentation boundaries. Furthermore, we adopt a joint training strategy in which a non-streaming LLM-ASR model and our streaming model share parameters. Experiments on the AISHELL-1 and AISHELL-2 Mandarin benchmarks demonstrate that our method consistently outperforms recent streaming ASR baselines, achieving character error rates of 5.1% and 5.5%, respectively. The latency optimization results in a 62.5% reduction in average token generation delay with negligible impact on recognition accuracy.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Shaanxi Normal University, iFLYTEK Co, iFLYTEK Research
This paper presents a novel approach to streaming speech recognition that integrates large language models with advanced segmentation techniques, significantly improving both latency and accuracy in ASR systems. The comprehensive methodology and strong experimental results position this work as a meaningful contribution to the field of machine learning and speech recognition.
The proposed methodology leverages a read/write policy network integrated with monotonic chunkwise attention (MoChA) to facilitate real-time streaming ASR. This innovative approach allows for dynamic segmentation of audio inputs, which is a significant advancement over traditional methods that often rely on fixed-size audio chunks. The introduction of a minimal-latency training objective to optimize the segmentation boundaries is particularly noteworthy, as it addresses a critical challenge in streaming ASR systems. The joint training strategy that shares parameters between streaming and non-streaming models is also a clever way to enhance efficiency and performance.
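The buffer-until-read inference loop can be sketched as below; the policy and LLM step functions are hypothetical stand-ins for the MoChA read/write network and the decoder-only LLM, not the paper's actual interfaces.

```python
def stream_decode(frames, policy, llm_step, prev_token="<sos>"):
    """Minimal sketch of buffer-until-read streaming decoding.

    `policy(buffer)` stands in for the MoChA read/write network and returns
    True when the buffered speech embeddings form a complete segment;
    `llm_step(segment, prev_token)` stands in for one LLM prediction step.
    Both callables are placeholders.
    """
    buffer, outputs = [], []
    for frame in frames:              # frames arrive incrementally
        buffer.append(frame)
        if policy(buffer):            # "read" signal: segment boundary found
            token = llm_step(list(buffer), prev_token)
            outputs.append(token)
            prev_token = token
            buffer.clear()            # start accumulating the next segment
    return outputs

# Toy run: emit a token every 4 frames.
print(stream_decode(range(12), lambda b: len(b) == 4,
                    lambda seg, prev: f"tok@{seg[-1]}"))
```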
The experiments conducted on the AISHELL-1 and AISHELL-2 Mandarin benchmarks are comprehensive and demonstrate the effectiveness of the proposed method. The reported character error rates (CER) of 5.1% and 5.5% are competitive, and the significant reduction in average token generation delay (62.5%) highlights the practical benefits of the approach. The use of ablation studies to validate the contributions of different components of the model adds rigor to the experimental evaluation.
The paper provides sufficient details regarding the model architecture, training strategy, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results.
One limitation of the study is the focus on Mandarin datasets, which may restrict the generalizability of the findings to other languages or dialects. Additionally, while the model shows promising results, the trade-off between latency and accuracy could be further explored, particularly in more diverse real-world scenarios.
The advancements in streaming ASR have significant implications for applications such as real-time transcription, live captioning, and interactive voice response systems. The ability to reduce latency while maintaining accuracy can enhance user experience in various settings, including education, customer service, and accessibility for individuals with hearing impairments.
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.
Primary: Communication University of China
All Institutions: Ant Group, Communication University of China, Key Laboratory of Media Audio, Ministry of Education, State Key Laboratory of Media Convergence and Communication
The main contribution of this paper is the introduction of SDD-APALLM, a novel framework that enhances speech deepfake detection by explicitly exposing fine-grained acoustic evidence, thereby improving model robustness and interpretability. This work addresses a significant gap in the current methodologies for audio LLMs, providing a promising direction for future research in the field of audio processing and deepfake detection.
The proposed methodology, SDD-APALLM, innovatively enhances the accessibility of fine-grained acoustic evidence by integrating structured time-frequency representations alongside raw audio inputs. This approach effectively shifts the focus from semantic plausibility to acoustically grounded evidence, addressing a critical limitation in existing audio LLM-based speech deepfake detection methods. The use of Constant-Q Transform (CQT) to create visual tokens that highlight spectral structures linked to speech synthesis artifacts is particularly noteworthy, as it provides a clear mechanism for improving model interpretability and robustness.
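The CQT features underpinning such visual tokens can be computed with librosa, as in the sketch below; the file path and hop settings are placeholders, and how the resulting spectrogram is tokenized and fused with the audio LLM is model-specific and not shown.

```python
# pip install librosa
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)   # placeholder path, 16 kHz

# Constant-Q transform: log-spaced frequency bins give finer resolution in
# the low-frequency region where many synthesis artifacts concentrate.
C = librosa.cqt(y, sr=sr, hop_length=256, n_bins=84, bins_per_octave=12)
cqt_db = librosa.amplitude_to_db(np.abs(C), ref=np.max)
print(cqt_db.shape)   # (84, n_frames); this image-like map would feed the visual branch
```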
The experiments are comprehensive, involving both in-domain and cross-domain evaluations across multiple datasets (ASVspoof2019 LA and ASVspoof2021 LA). The results demonstrate significant improvements in detection accuracy and robustness when utilizing the proposed framework, particularly under conditions where traditional models struggle. The ablation studies effectively illustrate the contributions of different modalities and reinforce the claim that explicit acoustic evidence enhances performance.
The paper provides detailed implementation information, including model architecture, training objectives, and hyperparameters, which supports reproducibility. However, the absence of a publicly accessible code repository or demo limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world audio deepfakes. Additionally, while the approach improves robustness, it may still be susceptible to novel spoofing techniques that exploit different acoustic characteristics not covered in the training data.
The implications of this research extend to various applications in security and trustworthiness of speech-based systems, such as voice authentication and content verification. By improving the detection of speech deepfakes, this work contributes to safeguarding against misinformation and enhancing the integrity of audio communications.
Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters, less than 1/30 of full fine-tuning, EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer's effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of EmoShift, a lightweight activation-steering framework that significantly enhances emotional expressiveness in TTS systems while maintaining naturalness and speaker similarity. This work represents a meaningful advancement in the field of emotion-aware speech synthesis, addressing critical limitations of existing approaches and providing a foundation for future research in emotional control in TTS.
The proposed EmoShift framework introduces a novel EmoSteer layer that learns emotion-specific steering vectors, allowing for precise emotional control in TTS without retraining the base model. The methodology is well-structured, leveraging activation steering to inject emotion-specific offsets in a plug-and-play manner. This approach is innovative as it addresses the limitations of existing emotion-aware TTS systems that rely on fixed emotion embeddings or external guidance. The model's architecture is designed to be model-agnostic, which enhances its applicability across various TTS systems. The integration of objective and subjective evaluations to assess performance is commendable, providing a holistic view of the model's effectiveness.
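A generic activation-steering layer of the kind described can be sketched in a few lines of PyTorch; the parameterization below, including the alpha intensity knob, is an assumption rather than the paper's exact EmoSteer design.

```python
import torch
import torch.nn as nn

class EmoSteer(nn.Module):
    """Toy steering layer: add a learned per-emotion offset to hidden states.

    A generic activation-steering sketch, not the paper's exact EmoSteer
    parameterization; `alpha` hints at the intensity control discussed above.
    """
    def __init__(self, num_emotions, dim):
        super().__init__()
        self.vectors = nn.Parameter(torch.zeros(num_emotions, dim))

    def forward(self, hidden, emotion_id, alpha=1.0):
        # hidden: (batch, time, dim); broadcast the emotion offset over time.
        return hidden + alpha * self.vectors[emotion_id].unsqueeze(1)

layer = EmoSteer(num_emotions=5, dim=256)
steered = layer(torch.randn(2, 100, 256), emotion_id=torch.tensor([3, 1]), alpha=0.8)
print(steered.shape)  # torch.Size([2, 100, 256])
```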
The experimental setup is robust, utilizing a well-defined dataset (ESD) and comparing EmoShift against strong baselines, including a fully fine-tuned model and a model with the EmoSteer layer. The results demonstrate significant improvements in emotional expressiveness while maintaining naturalness and speaker similarity. The use of both objective metrics (WER, SpkSIM, DNSMOS) and subjective metrics (MOS, Emo-MOS) strengthens the evaluation, confirming the model's capabilities across multiple dimensions of TTS performance.
The paper provides sufficient details regarding the experimental setup, including training parameters, dataset partitioning, and evaluation metrics, which aids in reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on a specific dataset (ESD), which may affect the generalizability of the results to other languages or emotional contexts. Additionally, while the EmoSteer layer shows promise for emotional control, the paper does not explore the impact of using more diverse or compound emotions, which could enhance the model's applicability in real-world scenarios.
The EmoShift framework has significant implications for applications in virtual assistants, audiobooks, and human-machine dialogue systems, where emotional expressiveness is crucial for user engagement and interaction quality. By enabling more nuanced emotional control in TTS, this work could enhance user experiences in various domains, including education, entertainment, and accessibility.
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Critically, we leverage the In-Context Learning capability of pre-trained LALMs to formulate MCLP via a continuation log-probability prediction. This metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and Role-Play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Beihang University, StepFun
The paper presents a significant contribution to the field of machine learning by addressing the challenge of stylistic consistency in role-play TTS through the innovative use of MCLP and a hybrid reward mechanism. The methodology is robust, and the experimental results demonstrate its effectiveness, marking a meaningful advancement in the capabilities of TTS systems.
The paper introduces a novel metric, Mean Continuation Log-Probability (MCLP), which quantifies stylistic consistency in TTS systems using the capabilities of pre-trained Large Audio Language Models (LALMs). The methodology is well-structured, combining supervised fine-tuning (SFT) and reinforcement learning (RL) to optimize TTS for role-play scenarios. The integration of MCLP as both an evaluation metric and a reward signal is innovative, providing a more nuanced approach to measuring stylistic adherence in generated speech. The use of a hybrid reward function that balances style and content fidelity is a significant advancement in addressing the challenges of role-play TTS.
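The metric can be approximated as follows, assuming a generic interface in which a frozen model maps a token sequence to next-token logits; the real MCLP conditions on speech-token sequences inside a pre-trained LALM, so the toy model, token IDs, and shapes below are placeholders.

```python
import torch
import torch.nn.functional as F

def mean_continuation_logprob(lm, generated_tokens, reference_tokens):
    """Mean log-probability the frozen model assigns to the ground-truth
    continuation, conditioned on the generated sequence.

    `lm` is assumed to map token IDs of shape (1, seq_len) to next-token
    logits of shape (1, seq_len, vocab); this interface is a stand-in.
    """
    context = torch.cat([generated_tokens, reference_tokens], dim=1)
    logits = lm(context)                                   # (1, T, vocab)
    # Positions predicting the reference tokens start right after the prompt.
    start = generated_tokens.size(1) - 1
    pred = logits[:, start:start + reference_tokens.size(1), :]
    logp = F.log_softmax(pred, dim=-1)
    tok_logp = logp.gather(-1, reference_tokens.unsqueeze(-1)).squeeze(-1)
    return tok_logp.mean().item()

toy_lm = lambda ids: torch.randn(1, ids.size(1), 100)      # stand-in model
gen = torch.randint(0, 100, (1, 20))
ref = torch.randint(0, 100, (1, 15))
print(mean_continuation_logprob(toy_lm, gen, ref))
```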
The experiments are comprehensive, utilizing a newly constructed RP-TTS dataset with rich annotations that enhance the evaluation of the proposed method. The results demonstrate significant improvements over strong baselines in both objective and subjective metrics, indicating the effectiveness of MCLP in real-world applications. The paper includes rigorous ablation studies that validate the necessity of each component of the proposed method, further strengthening the experimental findings.
While the paper provides detailed descriptions of the methodology and experimental setup, it lacks specific implementation details and code availability, which could hinder reproducibility. The absence of a demo or project URL further complicates efforts to replicate the results.
One limitation is the reliance on subjective evaluations, which can introduce variability based on annotator interpretation. Additionally, the paper does not address potential biases in the dataset construction process, which could affect the generalizability of the findings. The hybrid reward formulation, while innovative, may also lead to complexities in tuning the reward parameters effectively.
The advancements in expressive TTS systems have significant implications for various applications, including gaming, virtual assistants, and interactive storytelling. By improving the ability of TTS systems to maintain stylistic consistency, this work could enhance user engagement and experience in interactive media.
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding. Our code is available at https://github.com/NKU-HLT/DIFFA.git.
Primary: Meituan
All Institutions: Meituan
The main contribution of this paper is the introduction of DIFFA-2, a diffusion-based large audio language model that significantly enhances audio understanding capabilities through innovative training methodologies and architectures. This work represents a meaningful step forward in the field of audio processing and understanding, showcasing the potential of diffusion models in a domain traditionally dominated by autoregressive approaches.
The methodology is robust, introducing a four-stage training curriculum that effectively combines semantic and acoustic alignment, large-scale supervised fine-tuning, and preference optimization. The dual-adapter architecture and the use of a frozen Whisper encoder are innovative, allowing for effective audio understanding. The paper also employs variance-reduced preference optimization, which is a notable contribution to the training process of diffusion models.
The experiments are comprehensive, utilizing multiple benchmarks (MMSU, MMAU, MMAR) to evaluate the model's performance across various dimensions of audio understanding. The results indicate that DIFFA-2 consistently outperforms its predecessor and competes well with strong autoregressive models, demonstrating the effectiveness of the proposed methods.
The paper provides sufficient details about the training and inference setup, including the datasets used and the training pipeline. However, the reproducibility could be enhanced with more explicit descriptions of hyperparameters and model configurations.
The paper acknowledges limitations in its training focus, particularly regarding conversational and alignment-style supervision, which affects performance on dialogue-centric benchmarks. Additionally, the model's performance on mixed-modality tasks is not as strong, indicating areas for improvement.
The advancements in audio understanding through DIFFA-2 have significant implications for applications in interactive voice assistants, audio analysis, and multimedia content understanding. The open-sourcing of the code and training pipeline also promotes further research in this area.
We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.
Primary: unknown
All Institutions: unknown
This paper presents a significant advancement in the encoding of spatial audio through a novel neural architecture that leverages cross-attention mechanisms and directional ATFs, demonstrating strong performance in challenging acoustic environments. The methodology and results contribute meaningfully to the field of audio processing and spatial audio technologies.
The paper introduces a novel deep neural network architecture that effectively encodes microphone array signals into Ambisonics using directional array transfer functions (ATFs) and cross-attention mechanisms. The separation of encoders for audio and directional responses is a significant methodological advancement, allowing for the generation of array-independent spatial audio representations. The use of cross-attention to combine features from different modalities is well-justified and aligns with contemporary trends in multi-modal learning. However, the paper could benefit from a clearer explanation of the architecture's design choices and the rationale behind specific hyperparameter selections.
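The cross-attention fusion can be sketched with a standard PyTorch attention block, as below; the encoder outputs, dimensions, and residual/normalization arrangement are illustrative assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class AudioATFFusion(nn.Module):
    """Sketch: audio features attend to directional-response (ATF) features.

    Encoder internals and dimensions are placeholders; only the cross-attention
    fusion idea described above is illustrated.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio_feats, atf_feats):
        # audio_feats: (batch, time, dim); atf_feats: (batch, directions, dim)
        fused, _ = self.attn(query=audio_feats, key=atf_feats, value=atf_feats)
        return self.norm(audio_feats + fused)   # residual connection + norm

fusion = AudioATFFusion()
out = fusion(torch.randn(2, 100, 256), torch.randn(2, 32, 256))
print(out.shape)   # torch.Size([2, 100, 256])
```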
The evaluation of the proposed method is thorough, utilizing simulated data across two distinct environments: a mobile phone scenario with body scattering and a free-field condition. The comparative analysis against traditional DSP methods and existing neural solutions is robust, demonstrating clear performance improvements in terms of scale-invariant signal-to-distortion ratio (SI-SDR) and other Ambisonics metrics. The results are well-presented, though additional qualitative assessments, such as listening tests, would strengthen the findings.
The paper provides a detailed description of the experimental setup, including data generation, training procedures, and evaluation metrics. However, the absence of a publicly accessible code repository or demo limits reproducibility. Future work should include sharing the implementation to facilitate validation and further exploration by the community.
One limitation is the reliance on simulated data, which may not fully capture the complexities of real-world scenarios. Additionally, while the model shows promising results, its generalization capabilities to various real-world microphone configurations and environments remain to be thoroughly tested. The paper also mentions that the model's performance could be enhanced by increasing the learning capacity of the encoders and decoder, indicating potential avenues for future research.
The proposed method has significant implications for spatial audio applications, particularly in immersive communication and virtual/extended reality environments. By improving the encoding of microphone array signals, this work could enhance user experiences in various consumer devices, making it relevant for industries focused on audio technology and immersive media. The ability to generalize across different microphone configurations also opens up possibilities for broader adoption in diverse applications.
To advance immersive communication, the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge recently introduced Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). An S5 system takes a multi-channel audio mixture as input and outputs single-channel dry sources along with their corresponding class labels. Although the DCASE 2025 Challenge simplifies the task by constraining class labels in each mixture to be mutually exclusive, real-world mixtures frequently contain multiple sources from the same class. The presence of duplicated labels can significantly degrade the performance of the label-queried source separation (LQSS) model, which is the key component of many existing S5 systems, and can also limit the validity of the official evaluation metric of DCASE 2025 Task 4. To address these issues, we propose a class-aware permutation-invariant loss function that enables the LQSS model to handle queries involving duplicated labels. In addition, we redesign the S5 evaluation metric to eliminate ambiguities caused by these same-class sources. To evaluate the proposed method within the S5 system, we extend the label prediction model to support same-class labels. Experimental results demonstrate the effectiveness of the proposed methods and the robustness of the new metric on mixtures both with and without same-class sources.
Primary: unknown
All Institutions: JST Strategic International Collaborative Research Program (SICORP)
This paper presents a novel approach to handling duplicated labels in sound source separation, significantly improving the performance of systems designed for complex audio environments. The technical contributions are well-articulated, and the proposed methodologies could set a new standard in the field of audio processing and immersive communication.
The paper proposes a class-aware permutation-invariant loss function that effectively addresses the challenges posed by duplicated labels in sound source separation tasks. The methodology is well-structured, introducing modifications to existing models and metrics to enhance performance in real-world scenarios where multiple sources from the same class are present. The approach is innovative in its use of permutation-invariant training tailored to the specific context of audio segmentation, which is a significant advancement over traditional methods that do not account for label duplication.
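The core idea of restricting permutation-invariant training to same-class sources can be sketched as follows; the loss below is a simplified stand-in (L1 over waveforms, exhaustive same-class permutations) rather than the paper's exact objective or its redesigned metric.

```python
import itertools
import torch

def class_aware_pit_loss(est, ref, labels, loss_fn=torch.nn.functional.l1_loss):
    """Permutation-invariant loss restricted to sources sharing a class label.

    est, ref: tensors of shape (num_sources, samples); labels: list of class
    names aligned with ref. Sources of different classes keep their fixed
    pairing; only same-class groups are permuted.
    """
    total = 0.0
    for cls in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        best = None
        for perm in itertools.permutations(idx):
            loss = sum(loss_fn(est[p], ref[i]) for i, p in zip(idx, perm))
            best = loss if best is None else torch.minimum(best, loss)
        total = total + best
    return total / len(labels)

est = torch.randn(3, 16000)
ref = torch.randn(3, 16000)
print(class_aware_pit_loss(est, ref, ["speech", "speech", "siren"]))
```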
The experiments are comprehensive, utilizing a well-defined dataset that simulates real-world conditions. The authors provide a detailed analysis of the performance of their proposed system compared to existing methods, demonstrating significant improvements in handling same-class sources. However, the paper could benefit from additional comparisons with more diverse models and datasets to further validate the robustness of the proposed approach.
The paper mentions that the source code will be released as part of the baseline system for the DCASE 2026 Challenge, which is a positive step towards reproducibility. However, the lack of specific URLs for the code repository and demo limits the immediate accessibility of the implementation details.
The paper acknowledges that the performance of the audio tagging model is still limited when estimating the number of sources and their labels simultaneously, particularly in the presence of multiple sources from the same class. Additionally, the reliance on oracle labels during training may not fully reflect real-world applications where such labels are not available.
The proposed methods have significant implications for immersive communication technologies and audio processing applications, particularly in environments where multiple sound sources coexist. The advancements in sound source separation could enhance user experiences in virtual and augmented reality applications, as well as improve accessibility in audio-based communication systems.
Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Codes and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of the Convolutional Audio Transformer (CAT), which effectively addresses the limitations of existing self-supervised learning methods in audio understanding by incorporating a multi-resolution approach and representation regularization. This work represents a meaningful step forward in the field, combining innovative methodology with rigorous experimental validation to enhance audio representation learning.
The proposed Convolutional Audio Transformer (CAT) introduces a Multi-resolution Block to capture hierarchical audio features, which is a significant advancement over existing methods that typically operate at a single level of granularity. The incorporation of a Representation Regularization objective is innovative, as it aligns the student model's predictions with high-quality semantic representations from pre-trained external encoders. This approach not only enhances the model's training efficiency but also bridges the gap between audio and language representations, which is a novel contribution to the field of audio understanding.
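The representation-regularization objective can be illustrated with a short sketch; the weighting and tensor interfaces below are my own assumptions, not CAT's released code.

```python
# Sketch of a bootstrap loss plus an auxiliary alignment term toward a frozen
# external encoder's features (interfaces and weighting are assumptions).
import torch
import torch.nn.functional as F


def representation_regularization(student_pred, teacher_feat, external_feat,
                                  weight=0.5):
    """All inputs: (batch, frames, dim). `external_feat` comes from a frozen,
    pre-trained encoder and is detached; `weight` trades off the auxiliary term."""
    bootstrap = F.mse_loss(student_pred, teacher_feat.detach())
    align = 1.0 - F.cosine_similarity(
        student_pred, external_feat.detach(), dim=-1).mean()
    return bootstrap + weight * align


student = torch.randn(4, 100, 256, requires_grad=True)
teacher = torch.randn(4, 100, 256)     # e.g. an EMA teacher's targets
external = torch.randn(4, 100, 256)    # e.g. features from a frozen encoder
loss = representation_regularization(student, teacher, external)
loss.backward()
```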
The experiments conducted on multiple audio understanding benchmarks, including AudioSet, ESC-50, and Speech Commands V2, demonstrate the effectiveness of CAT. The reported results show significant improvements over baseline models, particularly in terms of convergence speed and performance metrics. The use of various datasets and the comparison against state-of-the-art models strengthen the credibility of the findings. However, more details on the experimental setup and statistical significance of the results would enhance the evaluation.
The paper mentions that codes and checkpoints will be released, which is a positive aspect for reproducibility. However, the detailed hyperparameter settings and training configurations provided in the tables are essential for others to replicate the experiments accurately. The clarity of these details is crucial for ensuring that the research can be reproduced by the community.
One limitation is the reliance on pre-trained external encoders, which may limit the model's applicability in scenarios where such resources are not available. Additionally, while the model shows improved performance, the computational efficiency and scalability of the approach in real-world applications need further exploration. The paper could also benefit from a more thorough discussion on the potential biases in the datasets used.
The advancements made in audio understanding through the CAT framework have significant implications for various applications, including automated audio captioning, sound event detection, and human-computer interaction. By improving the efficiency and effectiveness of audio representation learning, this research could lead to more robust audio processing systems in diverse domains such as entertainment, surveillance, and accessibility technologies.
In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: https://semanticaudio1.github.io/
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Shanghai Jiao Tong University
The main contribution of this work is the introduction of the SemanticAudio framework, which decouples semantic planning from acoustic synthesis, achieving superior semantic alignment and enabling training-free audio editing. This innovative approach addresses critical limitations in existing text-to-audio generation models and has the potential to significantly impact the field of audio synthesis and editing.
The proposed SemanticAudio framework introduces a two-stage Flow Matching architecture that effectively separates the semantic planning of audio content from the acoustic synthesis process. This decoupling allows for improved semantic alignment with textual prompts, addressing a significant limitation in existing models that operate directly in acoustic latent spaces. The methodology is well-structured, leveraging pre-trained models for both semantic and acoustic representations, and introduces a novel training-free editing mechanism that enhances user control over audio attributes. The use of velocity fields for guiding the generation process is particularly innovative and demonstrates a solid understanding of the underlying principles of generative modeling.
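The training-free editing mechanism can be sketched as a steered sampling step; the velocity-network interface and guidance weight below are assumptions for illustration, not the paper's implementation.

```python
# Sketch: steer the semantic flow-matching trajectory by the difference of
# velocity fields predicted under the source and target prompts.
import torch


def edit_step(x, t, dt, velocity_fn, src_cond, tgt_cond, guidance=1.5):
    """One Euler step of the steered semantic trajectory (assumed interface)."""
    v_src = velocity_fn(x, t, src_cond)          # velocity under source prompt
    v_tgt = velocity_fn(x, t, tgt_cond)          # velocity under target prompt
    v_edit = v_src + guidance * (v_tgt - v_src)  # steer toward the edited attribute
    return x + dt * v_edit


# Toy usage with a stand-in velocity network.
def dummy_velocity(x, t, cond):
    return torch.tanh(x + cond)


x = torch.randn(1, 64)
src = torch.zeros(1, 64)
tgt = torch.ones(1, 64)
for step in range(10):
    t = step / 10.0
    x = edit_step(x, t, dt=0.1, velocity_fn=dummy_velocity,
                  src_cond=src, tgt_cond=tgt)
```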
The experiments conducted are extensive and rigorously designed, utilizing the AudioCaps dataset to evaluate both text-to-audio generation and training-free editing capabilities. The paper provides clear metrics for assessing performance, including CLAP scores for semantic alignment, Fréchet Distance for fidelity, and Inception Score for diversity. The results indicate that SemanticAudio outperforms existing state-of-the-art methods, validating the proposed approach. However, the reliance on a single dataset for training and evaluation may limit the generalizability of the findings.
The paper includes detailed implementation specifics, including architecture choices, training protocols, and evaluation metrics, which facilitate reproducibility. The use of established frameworks and pre-trained models further aids in replicating the results. However, the absence of a public code repository may hinder full reproducibility for some researchers.
The paper acknowledges limitations related to the dataset size and the potential challenges in generalizing the model to longer audio sequences or more complex acoustic scenarios. Additionally, the evaluation of editing capabilities relies on proxy metrics, which may not fully capture the subjective quality of the audio modifications. Future work is needed to address these limitations and explore broader datasets.
The SemanticAudio framework has significant implications for various applications in creative industries, such as film, gaming, and virtual reality, where high-quality audio generation and editing are crucial. The ability to manipulate audio attributes without retraining models can streamline workflows for sound designers and enhance user experiences in interactive environments. The research contributes to the growing field of generative audio models, pushing the boundaries of what is possible in text-to-audio synthesis.
Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency of around 5 Hz, while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous models that operate at much higher token rates. Furthermore, Sylber 2.0 enables efficient TTS modeling that generates speech with intelligibility and quality competitive with SOTA models while using only 72M parameters. Moreover, the universality of Sylber 2.0 provides more effective features for low-resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Carnegie Mellon University
Sylber 2.0 presents a significant advancement in speech modeling by introducing a universal syllable embedding framework that efficiently captures linguistic and acoustic details across multiple languages. The comprehensive methodology, rigorous experimental evaluation, and potential for broad applications underscore its importance in the field of machine learning and audio processing.
The methodology presented in Sylber 2.0 is robust and innovative, leveraging self-supervised learning to create syllable embeddings that effectively capture both linguistic and acoustic details across multiple languages. The introduction of a boundary detector and an auxiliary acoustic encoder enhances the model's ability to generate high-fidelity speech while maintaining a low token frequency. The multi-stage training process and the careful design of the encoding-decoding framework demonstrate a thorough understanding of the challenges in speech modeling.
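The role of the boundary detector can be illustrated with a small sketch of syllable-level pooling; the tensor shapes and pooling rule are my own assumptions about how frame features might be compressed to roughly 5 Hz tokens.

```python
# Sketch: mean-pool frame-level features into syllable-level tokens given
# predicted boundary frame indices (assumed interface, not Sylber 2.0's code).
import torch


def pool_syllables(frame_feats, boundaries):
    """frame_feats: (frames, dim); boundaries: sorted frame indices where a new
    syllable starts (must include 0). Returns (num_syllables, dim)."""
    ends = list(boundaries[1:]) + [frame_feats.shape[0]]
    segments = [frame_feats[s:e].mean(dim=0)
                for s, e in zip(boundaries, ends) if e > s]
    return torch.stack(segments)


feats = torch.randn(500, 768)        # 10 s of 50 Hz frame features
bounds = list(range(0, 500, 10))     # detected boundaries at a ~5 Hz rate
tokens = pool_syllables(feats, bounds)
print(tokens.shape)                  # torch.Size([50, 768])
```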
The experiments conducted are comprehensive and well-structured, covering a wide range of languages and styles. The results indicate that Sylber 2.0 achieves competitive performance in terms of intelligibility and quality compared to state-of-the-art models, even with a significantly reduced parameter count. The evaluation metrics used, such as WER and STOI, provide a clear picture of the model's effectiveness in real-world applications.
The paper provides detailed implementation details, including training data sources and hyperparameter settings, which enhance reproducibility. However, the absence of a publicly available code repository or demo limits the ability for other researchers to reproduce the results independently.
One limitation is the reliance on the quality of the training data, as the model's performance may vary significantly with different datasets. Additionally, while the model is designed for multilingual applications, the performance in low-resource languages could be further explored to assess its generalizability. The potential for misuse in generating misleading audio also raises ethical concerns that need to be addressed.
The implications of this research are significant, particularly in the fields of text-to-speech (TTS) and automatic speech recognition (ASR). By providing a more efficient and universal method for speech tokenization, Sylber 2.0 could enhance accessibility and usability in various applications, including language learning, assistive technologies, and multilingual communication. However, ethical considerations regarding the misuse of synthesized speech must be taken into account.
The performance of speaker verification systems degrades significantly under language mismatch, a critical challenge exacerbated by the field's reliance on English-centric data. To address this, we propose the TidyVoice Challenge for cross-lingual speaker verification. The challenge leverages the TidyVoiceX dataset from the novel TidyVoice benchmark, a large-scale, multilingual corpus derived from Mozilla Common Voice, and specifically curated to isolate the effect of language switching across approximately 40 languages. Participants will be tasked with building systems robust to this mismatch, with performance primarily evaluated using the Equal Error Rate on cross-language trials. By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards fairer, more inclusive, and language-independent speaker recognition technologies, directly aligning with the Interspeech 2026 theme, "Speaking Together."
Primary: University of Zurich
All Institutions: University of Zurich, Indiana University, Mozilla Foundation, Otto-von-Guericke-University Magdeburg
The TidyVoice Challenge aims to advance cross-lingual speaker verification research by providing a structured evaluation framework and a curated multilingual dataset. This comprehensive analysis highlights the challenge's innovative approach, rigorous methodology, and potential implications for the field of machine learning.
The methodology proposed in the TidyVoice Challenge is well-structured and addresses a significant gap in speaker verification research, particularly focusing on cross-lingual scenarios. The challenge is designed to evaluate systems under controlled conditions with a clear definition of tasks, training, and test conditions. The use of the TidyVoiceX dataset, which is specifically curated to isolate language switching effects, adds robustness to the methodology. The evaluation metrics, including Equal Error Rate (EER) and Minimum Detection Cost Function (minDCF), are appropriate for the task and provide a comprehensive assessment of system performance.
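For readers less familiar with the primary metric, the following self-contained snippet shows how Equal Error Rate is typically computed from verification scores on cross-language trials; it is a standard textbook computation, not code from the challenge baseline.

```python
# Equal Error Rate (EER) from similarity scores and same/different-speaker labels.
import numpy as np


def equal_error_rate(scores, labels):
    """scores: higher = more likely same speaker; labels: 1 same, 0 different."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for th in thresholds:
        decisions = scores >= th
        fars.append(np.mean(decisions[labels == 0]))    # false acceptance rate
        frrs.append(np.mean(~decisions[labels == 1]))   # false rejection rate
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))
    return (fars[idx] + frrs[idx]) / 2.0


rng = np.random.default_rng(0)
same = rng.normal(1.0, 0.5, 1000)
diff = rng.normal(0.0, 0.5, 1000)
print(equal_error_rate(np.concatenate([same, diff]),
                       np.concatenate([np.ones(1000), np.zeros(1000)])))
```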
The paper outlines a rigorous evaluation plan that includes a clear delineation of training and evaluation phases, as well as the use of a baseline system for comparison. The challenge's design ensures that participants are tested on their ability to generalize to unseen languages, which is critical for assessing the robustness of speaker verification systems. However, the paper does not provide empirical results or preliminary findings, which could have strengthened the evaluation of the proposed challenge.
The challenge emphasizes reproducibility by requiring participants to submit detailed system descriptions and trained models. This is a positive aspect, as it encourages transparency and allows for independent verification of results. The provision of a baseline system and evaluation scripts further enhances reproducibility, although the actual implementation details of the baseline system are not fully elaborated in the paper.
One limitation of the challenge is the reliance on the Mozilla Common Voice dataset, which may have inherent biases or limitations in terms of speaker diversity and language representation. Additionally, the challenge does not address potential issues related to the quality of the audio recordings, which could impact the performance of the systems developed by participants.
The TidyVoice Challenge has the potential to significantly impact the field of speaker verification by promoting research that is more inclusive and representative of diverse languages. By focusing on cross-lingual verification, the challenge aligns with broader goals of fairness and accessibility in machine learning technologies. The outcomes of this challenge could lead to advancements in language-independent speaker recognition systems, benefiting various applications in security, telecommunications, and human-computer interaction.
While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real-world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause severe semantic distortions. To bridge this gap, we introduce a Large Language Model (LLM)-based agent for post-ASR correction: a Judge-Editor over the top-k ASR hypotheses that keeps high-confidence spans, rewrites uncertain segments, and operates in both zero-shot and fine-tuned modes. In parallel, we release SAP-Hypo5, the largest benchmark for dysarthric speech correction, to enable reproducibility and future exploration. Under multi-perspective evaluation, our agent achieves a 14.51% WER reduction alongside substantial semantic gains, including a +7.59 pp improvement in MENLI and +7.66 pp in Slot Micro F1 on challenging samples. Our analysis further reveals that WER is highly sensitive to domain shift, whereas semantic metrics correlate more closely with downstream task performance.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
This paper presents a significant advancement in the field of dysarthric speech recognition by proposing a robust LLM-based post-ASR correction method that prioritizes semantic fidelity over traditional metrics. The combination of innovative methodology and comprehensive evaluation positions this work as a valuable contribution to both the machine learning and speech recognition communities.
The paper introduces a novel approach to post-ASR correction for dysarthric speech using a Large Language Model (LLM) as a Judge-Editor. This method is significant as it operates on the top-k ASR hypotheses, allowing for the retention of high-confidence segments while rewriting uncertain parts. The dual operational modes (zero-shot and fine-tuned) enhance its applicability across various scenarios. The integration of semantic fidelity metrics alongside traditional WER represents a meaningful shift in how ASR systems are evaluated, particularly for populations with unique speech characteristics.
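A zero-shot Judge-Editor of this kind could be prompted roughly as follows; the prompt wording and function name are illustrative placeholders, not the paper's actual prompt.

```python
# Illustrative prompt construction for a zero-shot Judge-Editor over top-k
# ASR hypotheses (hypothetical prompt; the LLM call itself is not shown).
def build_judge_editor_prompt(hypotheses):
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    return (
        "You are correcting ASR output for dysarthric speech.\n"
        "Below are the top ASR hypotheses for one utterance:\n"
        f"{numbered}\n\n"
        "Keep wording that the hypotheses agree on. Where they disagree, "
        "choose or rewrite the segment so the sentence is fluent and "
        "semantically plausible. Return only the corrected transcript."
    )


hyps = [
    "please turn of the kitchen lights",
    "please turn off the kitchen lights",
    "please turn off the kitchen light",
]
prompt = build_judge_editor_prompt(hyps)
# corrected = llm.generate(prompt)  # hypothetical LLM call
print(prompt)
```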
The authors provide a comprehensive evaluation of their method using the newly released SAP-Hypo5 dataset, which is the largest benchmark for dysarthric speech correction. The reported 14.51% reduction in WER, alongside improvements in semantic metrics (MENLI and Slot Micro F1), indicates robust experimental design and results. The multi-perspective evaluation approach adds depth to the analysis, showing that traditional metrics can be misleading in specific contexts, particularly for dysarthric speech.
The authors emphasize the importance of reproducibility by releasing the SAP-Hypo5 dataset, which is crucial for future research in this area. However, the paper lacks specific details regarding the implementation of the LLM-agent and whether the code or models will be made publicly available, which could hinder full reproducibility.
While the paper presents a strong methodology and results, it does not address potential limitations in the generalizability of the LLM-agent across different dialects or languages of dysarthric speech. Additionally, the reliance on top-k hypotheses may introduce biases based on the ASR system used, which could affect the outcomes.
The implications of this research are significant, particularly for improving communication aids for individuals with dysarthria. By enhancing the accuracy of ASR systems in this context, the work could lead to better accessibility tools, ultimately improving the quality of life for affected individuals. The focus on semantic fidelity also sets a precedent for future research in ASR applications beyond dysarthric speech.
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap in high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.
Primary: Wuhan University
All Institutions: Wuhan University, Anhui University, Communication University of China, Beihang University, Independent Researcher
The paper presents a comprehensive approach to speech editing detection and content localization through the development of the PELM framework, significantly advancing the field of audio processing and detection. The innovative methodology, combined with robust experimental validation, positions this work as a valuable contribution to combating the challenges posed by advanced audio manipulation techniques.
The paper introduces a novel framework, PELM, that combines speech editing detection and content localization by treating them as an audio question answering task. The incorporation of a word-level probabilistic prior and an acoustic consistency-aware loss is innovative, addressing biases in existing audio large language models. The methodology is well-structured, leveraging a large-scale bilingual dataset (AiEdit) that enhances the robustness of the model against advanced speech editing techniques.
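The centroid-aggregation consistency idea, as I read the description, can be sketched as a contrastive-style loss over frame embeddings; the margin, similarity measure, and masking interface below are assumptions, not PELM's actual loss.

```python
# Rough sketch: pull unedited frames toward their centroid, push edited frames
# away by a margin, so subtle local distribution anomalies become separable.
import torch
import torch.nn.functional as F


def consistency_loss(frame_emb, edit_mask, margin=0.5):
    """frame_emb: (frames, dim); edit_mask: bool tensor, True where edited."""
    genuine = frame_emb[~edit_mask]
    edited = frame_emb[edit_mask]
    centroid = genuine.mean(dim=0, keepdim=True)
    pull = (1.0 - F.cosine_similarity(genuine, centroid)).mean()
    if edited.numel() == 0:
        return pull
    push = F.relu(F.cosine_similarity(edited, centroid) - margin).mean()
    return pull + push


emb = torch.randn(200, 128, requires_grad=True)
mask = torch.zeros(200, dtype=torch.bool)
mask[80:120] = True                       # a tampered span
consistency_loss(emb, mask).backward()
```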
The experiments are thorough, comparing PELM against several state-of-the-art methods on both the HumanEdit and AiEdit datasets. The results demonstrate significant improvements in detection and localization tasks, with detailed metrics such as Equal Error Rate (EER) showing PELM's superiority. The ablation studies provide insights into the contributions of each component of the framework, reinforcing the validity of the proposed methods.
The paper provides sufficient implementation details, including model architectures, training configurations, and hyperparameters, which support reproducibility. The authors have also made the dataset publicly available, facilitating further research in this area.
One limitation is the reliance on the quality of the underlying large language models, which may affect the performance of PELM. Additionally, while the dataset is extensive, it may not cover all possible speech editing scenarios, potentially limiting the generalizability of the findings.
The research addresses critical issues related to audio deepfakes and misinformation, making it highly relevant in today's digital landscape. The ability to detect and localize speech edits has significant implications for security, privacy, and the integrity of information dissemination.
We propose a brain-informed speech separation method for cochlear implants (CIs) that uses electroencephalography (EEG)-derived attention cues to guide enhancement toward the attended speaker. An attention-guided network fuses audio mixtures with EEG features through a lightweight fusion layer, producing attended-source electrodograms for CI stimulation while resolving the label-permutation ambiguity of audio-only separators. Robustness to degraded attention cues is improved with a mixed curriculum that varies cue quality during training, yielding stable gains even when EEG-speech correlation is moderate. In multi-talker conditions, the model achieves higher signal-to-interference ratio improvements than an audio-only electrodogram baseline while remaining slightly smaller (167k vs. 171k parameters). With 2 ms algorithmic latency and comparable cost, the approach highlights the promise of coupling auditory and neural cues for cognitively adaptive CI processing.
Primary: unknown
All Institutions: unknown
The paper presents a novel brain-informed speech separation method for cochlear implants, demonstrating significant improvements over traditional audio-only approaches. The integration of EEG-derived attention cues and a robust training methodology highlights its potential to enhance speech intelligibility in complex auditory environments, marking a meaningful contribution to the field of machine learning and auditory processing.
The proposed methodology integrates EEG-derived attention cues with audio processing in a lightweight neural network architecture, addressing a significant challenge in cochlear implant (CI) technology. The attention-guided network effectively resolves label-permutation ambiguity by producing a single attended electrodogram, which is a notable advancement over traditional audio-only approaches. The use of curriculum learning to enhance robustness against degraded cues is a clever strategy that reflects a deep understanding of the practical challenges in real-world applications. However, the reliance on a proxy attention cue rather than real EEG data is a limitation that could affect the generalizability of the results.
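A lightweight fusion layer of the kind described might look like the following gating/FiLM-style module; the layer sizes and conditioning scheme are invented for illustration and are not taken from the paper.

```python
# Sketch: EEG-derived attention features modulate the audio separator's hidden
# features so a single attended-source electrodogram can be decoded.
import torch
import torch.nn as nn


class AttentionGuidedFusion(nn.Module):
    def __init__(self, audio_dim=64, eeg_dim=16):
        super().__init__()
        self.scale = nn.Linear(eeg_dim, audio_dim)
        self.shift = nn.Linear(eeg_dim, audio_dim)

    def forward(self, audio_feat, eeg_feat):
        # audio_feat: (batch, frames, audio_dim); eeg_feat: (batch, eeg_dim)
        gamma = torch.sigmoid(self.scale(eeg_feat)).unsqueeze(1)
        beta = self.shift(eeg_feat).unsqueeze(1)
        return gamma * audio_feat + beta  # attended-speaker-conditioned features


fusion = AttentionGuidedFusion()
out = fusion(torch.randn(2, 100, 64), torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 100, 64])
```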
The experimental evaluation is thorough, comparing the proposed model against a strong baseline in various conditions. The results demonstrate significant improvements in signal-to-interference ratio (SIR) across different input conditions, indicating the effectiveness of the proposed method. The analysis of cue correlation and its impact on performance provides valuable insights into the robustness of the model. However, the experiments could benefit from additional real-world testing with actual CI users to validate the findings further.
The paper provides a clear description of the model architecture, training procedures, and evaluation metrics, along with a link to the open-source implementation. This transparency enhances reproducibility, allowing other researchers to replicate the study and build upon the findings. However, the absence of real EEG data in the training and evaluation phases may limit the reproducibility of results in practical scenarios.
Key limitations include the use of a proxy attention cue instead of real EEG data, which may not fully capture the complexities of actual neural signals. Additionally, while the mixed curriculum learning approach shows promise, the model's performance in highly variable real-world environments remains untested. Future work should address these limitations by incorporating real EEG data and evaluating the model's performance in more complex auditory scenes.
The research has significant implications for improving speech perception in cochlear implant users, particularly in challenging listening environments such as multi-talker scenarios. By leveraging brain-computer interface techniques, this work opens avenues for more cognitively adaptive auditory processing systems, potentially enhancing the quality of life for individuals with hearing impairments. The findings could also inspire further research into multimodal integration in various applications beyond cochlear implants.
Diffusion-based speech enhancement on discrete audio codec features has gained immense attention due to its improved speech component reconstruction capability. However, such models usually suffer from high inference computational complexity due to multiple reverse-process iterations. Furthermore, they generally achieve promising results on non-intrusive metrics but perform poorly on intrusive metrics, as they may struggle to reconstruct the correct phones. In this paper, we propose DisContSE, an efficient diffusion-based speech enhancement model on joint discrete codec tokens and continuous embeddings. Our contributions are three-fold. First, we formulate both a discrete and a continuous enhancement module operating on discrete audio codec tokens and continuous embeddings, respectively, to achieve improved fidelity and intelligibility simultaneously. Second, a semantic enhancement module is further adopted to achieve optimal phonetic accuracy. Third, we achieve a single-step, efficient reverse process at inference with a novel quantization error mask initialization strategy, which, to our knowledge, is the first successful single-step diffusion speech enhancement based on an audio codec. Trained and evaluated on the URGENT 2024 Speech Enhancement Challenge data splits, the proposed DisContSE outperforms top-reported time- and frequency-domain diffusion baselines in PESQ, POLQA, UTMOS, and a subjective ITU-T P.808 listening test, clearly achieving an overall top rank.
Primary: unknown
All Institutions: unknown
The paper presents DisContSE, a novel diffusion-based speech enhancement model that effectively integrates discrete codec tokens and continuous embeddings, achieving state-of-the-art results while significantly reducing inference complexity. This contribution is poised to advance the field of speech processing, particularly in enhancing audio quality in real-time applications.
The methodology is well-structured, combining discrete and continuous embeddings to enhance speech quality while reducing computational complexity. The introduction of a single-step reverse process is innovative and addresses a significant limitation in existing diffusion models. The use of quantization error mask initialization is a novel approach that enhances the model's efficiency and effectiveness.
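One way to picture the quantization-error-mask initialization is the speculative sketch below: inference starts from the noisy utterance's codec embedding rather than pure noise, with a mask derived from the quantization error flagging unreliable frames, followed by a single denoising call. All interfaces and the thresholding rule are my assumptions; the paper's exact formulation may differ.

```python
# Speculative sketch of single-step inference with a quantization-error mask.
import torch


def single_step_enhance(noisy_embedding, quantized_embedding, denoiser,
                        threshold=0.1):
    # Quantization error: distance of the continuous embedding from its token.
    q_error = (noisy_embedding - quantized_embedding).abs().mean(dim=-1)
    mask = (q_error > threshold).float().unsqueeze(-1)   # 1 = unreliable frame
    x_init = mask * noisy_embedding + (1 - mask) * quantized_embedding
    return denoiser(x_init, mask)                        # one reverse step


def dummy_denoiser(x, mask):                             # stand-in network
    return x - 0.1 * mask * x


noisy = torch.randn(1, 200, 128)
quant = noisy + 0.05 * torch.randn_like(noisy)
enhanced = single_step_enhance(noisy, quant, dummy_denoiser)
```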
The experiments are thorough, utilizing a large-scale dataset and comparing against multiple state-of-the-art methods. The results demonstrate significant improvements across various metrics, indicating the robustness of the proposed model. The subjective listening tests add credibility to the findings.
The paper provides sufficient implementation details, including training configurations and metrics used, which aids in reproducibility. However, the lack of access to the actual code or model weights limits full reproducibility.
The paper does not address potential limitations in terms of generalizability across different languages or accents, nor does it discuss the computational requirements for real-time applications. Additionally, the subjective nature of some evaluation metrics may introduce bias.
The proposed model has significant implications for real-time speech enhancement applications, particularly in scenarios with low-quality audio inputs. Its efficiency could facilitate broader adoption in consumer electronics and assistive technologies.
We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By finetuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integrate qualities of both sources. We specifically target sound infusions, a practically and perceptually motivated subclass of morphing in which one sound acts as the dominant primary source, providing overall temporal and structural behavior, while a secondary sound is infused throughout, enriching its timbral and textural qualities. Objective evaluations and listening tests show that Mix2Morph outperforms prior baselines and produces high-quality sound infusions across diverse categories, representing a step toward more controllable and concept-driven tools for sound design. Sound examples are available at https://anniejchu.github.io/mix2morph .
Primary: Northwestern University
All Institutions: Northwestern University
Mix2Morph represents a substantial advancement in sound morphing techniques, leveraging innovative training strategies and augmentation methods to produce high-quality audio infusions. The comprehensive evaluation of its performance against existing models highlights its potential to enhance sound design practices significantly.
The paper presents a novel approach to sound morphing through the Mix2Morph model, which utilizes a finetuning strategy on noisy surrogate mixes. This method allows the model to learn morphing without the need for a dedicated morph dataset, addressing a significant limitation in the field. The use of higher diffusion timesteps to focus on capturing high-level morphing concepts while suppressing low-level artifacts is particularly innovative. The augmentation techniques, including temporal and spectral alignment, are well-justified and enhance the model's ability to produce coherent morphs. However, the methodology could benefit from a more detailed discussion on the choice of augmentation modes and their impact on the results.
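The core training recipe can be sketched as follows; the noising schedule, objective, and model interface are simplified stand-ins of my own, intended only to show the two key ingredients: a surrogate mix as the target and timesteps restricted to the high-noise range.

```python
# Sketch: finetune on noisy surrogate mixes, sampling only high timesteps so the
# model learns the high-level morph concept rather than the mix's artifacts.
import torch


def mix2morph_step(model, latent_a, latent_b, text_cond,
                   t_min=0.7, t_max=1.0, alpha=0.5):
    surrogate = alpha * latent_a + (1 - alpha) * latent_b       # surrogate mix target
    t = torch.empty(latent_a.shape[0]).uniform_(t_min, t_max)   # high timesteps only
    noise = torch.randn_like(surrogate)
    sigma = t.view(-1, 1, 1)
    noised = (1 - sigma) * surrogate + sigma * noise
    pred = model(noised, t, text_cond)                          # predict the added noise
    return ((pred - noise) ** 2).mean()


def dummy_model(x, t, cond):   # stand-in for the finetuned text-to-audio backbone
    return torch.zeros_like(x)


loss = mix2morph_step(dummy_model, torch.randn(2, 64, 128),
                      torch.randn(2, 64, 128), text_cond=None)
print(loss)
```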
The experiments are comprehensive, evaluating Mix2Morph against several baselines through both objective metrics and subjective listening tests. The paper provides a clear rationale for the selection of sound pairs and the design of the evaluation metrics, including the Latent Compressibility Score (LCS) and directionality measures. The results demonstrate that Mix2Morph consistently outperforms existing methods, showcasing its effectiveness in generating high-quality sound infusions. The statistical analysis of the subjective evaluations adds robustness to the findings.
The paper includes sufficient details regarding the model architecture and training procedures, allowing for reproducibility. However, the lack of a publicly available code repository limits the ease with which other researchers can replicate the results. Providing access to the training data or a similar dataset would further enhance reproducibility.
One limitation is the reliance on noisy surrogate mixes, which may not fully capture the complexity of high-quality morphs. Additionally, while the model shows improvements over baselines, there may still be cases where perceptual coherence is not fully achieved, particularly with more complex sound pairs. The subjective evaluation is limited to a small sample size, which may not represent the broader community's perceptions.
The Mix2Morph model has significant implications for sound design, particularly in fields such as film, gaming, and virtual reality, where high-quality sound morphing is essential for creating immersive experiences. The ability to generate sound infusions without extensive datasets opens new avenues for creativity and exploration in audio production.
Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.
Primary: National Institute of Informatics
All Institutions: National Institute of Informatics
This paper presents a significant advancement in understanding the vulnerabilities of audio watermarking systems against modern voice conversion techniques. The research effectively combines theoretical insights with practical evaluations, making it a valuable contribution to the fields of audio processing and security.
The paper introduces a novel attack method, self voice conversion (VC), which effectively preserves speaker identity and linguistic content while degrading the performance of audio watermarking systems. The methodology is well-structured, detailing the attack framework and the specific voice conversion models employed (kNN-VC and RVC). The authors provide a clear rationale for the choice of methods and their relevance to the threat model, demonstrating a deep understanding of both the watermarking and voice conversion domains. However, the methodology could benefit from more extensive comparisons with other potential attack strategies to further validate its effectiveness.
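The threat model is simple enough to outline in a few lines; every class and method below is a placeholder for illustration, not a real API from the paper or from the kNN-VC/RVC toolkits.

```python
# Pipeline sketch of the self voice conversion attack (placeholder interfaces).
def self_vc_attack(watermarked_wav, vc_model, detector):
    # The speaker's own voice is the conversion target: identity and content are
    # preserved, but the waveform (and any embedded watermark) is re-synthesized.
    converted = vc_model.convert(source=watermarked_wav, target_ref=watermarked_wav)
    return detector.extract(converted)


class DummyVC:
    def convert(self, source, target_ref):
        return source          # a real model (e.g. kNN-VC, RVC) regenerates audio


class DummyDetector:
    def extract(self, wav):
        return [0] * 16        # a real detector returns the recovered payload bits


bits = self_vc_attack([0.0] * 16000, DummyVC(), DummyDetector())
print(bits)
```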
The experimental evaluation is robust, employing a variety of watermarking systems and assessing their performance under the proposed self VC attack. The results are clearly presented in tables, showing the degradation of watermark extraction accuracy across different methods. The use of standard datasets (LibriTTS) and metrics (WER, UTMOS) adds credibility to the findings. However, the paper could improve by including more detailed statistical analyses to support the significance of the results.
The paper lacks specific implementation details or access to code and models, which limits reproducibility. While the authors mention that source code and model checkpoints will not be publicly released due to potential security implications, providing at least some implementation details or pseudo-code would enhance reproducibility and allow other researchers to validate the findings.
One limitation is the lack of real-world testing scenarios; the experiments are conducted under controlled conditions that may not fully capture the complexities of real-world audio processing and watermarking. Additionally, the reliance on specific voice conversion models may limit the generalizability of the findings to other models or methods not considered in the study.
The implications of this research are significant, as it highlights vulnerabilities in current audio watermarking techniques in the face of advanced voice conversion technologies. This work could inform the development of more robust watermarking methods and raise awareness about the potential misuse of voice conversion technologies in various applications, including copyright infringement and misinformation.
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
Primary: Idiap Research Institute
All Institutions: Idiap Research Institute, Uniphore
The main contribution of this paper is the introduction of a novel text-only adaptation method for LLM-based ASR, which effectively preserves cross-modal alignment while improving performance in various adaptation scenarios. This work represents a meaningful advancement in the field, addressing a critical challenge in ASR systems and providing a practical solution for domain adaptation.
The proposed methodology introduces a novel approach to text-only adaptation in LLM-based ASR by framing the adaptation as a text denoising task. This reframing is innovative as it allows the model to learn from noisy text inputs without requiring additional parameters or architectural changes. The use of a multi-view noise-driven batching strategy is particularly effective in maintaining the alignment between speech and text modalities, which is a critical aspect of ASR systems. The authors provide a clear explanation of how the noise function is constructed and how it facilitates the training process, making the methodology both sound and theoretically grounded.
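The text-denoising framing can be made concrete with a small example of how training pairs might be built from target-domain transcripts; the corruption probabilities and procedure are my own assumptions, not the paper's noise function.

```python
# Sketch: corrupt clean transcripts with ASR-like noise (substitutions,
# deletions, insertions) to form (noisy input -> clean target) training pairs.
import random


def asr_like_noise(words, p_sub=0.1, p_del=0.05, p_ins=0.05, vocab=None):
    vocab = vocab or words
    noisy = []
    for w in words:
        r = random.random()
        if r < p_del:
            continue                                 # deletion
        if r < p_del + p_sub:
            noisy.append(random.choice(vocab))       # substitution
        else:
            noisy.append(w)
        if random.random() < p_ins:
            noisy.append(random.choice(vocab))       # insertion
    return noisy


clean = "please schedule a follow up call with the claims team".split()
noisy = asr_like_noise(clean)
# Training pair for the LLM: input = noisy transcript, target = clean transcript.
print(" ".join(noisy), "->", " ".join(clean))
```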
The experimental evaluation is thorough, utilizing two distinct datasets that represent realistic conversational scenarios. The results demonstrate significant improvements in WER across various adaptation scenarios, showcasing the effectiveness of the proposed method compared to existing techniques. The inclusion of ablation studies further strengthens the findings by isolating the contributions of key components in the training process. However, the paper could benefit from additional details on the experimental setup, such as hyperparameter tuning and validation strategies.
The paper provides a solid foundation for reproducibility with a detailed description of the experimental setup, including the models used and the training process. However, the lack of publicly available code or datasets limits the ability for other researchers to fully replicate the results. Including a link to a GitHub repository or similar would enhance the reproducibility of the findings.
One limitation of the proposed approach is its reliance on the quality of the noise function, which may not perfectly emulate the outputs of a speech projector. Additionally, while the method shows promise in various adaptation scenarios, performance still lags behind audio-based adaptation, indicating that further improvements are needed. The paper also does not address the potential computational costs associated with the proposed training strategy.
This research has significant implications for the field of ASR, particularly in scenarios where audio data is scarce or expensive to obtain. By enabling effective adaptation using only text data, the proposed method could facilitate the deployment of ASR systems in diverse domains, enhancing accessibility and usability in real-world applications. The approach could also inspire further research into alternative adaptation strategies that leverage text data in innovative ways.
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
Primary: Idiap Research Institute
All Institutions: Idiap Research Institute, Uniphore
The main contribution of this paper is a novel text-only adaptation method for LLM-based ASR systems that reformulates the adaptation challenge as a text denoising task, achieving substantial performance improvements without requiring additional parameters or architectural changes. This work represents a meaningful advancement in the field of automatic speech recognition, particularly in the context of domain adaptation.
The proposed methodology effectively reframes the adaptation of LLM-based ASR systems as a text denoising task, which is innovative in its approach to preserving cross-modal alignment without requiring additional parameters or architectural changes. The multi-view noise-driven batching strategy is a clever solution to mitigate catastrophic forgetting, allowing the model to leverage both source and target domain data effectively. However, while the method is lightweight, it relies heavily on the quality of the noise function and the careful balancing of batch components, which could be sensitive to implementation details.
The experiments are well-structured, assessing the proposed method across three distinct adaptation scenarios: in-domain, out-of-domain, and cross-domain. The results demonstrate significant improvements in WER, with the method outperforming existing text-only adaptation techniques. The use of multiple datasets adds robustness to the findings, although the reliance on specific conversational corpora may limit generalizability to other domains.
The paper provides sufficient implementation details regarding the experimental setup, including model architectures, training parameters, and dataset descriptions. However, the lack of publicly available code or data limits the reproducibility of the results, which is a significant consideration for the research community.
One limitation is the potential sensitivity of the proposed method to the choice of noise function and the batch composition strategy. Additionally, while the method shows improvements, it still falls short of performance levels achieved with audio-based adaptation, indicating that further refinements are needed. The reliance on specific datasets may also restrict the applicability of the findings to other domains.
The research has significant implications for the field of ASR, particularly in scenarios where audio data is scarce or expensive to obtain. The ability to adapt LLM-based ASR systems using only text data could enhance the accessibility and scalability of ASR technologies across various applications, including assistive technologies and conversational AI.
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models (Dia2, Maya1, and MeloTTS), representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the DailyDialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.
Primary: UncovAI
All Institutions: UncovAI, GENCI-IDRIS
The main contribution of this paper is a systematic evaluation of advanced TTS models against various detection frameworks, revealing significant challenges in audio deepfake detection. This work is crucial in addressing the evolving landscape of synthetic speech technologies and highlights the necessity for integrated detection strategies to combat emerging threats in audio forensics.
The paper employs a systematic approach to evaluate the performance of three advanced TTS models (Dia2, Maya1, and MeloTTS) against multiple detection frameworks. The authors construct a novel dataset of 12,000 synthetic audio samples, ensuring a diverse representation of modern TTS architectures. The methodology is well-structured, utilizing a multi-faceted detection strategy that combines semantic, structural, and signal-level analyses. However, the reliance on specific models and the absence of a broader range of TTS systems may limit the generalizability of the findings.
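The multi-view principle can be illustrated with a trivial score-fusion sketch; the detector names, weights, and fusion rule are invented and only show how scores from complementary analysis levels might be combined into one decision score.

```python
# Toy illustration of multi-view score fusion (weights are arbitrary).
import numpy as np


def fuse_scores(semantic, structural, signal, weights=(0.4, 0.3, 0.3)):
    """Each input: probability that the audio is synthetic, in [0, 1]."""
    views = np.array([semantic, structural, signal])
    return float(np.dot(weights, views))


# A sample that only one view flags still raises the fused synthetic score.
print(fuse_scores(semantic=0.2, structural=0.9, signal=0.3))
```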
The experiments are comprehensive, with a clear focus on evaluating the performance of different detection models against the generated audio samples. The use of various metrics (EER, AUC, F1-Score) provides a robust framework for assessing detection capabilities. The results indicate significant variability in detector performance, particularly highlighting the challenges posed by LLM-based synthesis. The paper successfully demonstrates the limitations of single-paradigm detectors, emphasizing the need for integrated detection strategies.
The paper lacks detailed implementation specifics, such as hyperparameters, training protocols, and model architectures, which may hinder reproducibility. While the methodology is described, the absence of code or supplementary materials limits the ability for other researchers to replicate the experiments fully.
One limitation is the focus on only three TTS models, which may not encompass the full spectrum of current TTS technologies. Additionally, the dataset is derived from a single source (DailyDialog), potentially introducing biases that could affect the generalizability of the results. The paper also does not address the potential for adversarial attacks on detection models, which is a critical aspect in real-world applications.
This research has significant implications for the fields of audio forensics and security, particularly as TTS technologies continue to evolve. The findings underscore the importance of developing robust detection frameworks that can adapt to new generative mechanisms. The work could inform future research directions and the development of more resilient audio deepfake detection systems.
Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious risks of criminal misuse, as they can synthesize the voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upon request. Existing approaches, reliant on retraining, are costly and limited to speakers seen in the training set. We present TruS, a training-free speaker unlearning framework that shifts the paradigm from data deletion to inference-time control. TruS steers identity-specific hidden activations to suppress target speakers while preserving other attributes (e.g., prosody and emotion). Experimental results show that TruS effectively prevents voice generation for both seen and unseen opt-out speakers, establishing a scalable safeguard for speech synthesis. The demo and code are available at http://mmai.ewha.ac.kr/trus.
Primary: Ewha Womans University
All Institutions: Ewha Womans University
The main contribution of this paper is the introduction of TruS, a training-free speaker unlearning framework that effectively prevents the generation of specific speaker identities in zero-shot TTS models. This work represents a meaningful advancement in the field of audio machine learning, addressing critical privacy concerns while maintaining the expressivity of synthesized speech.
The proposed TruS framework introduces a novel approach to speaker unlearning by focusing on inference-time control rather than traditional data deletion. This is a significant shift in perspective, as it allows specific speaker identities to be suppressed without retraining the model. The methodology is well structured, leveraging hidden activations in a way that preserves other speech attributes, which is a clever balance between privacy and the expressivity of TTS systems. However, the paper would benefit from a more detailed explanation of the underlying mechanism, particularly how the relevant activations are identified and manipulated.
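Since the steering mechanism is only described at a high level, the sketch below illustrates one plausible realisation: projecting out an offline-estimated speaker-identity direction from intermediate hidden states. The function, tensor shapes, and single-direction assumption are hypothetical and may differ from the actual TruS procedure.

```python
import torch

def suppress_speaker_direction(hidden: torch.Tensor,
                               speaker_dir: torch.Tensor,
                               alpha: float = 1.0) -> torch.Tensor:
    """Remove the component of hidden states lying along a speaker-identity
    direction. hidden: (batch, time, dim); speaker_dir: (dim,), assumed to be
    estimated offline from reference utterances of the opt-out speaker.
    alpha = 1.0 fully projects out the direction; smaller values attenuate it."""
    d = speaker_dir / speaker_dir.norm()
    proj = (hidden @ d).unsqueeze(-1)      # scalar projection per frame, (B, T, 1)
    return hidden - alpha * proj * d       # steered activations

# Conceptually, such a hook would be applied to intermediate TTS layers at
# inference time, leaving prosody/emotion components untouched.
hidden = torch.randn(2, 50, 512)
speaker_dir = torch.randn(512)
steered = suppress_speaker_direction(hidden, speaker_dir)
```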
The experimental results are compelling, demonstrating the effectiveness of TruS in preventing voice generation for both seen and unseen opt-out speakers. The evaluation metrics used appear to be appropriate for assessing the framework's performance, and the results are presented clearly. However, the paper would be strengthened by including more extensive comparisons with existing methods, particularly those that involve retraining, to better contextualize the advantages of the proposed approach.
The paper mentions that the demo and code are available online, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics that would help other researchers replicate the results. Providing a clearer description of the datasets used, the experimental setup, and the parameters would enhance reproducibility.
One limitation is the reliance on the model's ability to suppress specific identities without retraining, which may not be universally applicable across all TTS architectures. Additionally, the paper does not address potential edge cases where the suppression might fail or lead to unintended consequences in voice synthesis. The scalability of the approach in real-world applications is also not thoroughly discussed.
The implications of this research are significant, particularly in the context of privacy and ethical concerns surrounding voice synthesis technologies. By providing a method for speaker unlearning, this work could help mitigate risks associated with unauthorized voice generation, thereby enhancing user trust in TTS systems. The framework has potential applications in various fields, including entertainment, security, and personal privacy.
Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures that each quantizer is fully trained and utilized. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.
Primary: Hangzhou Dianzi University
All Institutions: Hangzhou Dianzi University
The main contribution of this work is the introduction of SwitchCodec, a high-fidelity neural audio codec that utilizes Residual Experts Vector Quantization (REVQ) to improve audio compression efficiency and adaptability. This innovative approach addresses the limitations of existing codecs, demonstrating substantial improvements in audio quality across a range of bitrates, thereby advancing the field of neural audio coding.
The paper introduces SwitchCodec, which innovatively employs Residual Experts Vector Quantization (REVQ) to enhance audio compression. The methodology effectively combines a shared quantizer with dynamically routed expert quantizers, addressing the limitations of fixed quantization structures. The dual-path design allows for improved compression efficiency by decoupling bitrate from codebook capacity, which is a significant advancement over traditional methods. The use of a gating network to select quantizers based on audio content is a notable feature that enhances adaptability and performance.
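The routing idea can be summarised with a short sketch: a shared quantizer processes every frame, and a gating network selects which expert quantizers refine the residual. The class below is an assumption-laden illustration (nearest-neighbour lookup, utterance-level top-k routing, no training losses or straight-through estimator), not the SwitchCodec implementation.

```python
import torch
import torch.nn as nn

class REVQSketch(nn.Module):
    """Illustrative residual-experts VQ: shared codebook plus gated expert codebooks."""
    def __init__(self, dim=128, codebook_size=256, num_experts=4, top_k=2):
        super().__init__()
        self.shared = nn.Embedding(codebook_size, dim)
        self.experts = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    @staticmethod
    def _quantize(x, codebook):
        # nearest codeword per frame: x (B, T, D), codebook.weight (K, D)
        diff = x.unsqueeze(-2) - codebook.weight          # (B, T, K, D)
        idx = diff.pow(2).sum(-1).argmin(dim=-1)          # (B, T)
        return codebook(idx)

    def forward(self, x):                                 # x: (B, T, D)
        q_shared = self._quantize(x, self.shared)         # shared quantizer first
        residual = x - q_shared
        gate_logits = self.gate(x.mean(dim=1))            # utterance-level routing, (B, E)
        topk = gate_logits.topk(self.top_k, dim=-1).indices
        outputs = []
        for b in range(x.size(0)):
            q_b, res_b = q_shared[b:b+1], residual[b:b+1]
            for e in topk[b].tolist():                    # active experts refine the residual
                q_e = self._quantize(res_b, self.experts[e])
                q_b = q_b + q_e
                res_b = res_b - q_e
            outputs.append(q_b)
        return torch.cat(outputs, dim=0)

# toy usage: 2 utterances, 100 frames of 128-d encoder features
model = REVQSketch()
z = torch.randn(2, 100, 128)
print(model(z).shape)   # torch.Size([2, 100, 128])
```

Dropping or adding expert quantizers at inference is what would give the variable-bitrate behaviour described in the abstract.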
The experiments are robust, utilizing multiple datasets (VCTK, LibriTTS, FMA, Common Voice) and employing both objective metrics (ViSQOL, Mel-spectrogram distance, STFT distance, PESQ) and subjective listening tests (MUSHRA). The results demonstrate that SwitchCodec outperforms existing baselines (DAC and EnCodec) across various bitrates, indicating its effectiveness in maintaining audio quality while optimizing bitrate. The thorough evaluation across different audio types strengthens the paper's claims.
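Of the objective metrics listed, the spectral distances are simple enough to sketch; the example below computes a multi-scale log-mel distance with torchaudio, where the FFT sizes, sample rate, and averaging scheme are illustrative assumptions (ViSQOL and PESQ require dedicated tools).

```python
import torch
import torchaudio

def mel_distance(ref: torch.Tensor, est: torch.Tensor, sample_rate: int = 24000,
                 n_ffts=(512, 1024, 2048)) -> torch.Tensor:
    """L1 distance between log-mel spectrograms of reference and decoded audio,
    averaged over several FFT sizes (an illustrative multi-scale variant)."""
    total = torch.zeros(())
    for n_fft in n_ffts:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=64)
        ref_m = torch.log(mel(ref) + 1e-5)
        est_m = torch.log(mel(est) + 1e-5)
        total = total + (ref_m - est_m).abs().mean()
    return total / len(n_ffts)

# usage on a one-second dummy signal pair
ref = torch.randn(1, 24000)
est = ref + 0.01 * torch.randn(1, 24000)
print(float(mel_distance(ref, est)))
```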
The implementation details are adequately described, including training parameters, dataset preparation, and model architecture. The authors provide a demo URL with audio samples, which enhances reproducibility. However, the absence of a public code repository limits the ease of full reproducibility.
While the proposed method shows significant improvements, the paper does not extensively discuss potential limitations, such as the computational overhead introduced by the routing mechanism or the need for careful tuning of the gating network. Additionally, the scalability of the approach to more complex audio types or real-time applications is not addressed.
The advancements presented in SwitchCodec have the potential to significantly impact audio streaming and storage solutions, particularly in bandwidth-constrained environments. The ability to adaptively allocate bitrate based on content complexity could lead to more efficient use of resources in various applications, including music streaming, telecommunications, and multimedia content delivery.
Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. Surprisingly, optimizing for acoustic, speaker, or linguistic diversity yields no clear improvements over random sampling. Instead, we find that prioritizing the longest utterances achieves superior ASR results while using only half the original dataset, reducing pre-training time by 24% on a large corpus. These findings suggest that for pre-training speech SSL models, data length is a more critical factor than either data diversity or overall data quantity for performance and efficiency, offering a new perspective on data selection strategies in SSL speech processing.
Primary: University of Cambridge
All Institutions: University of Cambridge, Laboratoire Informatique d'Avignon, Laboratoire d'Informatique de Grenoble
This paper provides valuable insights into the importance of data selection strategies in self-supervised speech models, highlighting the critical role of utterance length over diversity. The findings challenge existing assumptions in the field and offer a new perspective that could reshape future research and applications in speech processing.
The paper presents a systematic exploration of data selection strategies for pre-training self-supervised speech models, specifically focusing on the impact of utterance length versus diversity in acoustic, speaker, and linguistic features. The methodology is robust, employing a large-scale dataset (Loquacious) and a well-defined experimental setup that includes various sampling strategies. However, the reliance on simple unsupervised data selection methods, while insightful, may not fully leverage more complex data selection techniques that could yield even more significant results.
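The winning selection strategy is simple enough to state as code: sort the pre-training pool by utterance duration and keep the longest clips until roughly half of the total duration is covered. The sketch below assumes a (utterance_id, duration) metadata format, which is not specified in the paper.

```python
def select_longest(utterances, keep_fraction=0.5):
    """utterances: list of (utt_id, duration_seconds) pairs for the pre-training
    pool. Returns the IDs of the longest utterances whose cumulative duration
    first reaches keep_fraction of the total (slight overshoot is allowed)."""
    total = sum(dur for _, dur in utterances)
    budget = keep_fraction * total
    selected, used = [], 0.0
    for utt_id, dur in sorted(utterances, key=lambda x: x[1], reverse=True):
        if used >= budget:
            break
        selected.append(utt_id)
        used += dur
    return selected

# toy example: the two longest clips cover roughly half of the pool
pool = [("a", 30.0), ("b", 25.0), ("c", 20.0), ("d", 15.0), ("e", 6.0), ("f", 4.0)]
print(select_longest(pool, keep_fraction=0.5))   # -> ['a', 'b']
```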
The experimental design is comprehensive, utilizing a substantial dataset and comparing multiple data selection strategies against baselines. The results clearly indicate that prioritizing longer utterances leads to better ASR performance, which is a significant finding. However, the paper could benefit from additional statistical analysis to further validate the robustness of the results across different datasets or conditions.
The authors commit to making their code public, which supports reproducibility. However, the paper does not provide a URL for the code repository, which would have made access easier for other researchers. Detailed descriptions of the training settings and hyperparameters are provided, enhancing reproducibility.
The study's limitations include a narrow focus on data length without exploring more sophisticated data selection techniques. Additionally, the findings may not generalize across different languages or domains, as the study is primarily based on English speech data. The paper also does not address the potential trade-offs between data length and other factors that might influence performance.
The findings have significant implications for the development of self-supervised learning models in speech processing, particularly in optimizing pre-training datasets for efficiency. By demonstrating that longer utterances can yield better performance, this research could influence future practices in data collection and model training, potentially leading to more efficient use of computational resources in the field.