Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as inputs, performing localization and producing robust spatial embeddings. We demonstrate that the Gemma 3n LLM can be fine-tuned to reason over "Spatial Audio Tokens" produced by PhaseCoder. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.
Primary: Google DeepMind
All Institutions: Google DeepMind
The paper presents a pioneering approach to spatial audio understanding for multimodal LLMs, significantly advancing the field by enabling robust reasoning over spatial audio tokens. The combination of innovative methodology and thorough experimental evaluation positions this work as a critical contribution to the intersection of audio processing and language models.
The methodology is robust, introducing PhaseCoder as a transformer-only spatial audio encoder that is microphone geometry-agnostic. The authors effectively leverage raw multichannel audio and microphone coordinates to produce spatial embeddings, which is a significant advancement over existing methods that are limited by fixed geometries. The use of a two-stage training strategy and synthetic data generation is well-justified, addressing the lack of real-world datasets. The architecture, including positional embeddings and the integration with the Gemma 3n LLM, is thoughtfully designed to enhance spatial reasoning capabilities.
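As a concrete illustration of how microphone geometry can be injected into a transformer encoder, the sketch below is a hypothetical simplification rather than the paper's actual PhaseCoder architecture: each microphone's 3D coordinates are embedded with a small MLP and added to that channel's audio-feature token before self-attention. The per-channel feature extractor, dimensions, and layer counts are all assumptions.

import torch
import torch.nn as nn

class GeometryAwareEncoder(nn.Module):
    """Toy geometry-agnostic encoder: one token per microphone channel,
    conditioned on that microphone's (x, y, z) position (illustrative only)."""
    def __init__(self, feat_dim=128, n_layers=4, n_heads=4):
        super().__init__()
        self.coord_embed = nn.Sequential(            # maps (x, y, z) -> feat_dim
            nn.Linear(3, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim)
        )
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, channel_feats, mic_coords):
        # channel_feats: (batch, n_mics, feat_dim)  per-channel audio features
        # mic_coords:    (batch, n_mics, 3)         microphone positions in meters
        tokens = channel_feats + self.coord_embed(mic_coords)
        return self.encoder(tokens)                  # (batch, n_mics, feat_dim)

# Usage with an arbitrary 6-microphone array (random placeholders):
enc = GeometryAwareEncoder()
spatial_emb = enc(torch.randn(2, 6, 128), torch.rand(2, 6, 3))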
The experimental evaluation is thorough, with clear benchmarks against state-of-the-art models like GI-DOAEnet. The results demonstrate that PhaseCoder achieves competitive performance on localization tasks, even outperforming existing models in certain scenarios. The evaluation of the fine-tuned LLM on spatial reasoning tasks is particularly noteworthy, showcasing the model's ability to handle complex queries related to spatial audio understanding. However, the reliance on synthetic datasets may raise questions about generalizability.
The paper provides detailed implementation details, including training configurations, data generation processes, and model architecture. While the methodology is well-documented, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing these resources to facilitate further research and validation.
The primary limitations include the assumption of static sources and the focus on single-speaker scenarios, which may not fully capture the complexities of real-world environments. Additionally, the model's performance could be impacted by the lack of explicit modeling of acoustic properties and dynamic sources. Future iterations should address these aspects to enhance robustness and applicability.
This work has significant implications for various applications, including assistive technologies for the hearing-impaired, improving human-robot interaction, and advancing embodied AI systems. By enabling spatial audio understanding across diverse devices, it promotes accessibility and adaptability in AI technologies, potentially transforming how users interact with their environments.
Emotion recognition is inherently ambiguous, with uncertainty arising both from rater disagreement and from discrepancies across modalities such as speech and text. There is growing interest in modeling rater ambiguity using label distributions. However, modality ambiguity remains underexplored, and multimodal approaches often rely on simple feature fusion without explicitly addressing conflicts between modalities. In this work, we propose AmbER$^2$, a dual ambiguity-aware framework that simultaneously models rater-level and modality-level ambiguity through a teacher-student architecture with a distribution-wise training objective. Evaluations on IEMOCAP and MSP-Podcast show that AmbER$^2$ consistently improves distributional fidelity over conventional cross-entropy baselines and achieves performance competitive with, or superior to, recent state-of-the-art systems. For example, on IEMOCAP, AmbER$^2$ achieves relative improvements of 20.3% on Bhattacharyya coefficient (0.83 vs. 0.69), 13.6% on R$^2$ (0.67 vs. 0.59), 3.8% on accuracy (0.683 vs. 0.658), and 4.5% on F1 (0.675 vs. 0.646). Further analysis across ambiguity levels shows that explicitly modeling ambiguity is particularly beneficial for highly uncertain samples. These findings highlight the importance of jointly addressing rater and modality ambiguity when building robust emotion recognition systems.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology
The main contribution of this paper is the introduction of AmbER$^2$, a dual ambiguity-aware emotion recognition framework that effectively models both rater and modality ambiguity, significantly improving the fidelity of emotion predictions in multimodal contexts. This work represents a meaningful advancement in the field of emotion recognition, particularly in its approach to handling the inherent complexities and uncertainties associated with emotional expression across different modalities.
The proposed AmbER$^2$ framework introduces a dual ambiguity-aware approach to emotion recognition, effectively addressing both rater and modality ambiguity through a teacher-student architecture. This methodology is innovative as it combines distribution-wise training objectives with adaptive guidance from modality-specific heads, which is a significant advancement over traditional feature fusion methods. The use of a weighted consistency loss that adjusts the influence of modality experts based on their reliability is particularly noteworthy, contributing to a more nuanced understanding of emotional cues across different modalities.
The experiments conducted on the IEMOCAP and MSP-Podcast datasets are comprehensive and well-structured, demonstrating the effectiveness of the proposed framework against conventional baselines and state-of-the-art systems. The reported improvements in distributional metrics (e.g., Bhattacharyya coefficient, R²) and classification metrics (e.g., accuracy, F1 score) provide strong evidence of the framework's performance. The analysis across different ambiguity levels adds depth to the evaluation, showcasing the framework's robustness in handling varying degrees of uncertainty.
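For readers unfamiliar with the distributional metrics quoted above, the following minimal sketch shows how a Bhattacharyya coefficient and an R² score between a predicted label distribution and a rater-aggregated distribution are typically computed; the exact per-sample aggregation used in the paper is not reproduced here and is an assumption.

import numpy as np

def bhattacharyya_coefficient(p, q, eps=1e-12):
    """Overlap between two categorical distributions (1.0 = identical)."""
    p = np.clip(p, eps, None) / np.sum(p)
    q = np.clip(q, eps, None) / np.sum(q)
    return float(np.sum(np.sqrt(p * q)))

def r_squared(y_true, y_pred):
    """Coefficient of determination over flattened distribution values."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Example: rater label distribution vs. model prediction over 4 emotion classes.
rater = np.array([0.5, 0.3, 0.1, 0.1])
model = np.array([0.4, 0.4, 0.1, 0.1])
print(bhattacharyya_coefficient(rater, model), r_squared(rater, model))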
The paper provides sufficient implementation details, including the architecture of the models, training parameters, and the datasets used. However, the absence of a publicly available code repository or demo URL limits reproducibility. Future work should consider releasing the code to facilitate validation and further exploration by the research community.
One limitation is the reliance on specific datasets (IEMOCAP and MSP-Podcast), which may not fully represent the diversity of real-world emotional expressions across cultures and languages. Additionally, while the framework shows promise in handling ambiguity, the complexity of the model may pose challenges in real-time applications. The paper could also benefit from a more detailed discussion on the computational efficiency and scalability of the proposed approach.
The findings of this research have significant implications for the development of more robust emotion recognition systems, which can enhance human-machine interactions in various applications, including virtual assistants, mental health monitoring, and customer service automation. By addressing ambiguity in emotion recognition, the framework paves the way for more human-aligned affective computing systems that can better understand and respond to human emotions.
Spatial information is a critical clue for multi-channel multi-speaker target speech recognition. Most state-of-the-art multi-channel Automatic Speech Recognition (ASR) systems extract spatial features only during the speech separation stage, followed by standard single-channel ASR on the separated speech. This approach results in an inefficient, lengthy pipeline and sub-optimal ASR performance due to the accumulated errors from preprocessing modules. Furthermore, most spatial feature extraction methods depend on the knowledge of speaker positions and microphone topology, making the systems reliant on specific settings and challenging to adapt to new equipment. In this work, we propose a solution to these issues with a lightweight embedding module named SpatialEmb, which extracts and encodes spatial information directly for the ASR model, supporting both fixed and arbitrary microphone topologies. We conduct comprehensive experiments on AliMeeting, a real meeting corpus, to determine the optimal model design for SpatialEmb in terms of both performance and efficiency. Our best model, trained on the 105-hour Train-Ali-far set, achieves 17.04% and 20.32% character error rates (CER) on the Eval and Test sets, establishing a new state-of-the-art result with the same training data.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Tencent AI Lab
The main contribution of this paper is the development of SpatialEmb, a novel embedding module that enhances multi-channel ASR performance by directly integrating spatial information, leading to improved efficiency and accuracy in speech recognition tasks. This work represents a meaningful step forward in the field of audio processing and ASR, addressing critical limitations of existing systems.
The paper introduces a novel embedding module, SpatialEmb, which directly extracts and encodes spatial information for ASR, bypassing traditional multi-stage systems that rely on preprocessing. The methodology is well-structured, employing a lightweight design that supports arbitrary microphone topologies. The use of various embedding structures (Conv2d, ConvNext, GRU-Conv2d) and the parameter-free divide-average-concatenate (DAC) method to enhance efficiency is particularly innovative. The integration of spatial features with spectral features in a 1-stage ASR system is a significant advancement over existing methods.
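The divide-average-concatenate idea can be read as producing a fixed-size spatial feature from a variable number of channels. The toy function below assumes DAC splits the channel axis into a fixed number of groups, averages within each group, and concatenates the group means; this is one plausible interpretation for illustration, not the paper's exact definition.

import numpy as np

def divide_average_concatenate(feats, n_groups=4):
    """Toy DAC-style pooling: (n_channels, dim) -> (n_groups * dim,),
    independent of how many channels the array provides (interpretation only)."""
    n_channels, dim = feats.shape
    groups = np.array_split(np.arange(n_channels), n_groups)
    pooled = [feats[idx].mean(axis=0) if len(idx) else np.zeros(dim) for idx in groups]
    return np.concatenate(pooled)

# 8-channel and 3-channel arrays map to the same output size:
print(divide_average_concatenate(np.random.randn(8, 64)).shape)   # (256,)
print(divide_average_concatenate(np.random.randn(3, 64)).shape)   # (256,)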
The authors conduct comprehensive experiments on the AliMeeting dataset, demonstrating the effectiveness of their proposed model. The results show a clear improvement in character error rates (CER) compared to previous state-of-the-art systems, establishing the proposed method as a competitive alternative. The evaluation metrics are robust, and the experiments are well-documented, providing a thorough comparison with existing techniques.
The paper references the Icefall framework for implementation, which aids reproducibility. However, the lack of a demo or direct access to the code repository limits the ease with which other researchers can replicate the results. Detailed descriptions of the experimental setup and parameters are provided, which is beneficial for reproducibility.
One limitation is the reliance on the AliMeeting dataset, which may not generalize well to other domains or languages. Additionally, while the proposed method supports arbitrary microphone topologies, the performance in real-world scenarios with varying conditions remains to be fully validated. The computational efficiency, while improved, still may not meet the demands of all real-time applications.
The advancements in multi-channel ASR systems have significant implications for applications in real-time communication, such as virtual meetings and automated transcription services. The ability to handle arbitrary microphone arrays enhances the adaptability of ASR systems in diverse environments, potentially leading to broader adoption in various industries.
Audio-to-image retrieval offers an interpretable alternative to audio-only classification for bioacoustic species recognition, but learning aligned audio-image representations is challenging due to the scarcity of paired audio-image data. We propose a simple and data-efficient approach that enables audio-to-image retrieval without any audio-image supervision. Our proposed method uses text as a semantic intermediary: we distill the text embedding space of a pretrained image-text model (BioCLIP-2), which encodes rich visual and taxonomic structure, into a pretrained audio-text model (BioLingual) by fine-tuning its audio encoder with a contrastive objective. This distillation transfers visually grounded semantics into the audio representation, inducing emergent alignment between audio and image embeddings without using images during training. We evaluate the resulting model on multiple bioacoustic benchmarks. The distilled audio encoder preserves audio discriminative power while substantially improving audio-text alignment on focal recordings and soundscape datasets. Most importantly, on the SSW60 benchmark, the proposed approach achieves strong audio-to-image retrieval performance exceeding baselines based on zero-shot model combinations or learned mappings between text embeddings, despite not training on paired audio-image data. These results demonstrate that indirect semantic transfer through text is sufficient to induce meaningful audio-image alignment, providing a practical solution for visually grounded species recognition in data-scarce bioacoustic settings.
Primary: Inria, LIRMM, Université de Montpellier
All Institutions: Inria, LIRMM, Université de Montpellier, Earth Species Project, University of Kassel
The main contribution of this paper is the introduction of a novel contrastive distillation method for audio-to-image retrieval that effectively utilizes text as a semantic intermediary, significantly advancing the field of bioacoustic species recognition. The technical contributions are substantial, providing a practical solution to a challenging problem in a data-scarce environment, and the methodology is both innovative and well-executed, with promising experimental results.
The methodology presented in this paper is innovative as it proposes a contrastive distillation approach to bridge audio and image modalities without requiring paired data. By leveraging a pretrained image-text model (BioCLIP-2) to enhance the audio-text model (BioLingual), the authors effectively create a semantic intermediary that facilitates meaningful audio-to-image retrieval. The use of a contrastive objective for fine-tuning the audio encoder is well-justified and demonstrates a clear understanding of the underlying challenges in cross-modal representation learning. The simplicity of the approach, which avoids complex multi-objective training and direct image supervision, is a significant strength.
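A minimal sketch of the kind of text-anchored contrastive distillation described here, assuming precomputed (frozen) BioCLIP-2 text embeddings for each clip's species label and a trainable audio encoder; the temperature, batch construction, and absence of a projection head are assumptions, not the paper's settings.

import torch
import torch.nn.functional as F

def distillation_loss(audio_emb, teacher_text_emb, temperature=0.07):
    """Contrastive alignment of audio embeddings to frozen teacher text embeddings.
    audio_emb:        (batch, dim) output of the trainable audio encoder
    teacher_text_emb: (batch, dim) frozen text embedding of each clip's species label
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(teacher_text_emb, dim=-1)
    logits = a @ t.T / temperature                   # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: audio-to-text and text-to-audio directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# One training step with random placeholders (real text_emb would come from BioCLIP-2):
audio_emb = torch.randn(16, 512, requires_grad=True)
text_emb = torch.randn(16, 512)
loss = distillation_loss(audio_emb, text_emb)
loss.backward()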
The experiments are robust, utilizing multiple bioacoustic benchmarks to validate the effectiveness of the proposed method. The results indicate that the distilled audio encoder not only improves audio-to-image retrieval performance but also preserves the discriminative capabilities of the audio model. The comparisons against various baselines, including zero-shot and text-embedding mapping strategies, provide a comprehensive evaluation of the method's effectiveness. The use of independent datasets for validation strengthens the credibility of the findings.
The paper mentions that the code will be publicly available after review, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics, such as hyperparameter settings, training duration, and computational resources, which are essential for other researchers to replicate the experiments fully.
One limitation of the study is the reliance on the quality and representativeness of the textual descriptions used for training the audio encoder. If the textual descriptions are not sufficiently diverse or comprehensive, it may impact the generalization of the model. Additionally, while the approach demonstrates strong performance on the evaluated datasets, its applicability to other domains or species not represented in the training data remains uncertain.
The implications of this research are significant for biodiversity monitoring and conservation efforts, particularly in scenarios where paired audio-image data is scarce. By enabling effective audio-to-image retrieval, the proposed method can assist researchers and conservationists in identifying species based on audio recordings, thus enhancing ecological studies and wildlife conservation strategies.
Diffusion models have recently set new benchmarks in Speech Enhancement (SE). However, most existing score-based models treat speech spectrograms merely as generic 2D images, applying uniform processing that ignores the intrinsic structural sparsity of audio, which results in inefficient spectral representation and prohibitive computational complexity. To bridge this gap, we propose DVPD, an extremely lightweight Dual-View Predictive Diffusion model, which uniquely exploits the dual nature of spectrograms as both visual textures and physical frequency-domain representations across both training and inference stages. Specifically, during training, we optimize spectral utilization via the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which preserves critical low-frequency harmonics while pruning high-frequency redundancies. Simultaneously, we introduce a Lightweight Image-based Spectro-Awareness (LISA) module to capture features from a visual perspective with minimal overhead. During inference, we propose a Training-free Lossless Boost (TLB) strategy that leverages the same dual-view priors to refine generation quality without any additional fine-tuning. Extensive experiments across various benchmarks demonstrate that DVPD achieves state-of-the-art performance while requiring only 35% of the parameters and 40% of the inference MACs compared to the SOTA lightweight model PGUSE. These results highlight DVPD's superior ability to balance high-fidelity speech quality with extreme architectural efficiency. Code and audio samples are available at the anonymous website: https://anonymous.4open.science/r/dvpd_demo-E630
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology, Tsinghua University, Sun Yat-sen University
The paper presents a significant contribution to the field of speech enhancement by introducing a novel dual-view approach that balances high-fidelity speech quality with computational efficiency. The comprehensive methodology and rigorous experimental evaluation underscore its potential impact on future research and applications in audio processing.
The proposed Dual-View Predictive Diffusion (DVPD) model introduces a novel approach to speech enhancement by leveraging the dual nature of spectrograms as both visual textures and physical frequency-domain representations. The methodology includes the Frequency-Adaptive Non-uniform Compression (FANC) encoder, which effectively preserves critical low-frequency harmonics while reducing high-frequency redundancies, and the Lightweight Image-based Spectro-Awareness (LISA) module, which captures features from a visual perspective. The Training-free Lossless Boost (TLB) strategy further enhances the model's performance during inference without additional training, showcasing a well-thought-out integration of predictive and generative paradigms.
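To make the frequency-adaptive compression idea concrete, the sketch below keeps low-frequency spectrogram bins at full resolution and average-pools the higher bins; the cutoff and pooling factor are arbitrary illustrative choices and not the FANC encoder's learned behavior.

import numpy as np

def nonuniform_freq_compress(spec, low_bins=64, pool=4):
    """Toy non-uniform compression of a magnitude spectrogram (freq, time):
    keep the lowest `low_bins` rows, average-pool the remaining rows by `pool`."""
    low = spec[:low_bins]                               # preserved harmonics region
    high = spec[low_bins:]
    n = (high.shape[0] // pool) * pool                  # drop the ragged remainder
    high_pooled = high[:n].reshape(-1, pool, spec.shape[1]).mean(axis=1)
    return np.concatenate([low, high_pooled], axis=0)

spec = np.abs(np.random.randn(257, 200))                # e.g. 512-point STFT magnitudes
print(nonuniform_freq_compress(spec).shape)             # (64 + 48, 200) = (112, 200)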
The experiments are extensive, covering various benchmarks including WSJ0-UNI and VBDMD, demonstrating the model's effectiveness across different distortion scenarios. The results indicate that DVPD achieves state-of-the-art performance while significantly reducing computational complexity compared to existing models. The comprehensive evaluation metrics used, such as PESQ and ESTOI, provide a robust assessment of the model's capabilities.
The paper includes detailed implementation details, including training configurations, loss functions, and evaluation metrics, which are essential for reproducibility. However, the absence of a public code repository limits the ease of reproduction for other researchers.
While the model demonstrates impressive performance, it may still struggle with certain types of distortions not covered in the training datasets. Additionally, the reliance on specific hyperparameters for the TLB strategy may introduce variability in performance across different applications.
The advancements presented in this paper have significant implications for real-world applications in speech enhancement, particularly in noisy environments. The lightweight nature of the model makes it suitable for deployment in resource-constrained settings, potentially benefiting various industries, including telecommunications and assistive technologies.
Imperceptible text-based speech editing allows users to modify spoken content by altering the transcript. It demands that modified segments fuse seamlessly with the surrounding context. Prevalent methods operating in the acoustic space suffer from inherent content-style entanglement, leading to generation instability and boundary artifacts. In this paper, we propose a novel framework grounded in the principle of "Edit Content, Preserve Acoustics". Our approach relies on two core components: (1) Structural Foundations, which decouples editing into a stable semantic space while delegating acoustic reconstruction to a Flow Matching decoder; and (2) Perceptual Alignment, which employs a novel Self-Consistency Rewards Group Relative Policy Optimization. By leveraging a pre-trained Text-to-Speech model as an implicit critic -- complemented by strict intelligibility and duration constraints -- we effectively align the edited semantic token sequence with the original context. Empirical evaluations demonstrate that our method significantly outperforms state-of-the-art autoregressive and non-autoregressive baselines, achieving superior intelligibility, robustness, and perceptual quality.
Primary: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences
All Institutions: The State Key Laboratory of Multimodal Artificial Intelligence Systems, Chinese Academy of Sciences, School of Artificial Intelligence, University of Chinese Academy of Sciences, Department of Automation, Tsinghua University, Beijing National Research Center for Information Science and Technology, Tsinghua University
The paper presents a novel framework for imperceptible text-based speech editing that effectively separates content modification from acoustic reconstruction. This approach significantly advances the state of the art, addressing key challenges in speech editing and offering promising applications across multiple domains.
The proposed methodology introduces a novel framework for text-based speech editing that effectively decouples semantic content from acoustic features, addressing the limitations of existing methods that often lead to artifacts and instability. The use of a Flow Matching decoder for acoustic reconstruction and a Self-Consistency Rewards mechanism for perceptual alignment is innovative and well-justified, leveraging a pre-trained TTS model as an implicit critic. This dual-stage approach enhances both intelligibility and naturalness, making significant strides in the field of speech editing.
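Group Relative Policy Optimization is a known RL technique in which rewards for a group of candidate generations are normalized against the group's own statistics; the sketch below shows only that normalization step, with placeholder rewards standing in for the paper's self-consistency scores from the TTS critic and its intelligibility and duration constraints.

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize each candidate's reward
    by the mean and std of its own group of sampled generations."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Placeholder rewards for 8 candidate edited segments of the same utterance
# (illustrative numbers only, not produced by the paper's critic).
rewards = [0.71, 0.65, 0.80, 0.55, 0.62, 0.77, 0.69, 0.58]
print(group_relative_advantages(rewards))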
The experiments are comprehensive, utilizing a large-scale dataset (Libriheavy) and rigorous benchmarks for evaluation. The authors provide detailed comparisons against state-of-the-art models, demonstrating clear improvements in metrics such as WER, speaker similarity, and perceptual quality. The use of both objective and subjective metrics strengthens the evaluation, although further details on the statistical significance of results would enhance the robustness of the findings.
The paper includes sufficient implementation details, including training configurations and the architecture of the models used. However, the absence of a publicly available code repository limits full reproducibility. Providing access to the code and trained models would significantly enhance the paper's impact and allow for independent verification of results.
While the proposed method shows strong performance, the paper does not address potential limitations in terms of computational efficiency or the scalability of the approach to diverse languages or dialects. Additionally, the reliance on a pre-trained TTS model may introduce biases based on the training data used for that model.
The implications of this research are significant for various applications, including media production, accessibility technologies, and real-time speech editing in communication tools. The ability to edit speech seamlessly could enhance user experience and efficiency in numerous fields, from entertainment to education.
High-fidelity general audio compression at ultra-low bitrates is crucial for applications ranging from low-bandwidth communication to generative audio-language modeling. Traditional audio compression methods and contemporary neural codecs are fundamentally designed for waveform reconstruction. As a result, when operating at ultra-low bitrates, these methods degrade rapidly and often fail to preserve essential information, leading to severe acoustic artifacts and pronounced semantic distortion. To overcome these limitations, we introduce Generative Audio Compression (GAC), a novel paradigm shift from signal fidelity to task-oriented effectiveness. Implemented within the AI Flow framework, GAC is theoretically grounded in the Law of Information Capacity, which posits that abundant computational power can be leveraged at the receiver to offset extreme communication bottlenecks--exemplifying the More Computation, Less Bandwidth philosophy. By integrating semantic understanding at the transmitter with scalable generative synthesis at the receiver, GAC offloads the information burden to powerful model priors. Our 1.8B-parameter model achieves high-fidelity reconstruction of 32kHz general audio at an unprecedented bitrate of 0.275kbps. Even at 0.175kbps, it still preserves a strong intelligible audio transmission capability, which represents an approximately 3000x compression ratio, significantly outperforming current state-of-the-art neural codecs in maintaining both perceptual quality and semantic consistency.
Primary: Institute of Artificial Intelligence, China Telecom
All Institutions: Institute of Artificial Intelligence, China Telecom
The paper introduces a novel paradigm for audio compression that prioritizes semantic understanding and generative synthesis, achieving unprecedented performance at ultra-low bitrates. This work not only advances the state-of-the-art in audio compression but also opens new avenues for research in generative models and communication theory.
The proposed Generative Audio Compression (GAC) method represents a significant shift from traditional audio compression techniques by focusing on task-oriented effectiveness rather than pure signal fidelity. The integration of semantic understanding at the transmitter and generative synthesis at the receiver is a novel approach that leverages the Law of Information Capacity to optimize the trade-off between computation and bandwidth. The methodology is well-grounded in theoretical frameworks and employs advanced techniques such as latent-variable modeling and variational objectives, showcasing a comprehensive understanding of both audio processing and machine learning principles.
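The roughly 3000x compression figure quoted in the abstract can be sanity-checked with simple arithmetic, assuming 16-bit mono PCM at 32 kHz as the uncompressed reference (the paper's exact reference bitrate is not stated here).

# Uncompressed reference: 32 kHz * 16 bits, mono (assumed reference format)
raw_kbps = 32_000 * 16 / 1000          # 512 kbps
for coded_kbps in (0.275, 0.175):
    print(f"{coded_kbps} kbps -> {raw_kbps / coded_kbps:.0f}x compression")
# 0.275 kbps -> ~1862x, 0.175 kbps -> ~2926x, consistent with the ~3000x claim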
The experiments are robust, covering multiple audio domains (speech, general sound, and music) and employing both objective and subjective evaluation metrics. The results demonstrate GAC's superior performance in maintaining perceptual quality and semantic consistency at extremely low bitrates, significantly outperforming existing state-of-the-art methods. The use of diverse datasets and thorough evaluation metrics strengthens the credibility of the findings.
While the paper provides a detailed description of the methodology and experimental setup, it lacks explicit implementation details or links to code repositories, which could hinder reproducibility. The absence of a demo or project URL further limits the ability for others to replicate the results.
One notable limitation is the trade-off between perceptual quality and speaker identity preservation at lower bitrates, which could affect applications requiring high fidelity in speaker recognition. Additionally, the reliance on large model sizes may limit practical deployment in resource-constrained environments.
The implications of GAC are significant for applications in low-bandwidth communication and generative audio-language modeling, potentially transforming how audio is transmitted and processed in various contexts. The approach could lead to advancements in telecommunication, streaming services, and assistive technologies, making high-quality audio accessible even in challenging bandwidth scenarios.
Modern voice cloning (VC) can synthesize speech that closely matches a target speaker from only seconds of reference audio, enabling applications such as personalized speech interfaces and dubbing. In practical deployments, modern audio generation models inevitably encounter noisy reference audios, imperfect text prompts, and diverse downstream processing, which can significantly hurt robustness. Despite rapid progress in VC driven by autoregressive codec-token language models and diffusion-based models, robustness under realistic deployment shifts remains underexplored. This paper introduces RVCBench, a comprehensive benchmark that evaluates Robustness in VC across the full generation pipeline, including input variation, generation challenges, output post-processing, and adversarial perturbations, covering 10 robustness tasks, 225 speakers, 14,370 utterances, and 11 representative modern VC models. Our evaluation uncovers substantial robustness gaps in VC: performance can deteriorate sharply under common input shifts and post-processing; long-context and cross-lingual scenarios further expose stability limitations; and both passive noise and proactive perturbation influence generation robustness. Collectively, these findings provide a unified picture of how current VC models fail in practice and introduce a standardized, open-source testbed to support the development of more robust and deployable VC models. We open-source our project at https://github.com/Nanboy-Ronan/RVCBench.
Primary: The University of British Columbia
All Institutions: The University of British Columbia, Vector Institute
The main contribution of this paper is the introduction of RVCBench, a comprehensive benchmark for evaluating the robustness of voice cloning models under realistic conditions. This work significantly advances the understanding of the limitations of current voice cloning technologies and provides a valuable resource for future research aimed at improving their robustness and applicability.
The paper introduces RVCBench, a benchmark designed to evaluate the robustness of voice cloning models across various challenges. The methodology is comprehensive, covering a wide range of robustness tasks and including a significant dataset of 225 speakers and over 14,000 utterances. The authors systematically assess the performance of 11 modern voice cloning models under different conditions, which is a valuable approach to understanding the limitations of current technology. However, the paper could benefit from a more detailed explanation of how the robustness tasks were selected and the specific metrics used for evaluation.
The experiments are well-structured, with a clear focus on identifying performance gaps in voice cloning models under realistic conditions. The inclusion of various input variations and adversarial perturbations is a strong point, as it reflects real-world challenges. The results highlight significant robustness issues, which are crucial for advancing the field. However, the paper lacks a comparative analysis with existing benchmarks, which would strengthen its contributions.
The paper mentions that the project is open-sourced, which is a positive aspect for reproducibility. However, it lacks detailed implementation instructions or specific configurations used during experiments, which could hinder other researchers from replicating the results effectively.
One limitation is the potential bias in the selection of speakers and utterances, which may not represent the full diversity of voice characteristics in the real world. Additionally, while the benchmark covers various robustness tasks, it may not encompass all possible deployment scenarios that could affect voice cloning performance.
The findings of this paper have significant implications for the development of more robust voice cloning technologies, which could enhance applications in personalized speech interfaces and dubbing. By identifying and addressing robustness gaps, the research can contribute to safer and more reliable deployment of voice cloning systems in real-world applications.
We present ACE-Step v1.5, a highly efficient open-source music foundation model that brings commercial-grade generation to consumer hardware. On commonly used evaluation metrics, ACE-Step v1.5 achieves quality beyond most commercial music models while remaining extremely fast -- under 2 seconds per full song on an A100 and under 10 seconds on an RTX 3090. The model runs locally with less than 4GB of VRAM, and supports lightweight personalization: users can train a LoRA from just a few songs to capture their own style. At its core lies a novel hybrid architecture where the Language Model (LM) functions as an omni-capable planner: it transforms simple user queries into comprehensive song blueprints -- scaling from short loops to 10-minute compositions -- while synthesizing metadata, lyrics, and captions via Chain-of-Thought to guide the Diffusion Transformer (DiT). Uniquely, this alignment is achieved through intrinsic reinforcement learning relying solely on the model's internal mechanisms, thereby eliminating the biases inherent in external reward models or human preferences. Beyond standard synthesis, ACE-Step v1.5 unifies precise stylistic control with versatile editing capabilities -- such as cover generation, repainting, and vocal-to-BGM conversion -- while maintaining strict adherence to prompts across 50+ languages. This paves the way for powerful tools that seamlessly integrate into the creative workflows of music artists, producers, and content creators. The code, the model weights and the demo are available at: https://ace-step.github.io/ace-step-v1.5.github.io/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of ACE-Step v1.5, an efficient open-source music generation model that combines novel architectural elements with user-friendly personalization features. This work significantly advances the state of music generation technology, particularly for consumer hardware, while raising important questions regarding reproducibility and ethical implications in the field.
The methodology introduces a hybrid architecture that combines a Language Model (LM) with a Diffusion Transformer (DiT) to generate music. The use of intrinsic reinforcement learning to align the LM's planning capabilities with the DiT's synthesis process is a notable innovation. The model's ability to generate music based on simple user queries and to personalize outputs with minimal input data is a significant advancement in the field of music generation. However, the paper could benefit from a more detailed explanation of the reinforcement learning mechanism and how it mitigates biases.
The paper claims that ACE-Step v1.5 achieves superior performance on commonly used evaluation metrics compared to existing commercial models. The reported generation times are impressive, especially for consumer hardware, and the ability to run on low VRAM is a practical advantage. However, the paper lacks detailed experimental results, including quantitative comparisons with baseline models, which would strengthen the claims made about performance and efficiency.
The availability of code, model weights, and a demo is a positive aspect, promoting reproducibility. However, the paper does not provide sufficient details on the training process, dataset specifics, or evaluation metrics used, which are crucial for other researchers to replicate the results effectively.
One limitation is the lack of extensive evaluation on diverse datasets to validate the model's performance across various music genres and styles. Additionally, the reliance on intrinsic reinforcement learning may limit the model's adaptability to more complex user preferences that external reward models could capture. The paper also does not address potential ethical considerations regarding music generation and copyright issues.
The potential applications of ACE-Step v1.5 are vast, ranging from aiding music artists in their creative processes to providing tools for content creators. Its ability to generate high-quality music quickly and with low resource requirements could democratize music production, making it accessible to a broader audience. However, the implications of AI-generated music on the music industry and artist livelihoods should be carefully considered.
Query-based universal sound separation is fundamental to intelligent auditory systems, aiming to isolate specific sources from mixtures. Despite recent advances, existing methods continue to suffer from residual interference in complex acoustic scenes. This performance limitation stems largely from a data bottleneck: in-the-wild datasets contain weak labels and severe co-occurrence of events. These flaws induce models to learn spurious correlations between background noise and target categories instead of robust acoustic features. To address this, we propose an automated pipeline that eliminates co-occurrence of events by mining high-purity single-event segments from in-the-wild datasets via a semantically consistent synthesis protocol. Utilizing this pipeline, we constructed Hive, a high-quality synthetic dataset comprising 2.4k hours of raw audio. Experimental results demonstrate that, compared with the state-of-the-art model SAM-Audio, which was trained on a dataset $\sim$500 times larger than Hive, certain open-source models trained on Hive achieve competitive separation accuracy and perceptual quality. Moreover, these models exhibited remarkable zero-shot generalization on out-of-distribution evaluation benchmarks. These findings highlight that prioritizing purity of supervised signals enables significant data efficiency, offering a new paradigm for training robust auditory foundation models with reduced computational costs. Code and dataset are available at https://shandaai.github.io/Hive.
Primary: Tsinghua University
All Institutions: Tsinghua University, Shanda AI Research, Johns Hopkins University, Chinese Institute for Brain Research
The main contribution of this paper is the introduction of Hive, a high-quality synthetic dataset for query-based universal sound separation, which demonstrates that prioritizing data purity can lead to significant improvements in model performance with reduced computational costs. The comprehensive methodology and experimental validation provide a strong foundation for future research in audio separation and related fields.
The paper presents a novel automated pipeline for data cleaning and synthesis, addressing the critical issue of co-occurrence in audio datasets. The authors propose a comprehensive approach that includes ontology reconstruction, semantic-acoustic alignment, and a semantically consistent mixing strategy. This methodology is well-structured and demonstrates a clear understanding of the challenges in query-based universal sound separation (USS). The use of multimodal large models for semantic filtering is particularly innovative, as it enhances the purity of the training data, which is crucial for effective model training.
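A minimal sketch of the kind of controlled mixing such a synthesis protocol implies: a high-purity single-event segment is combined with a background whose event classes are disjoint from the target, at a chosen SNR. The SNR handling and the class-disjointness check are assumptions about the protocol, not its published specification.

import numpy as np

def mix_at_snr(target, background, snr_db):
    """Scale `background` so the mixture has the requested target-to-background SNR."""
    p_t = np.mean(target ** 2)
    p_b = np.mean(background ** 2) + 1e-12
    scale = np.sqrt(p_t / (p_b * 10 ** (snr_db / 10)))
    return target + scale * background

def make_mixture(target_clip, target_classes, bg_clip, bg_classes, snr_db=5.0):
    # Enforce no event co-occurrence between target and background (assumed rule).
    assert not set(target_classes) & set(bg_classes), "event classes must be disjoint"
    return mix_at_snr(target_clip, bg_clip, snr_db)

mix = make_mixture(np.random.randn(16000), {"dog_bark"},
                   np.random.randn(16000), {"rain"}, snr_db=5.0)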
The experimental results are robust, showcasing the effectiveness of the Hive dataset compared to existing large-scale datasets. The authors provide thorough evaluations using multiple models, demonstrating competitive performance in separation accuracy and perceptual quality. The zero-shot generalization capabilities of models trained on Hive further validate the dataset's utility. However, while the results are promising, the paper could benefit from additional comparative analyses with more diverse datasets to strengthen the claims.
The paper includes detailed implementation details and provides access to the dataset and code, which enhances reproducibility. The authors specify the training configurations and evaluation metrics used, allowing other researchers to replicate the experiments. However, the reliance on specific multimodal models for semantic alignment may limit reproducibility if those models are not widely accessible.
One notable limitation is the potential for bias in the automated pipeline, as it relies on model-based decisions that may propagate existing biases in the training data. Additionally, while the Hive dataset is designed to mitigate co-occurrence noise, it may not fully capture the complexities of real-world acoustic environments. The authors also acknowledge the ethical implications of their work, particularly concerning privacy and misuse of the technology.
The proposed methodology and dataset have significant implications for advancing computational auditory scene analysis and making robust auditory models more accessible. The focus on data efficiency could democratize AI applications in areas like immersive audio and assistive listening. However, the potential for misuse of the technology raises ethical concerns that need to be addressed through responsible deployment and usage guidelines.
We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We additionally report results on the AMI corpus (IHM-mix condition) to validate performance on standardized speech mixtures.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Honda Research Institute Japan
The paper presents CALM, a pioneering framework that effectively combines acoustic and linguistic cues for improved multi-speaker ASR performance. This comprehensive analysis highlights the framework's innovative methodology, rigorous experimental validation, and potential impact on the field of speech recognition.
The proposed CALM framework introduces a novel joint Contextual Acoustic-Linguistic Modeling approach for multi-speaker ASR, integrating target-speaker conditioning with dynamic vocabulary expansion. This end-to-end framework leverages speaker embeddings for target-speaker extraction and contextual biasing, addressing both acoustic and linguistic challenges in overlapping speech scenarios. The methodology is well-structured, employing advanced techniques such as Conformer and Transformer architectures, and includes a comprehensive loss function that combines multiple objectives to enhance performance.
The experiments are robust, utilizing multiple datasets (LibriSpeechMix, CSJMix, AMI) to validate the effectiveness of CALM across different languages and conditions. The reported results demonstrate substantial improvements in biased and unbiased word error rates, showcasing the framework's ability to enhance ASR performance in multi-speaker contexts. The use of various biasing list sizes and the detailed analysis of results provide a thorough evaluation of the framework's capabilities.
The paper provides sufficient implementation details, including architecture specifications, training procedures, and evaluation metrics. However, the lack of a public repository or demo URL limits the ease of reproducibility for external researchers. Clearer guidelines or access to the code would enhance the paper's reproducibility.
While CALM shows promising results, the paper acknowledges challenges such as increased insertion errors in conversational datasets like AMI, particularly for short utterances. The reliance on enrollment utterances may also limit practical applications in real-world scenarios where such data may not be readily available. Additionally, the performance degradation observed in certain conditions suggests that further optimization is needed for broader applicability.
The integration of acoustic and linguistic modeling in CALM has significant implications for personalized AI applications, particularly in multi-speaker ASR settings such as meetings and discussions. The advancements made could lead to more accurate transcription services, enhancing accessibility and usability in various domains, including education, business, and healthcare.
Recent advances have demonstrated the potential of decoder-only large language models (LLMs) for automatic speech recognition (ASR). However, enabling streaming recognition within this framework remains a challenge. In this work, we propose a novel streaming ASR approach that integrates a read/write policy network with monotonic chunkwise attention (MoChA) to dynamically segment speech embeddings. These segments are interleaved with label sequences during training, enabling seamless integration with the LLM. During inference, the audio stream is buffered until the MoChA module triggers a read signal, at which point the buffered segment together with the previous token is fed into the LLM for the next token prediction. We also introduce a minimal-latency training objective to guide the policy network toward accurate segmentation boundaries. Furthermore, we adopt a joint training strategy in which a non-streaming LLM-ASR model and our streaming model share parameters. Experiments on the AISHELL-1 and AISHELL-2 Mandarin benchmarks demonstrate that our method consistently outperforms recent streaming ASR baselines, achieving character error rates of 5.1% and 5.5%, respectively. The latency optimization results in a 62.5% reduction in average token generation delay with negligible impact on recognition accuracy.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Shaanxi Normal University, iFLYTEK Co, iFLYTEK Research
This paper presents a novel approach to streaming speech recognition that integrates large language models with advanced segmentation techniques, significantly improving both latency and accuracy in ASR systems. The comprehensive methodology and strong experimental results position this work as a meaningful contribution to the field of machine learning and speech recognition.
The proposed methodology leverages a read/write policy network integrated with monotonic chunkwise attention (MoChA) to facilitate real-time streaming ASR. This innovative approach allows for dynamic segmentation of audio inputs, which is a significant advancement over traditional methods that often rely on fixed-size audio chunks. The introduction of a minimal-latency training objective to optimize the segmentation boundaries is particularly noteworthy, as it addresses a critical challenge in streaming ASR systems. The joint training strategy that shares parameters between streaming and non-streaming models is also a clever way to enhance efficiency and performance.
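The buffered read/write behavior described above can be illustrated with a toy inference loop; read_probability below is a stand-in stub for the MoChA-based policy network and llm_next_token a stub for the decoder call, so this is a schematic of the control flow only, not the authors' implementation.

import numpy as np

def read_probability(buffer_frames):
    # Stub policy: in the real system a MoChA-based network scores the buffer.
    return min(1.0, len(buffer_frames) / 16)

def llm_next_token(segment, prev_token):
    # Stub decoder: the real system feeds the segment plus previous token to the LLM.
    return f"tok_{len(segment)}_{prev_token}"

def streaming_decode(frame_stream, threshold=0.5):
    buffer, prev_token, outputs = [], "<sos>", []
    for frame in frame_stream:
        buffer.append(frame)
        if read_probability(buffer) >= threshold:       # policy triggers a "read"
            prev_token = llm_next_token(buffer, prev_token)
            outputs.append(prev_token)
            buffer = []                                  # start buffering the next segment
    return outputs

print(streaming_decode(np.random.randn(40, 80)))         # 40 frames of 80-dim features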
The experiments conducted on the AISHELL-1 and AISHELL-2 Mandarin benchmarks are comprehensive and demonstrate the effectiveness of the proposed method. The reported character error rates (CER) of 5.1% and 5.5% are competitive, and the significant reduction in average token generation delay (62.5%) highlights the practical benefits of the approach. The use of ablation studies to validate the contributions of different components of the model adds rigor to the experimental evaluation.
The paper provides sufficient details regarding the model architecture, training strategy, and experimental setup, which should allow for reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the results.
One limitation of the study is the focus on Mandarin datasets, which may restrict the generalizability of the findings to other languages or dialects. Additionally, while the model shows promising results, the trade-off between latency and accuracy could be further explored, particularly in more diverse real-world scenarios.
The advancements in streaming ASR have significant implications for applications such as real-time transcription, live captioning, and interactive voice response systems. The ability to reduce latency while maintaining accuracy can enhance user experience in various settings, including education, customer service, and accessibility for individuals with hearing impairments.
Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.
Primary: Communication University of China
All Institutions: Ant Group, Communication University of China, Key Laboratory of Media Audio, Ministry of Education, State Key Laboratory of Media Convergence and Communication
The main contribution of this paper is the introduction of SDD-APALLM, a novel framework that enhances speech deepfake detection by explicitly exposing fine-grained acoustic evidence, thereby improving model robustness and interpretability. This work addresses a significant gap in the current methodologies for audio LLMs, providing a promising direction for future research in the field of audio processing and deepfake detection.
The proposed methodology, SDD-APALLM, innovatively enhances the accessibility of fine-grained acoustic evidence by integrating structured time-frequency representations alongside raw audio inputs. This approach effectively shifts the focus from semantic plausibility to acoustically grounded evidence, addressing a critical limitation in existing audio LLM-based speech deepfake detection methods. The use of Constant-Q Transform (CQT) to create visual tokens that highlight spectral structures linked to speech synthesis artifacts is particularly noteworthy, as it provides a clear mechanism for improving model interpretability and robustness.
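To make the core input transformation concrete, the sketch below shows one plausible way to turn a waveform into CQT-based "visual tokens" for an audio LLM; the sample rate, bin resolution, normalization, and patch size are assumptions for illustration, not details taken from SDD-APALLM.

```python
# Hedged sketch: preparing CQT-based "visual tokens" for an audio LLM.
# All hyperparameters below are illustrative assumptions.
import numpy as np
import librosa

def cqt_patches(wav_path, sr=16000, n_bins=84, bins_per_octave=12, patch=16):
    y, sr = librosa.load(wav_path, sr=sr)
    # Constant-Q transform: log-spaced frequency bins expose harmonic structure
    # and vocoder artifacts more clearly than a linear-frequency STFT.
    C = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins, bins_per_octave=bins_per_octave))
    C_db = librosa.amplitude_to_db(C, ref=np.max)          # (n_bins, frames)
    # Min-max normalize to [0, 1] so the spectrogram behaves like an image.
    C_img = (C_db - C_db.min()) / (C_db.max() - C_db.min() + 1e-8)
    # Chop the time axis into fixed-width patches, one "visual token" each.
    n_frames = (C_img.shape[1] // patch) * patch
    tokens = C_img[:, :n_frames].reshape(n_bins, -1, patch).transpose(1, 0, 2)
    return tokens  # (num_tokens, n_bins, patch)
```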
The experiments are comprehensive, involving both in-domain and cross-domain evaluations across multiple datasets (ASVspoof2019 LA and ASVspoof2021 LA). The results demonstrate significant improvements in detection accuracy and robustness when utilizing the proposed framework, particularly under conditions where traditional models struggle. The ablation studies effectively illustrate the contributions of different modalities and reinforce the claim that explicit acoustic evidence enhances performance.
The paper provides detailed implementation information, including model architecture, training objectives, and hyperparameters, which supports reproducibility. However, the absence of a publicly accessible code repository or demo limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world audio deepfakes. Additionally, while the approach improves robustness, it may still be susceptible to novel spoofing techniques that exploit different acoustic characteristics not covered in the training data.
The implications of this research extend to various applications in security and trustworthiness of speech-based systems, such as voice authentication and content verification. By improving the detection of speech deepfakes, this work contributes to safeguarding against misinformation and enhancing the integrity of audio communications.
Achieving precise and controllable emotional expression is crucial for producing natural and context-appropriate speech in text-to-speech (TTS) synthesis. However, many emotion-aware TTS systems, including large language model (LLM)-based designs, rely on scaling fixed emotion embeddings or external guidance, limiting their ability to model emotion-specific latent characteristics. To address this gap, we present EmoShift, a lightweight activation-steering framework incorporating an EmoSteer layer, which learns a steering vector for each target emotion in the output embedding space to capture its latent offset and maintain stable, appropriate expression across utterances and categories. With only 10M trainable parameters (less than 1/30 of full fine-tuning), EmoShift outperforms zero-shot and fully fine-tuned baselines in objective and subjective evaluations, enhancing emotional expressiveness while preserving naturalness and speaker similarity. Further analysis confirms the proposed EmoSteer layer's effectiveness and reveals its potential for controllable emotional intensity in speech synthesis.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of EmoShift, a lightweight activation-steering framework that significantly enhances emotional expressiveness in TTS systems while maintaining naturalness and speaker similarity. This work represents a meaningful advancement in the field of emotion-aware speech synthesis, addressing critical limitations of existing approaches and providing a foundation for future research in emotional control in TTS.
The proposed EmoShift framework introduces a novel EmoSteer layer that learns emotion-specific steering vectors, allowing for precise emotional control in TTS without retraining the base model. The methodology is well-structured, leveraging activation steering to inject emotion-specific offsets in a plug-and-play manner. This approach is innovative as it addresses the limitations of existing emotion-aware TTS systems that rely on fixed emotion embeddings or external guidance. The model's architecture is designed to be model-agnostic, which enhances its applicability across various TTS systems. The integration of objective and subjective evaluations to assess performance is commendable, providing a holistic view of the model's effectiveness.
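A minimal sketch of the activation-steering idea, assuming the learned steering vector is simply added to the frozen model's output embeddings; the paper's exact insertion point and scaling scheme may differ.

```python
# Minimal sketch of an EmoSteer-style activation-steering layer.
# The insertion point and alpha scaling are assumptions.
import torch
import torch.nn as nn

class EmoSteerLayer(nn.Module):
    def __init__(self, d_model: int, num_emotions: int):
        super().__init__()
        # One trainable steering vector per target emotion.
        self.steer = nn.Embedding(num_emotions, d_model)
        nn.init.zeros_(self.steer.weight)  # start as a no-op

    def forward(self, hidden, emotion_id, alpha: float = 1.0):
        # hidden: (batch, seq_len, d_model) output embeddings of the frozen model
        offset = self.steer(emotion_id).unsqueeze(1)   # (batch, 1, d_model)
        return hidden + alpha * offset                 # additive shift, no rescaling
```

Because only the small table of steering vectors is trained, the trainable footprint stays tiny, which matches the lightweight, plug-and-play design the review highlights.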
The experimental setup is robust, utilizing a well-defined dataset (ESD) and comparing EmoShift against strong baselines, including zero-shot and fully fine-tuned variants of the base TTS model. The results demonstrate significant improvements in emotional expressiveness while maintaining naturalness and speaker similarity. The use of both objective metrics (WER, SpkSIM, DNSMOS) and subjective metrics (MOS, Emo-MOS) strengthens the evaluation, confirming the model's capabilities across multiple dimensions of TTS performance.
The paper provides sufficient details regarding the experimental setup, including training parameters, dataset partitioning, and evaluation metrics, which aids in reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation is the reliance on a specific dataset (ESD), which may affect the generalizability of the results to other languages or emotional contexts. Additionally, while the EmoSteer layer shows promise for emotional control, the paper does not explore the impact of using more diverse or compound emotions, which could enhance the model's applicability in real-world scenarios.
The EmoShift framework has significant implications for applications in virtual assistants, audiobooks, and human-machine dialogue systems, where emotional expressiveness is crucial for user engagement and interaction quality. By enabling more nuanced emotional control in TTS, this work could enhance user experiences in various domains, including education, entertainment, and accessibility.
Recent advances in Large Audio Language Models (LALMs) have extended Text-to-Speech (TTS) to interactive role-play scenarios, which demand high expressiveness and strict adherence to role-play instructions. However, existing models struggle to maintain stylistic consistency with character profiles and scene descriptions across multi-turn dialogues. A critical bottleneck is the lack of objective metrics for quantifying speaking style. To bridge this gap, we propose Mean Continuation Log-Probability (MCLP) as both an evaluation metric and a reward signal, validated on LALM-based Role-Play TTS (RP-TTS) tasks. Critically, we leverage the In-Context Learning capability of pre-trained LALMs to formulate MCLP via a continuation log-probability prediction. This metric quantifies stylistic consistency by measuring the likelihood of the ground-truth speech conditioned on the generated speech. Furthermore, we employ MCLP as a reinforcement learning reward to enhance the style alignment between generated speech and Role-Play instructions. To facilitate evaluation, we construct an RP-TTS dataset with rich scene and character annotations. Experimental results demonstrate that our method significantly outperforms strong LALM baselines on both objective and subjective metrics.
Primary: University of Chinese Academy of Sciences
All Institutions: University of Chinese Academy of Sciences, Beihang University, StepFun
The paper presents a significant contribution to the field of machine learning by addressing the challenge of stylistic consistency in role-play TTS through the innovative use of MCLP and a hybrid reward mechanism. The methodology is robust, and the experimental results demonstrate its effectiveness, marking a meaningful advancement in the capabilities of TTS systems.
The paper introduces a novel metric, Mean Continuation Log-Probability (MCLP), which quantifies stylistic consistency in TTS systems using the capabilities of pre-trained Large Audio Language Models (LALMs). The methodology is well-structured, combining supervised fine-tuning (SFT) and reinforcement learning (RL) to optimize TTS for role-play scenarios. The integration of MCLP as both an evaluation metric and a reward signal is innovative, providing a more nuanced approach to measuring stylistic adherence in generated speech. The use of a hybrid reward function that balances style and content fidelity is a significant advancement in addressing the challenges of role-play TTS.
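For readers wanting the metric in operational terms, here is a hedged sketch of MCLP as the mean next-token log-probability of the ground-truth speech tokens conditioned on the generated speech; the tokenization and the HuggingFace-style `lalm(inputs).logits` interface are placeholders, not the authors' API.

```python
# Hedged sketch of Mean Continuation Log-Probability (MCLP).
# `lalm` is a placeholder causal audio language model returning .logits.
import torch

def mclp(lalm, generated_tokens: torch.Tensor, reference_tokens: torch.Tensor) -> float:
    # Concatenate generated speech (context) with the ground-truth continuation.
    inputs = torch.cat([generated_tokens, reference_tokens], dim=-1).unsqueeze(0)
    with torch.no_grad():
        logits = lalm(inputs).logits                     # (1, T, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Log-probability of each ground-truth token, predicted from its prefix.
    start = generated_tokens.numel()
    tgt = inputs[0, start:]
    pred = log_probs[0, start - 1:-1, :]                 # next-token positions
    token_lp = pred.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()                        # higher = more style-consistent
```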
The experiments are comprehensive, utilizing a newly constructed RP-TTS dataset with rich annotations that enhance the evaluation of the proposed method. The results demonstrate significant improvements over strong baselines in both objective and subjective metrics, indicating the effectiveness of MCLP in real-world applications. The paper includes rigorous ablation studies that validate the necessity of each component of the proposed method, further strengthening the experimental findings.
While the paper provides detailed descriptions of the methodology and experimental setup, it lacks specific implementation details and code availability, which could hinder reproducibility. The absence of a demo or project URL further complicates efforts to replicate the results.
One limitation is the reliance on subjective evaluations, which can introduce variability based on annotator interpretation. Additionally, the paper does not address potential biases in the dataset construction process, which could affect the generalizability of the findings. The hybrid reward formulation, while innovative, may also lead to complexities in tuning the reward parameters effectively.
The advancements in expressive TTS systems have significant implications for various applications, including gaming, virtual assistants, and interactive storytelling. By improving the ability of TTS systems to maintain stylistic consistency, this work could enhance user engagement and experience in interactive media.
Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding. Our code is available at https://github.com/NKU-HLT/DIFFA.git.
Primary: Meituan
All Institutions: Meituan
The main contribution of this paper is the introduction of DIFFA-2, a diffusion-based large audio language model that significantly enhances audio understanding capabilities through innovative training methodologies and architectures. This work represents a meaningful step forward in the field of audio processing and understanding, showcasing the potential of diffusion models in a domain traditionally dominated by autoregressive approaches.
The methodology is robust, introducing a four-stage training curriculum that effectively combines semantic and acoustic alignment, large-scale supervised fine-tuning, and preference optimization. The dual-adapter architecture and the use of a frozen Whisper encoder are innovative, allowing for effective audio understanding. The paper also employs variance-reduced preference optimization, which is a notable contribution to the training process of diffusion models.
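A rough sketch of the dual-adapter idea, with all names and dimensions assumed for illustration: a frozen speech encoder feeds two lightweight projections, one tuned toward semantic content and one toward acoustic detail, whose outputs are concatenated as audio tokens for the diffusion backbone.

```python
# Loose sketch of a dual semantic/acoustic adapter; dimensions and the
# concatenation scheme are assumptions, not the DIFFA-2 implementation.
import torch
import torch.nn as nn

class DualAdapter(nn.Module):
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.semantic = nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(),
                                      nn.Linear(llm_dim, llm_dim))
        self.acoustic = nn.Sequential(nn.Linear(enc_dim, llm_dim), nn.GELU(),
                                      nn.Linear(llm_dim, llm_dim))

    def forward(self, enc_feats):                 # (batch, frames, enc_dim), frozen encoder output
        sem = self.semantic(enc_feats)
        aco = self.acoustic(enc_feats)
        # Concatenating the two token streams along time is one design choice;
        # interleaving or cross-attending are equally plausible alternatives.
        return torch.cat([sem, aco], dim=1)       # (batch, 2*frames, llm_dim)
```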
The experiments are comprehensive, utilizing multiple benchmarks (MMSU, MMAU, MMAR) to evaluate the model's performance across various dimensions of audio understanding. The results indicate that DIFFA-2 consistently outperforms its predecessor and competes well with strong autoregressive models, demonstrating the effectiveness of the proposed methods.
The paper provides sufficient details about the training and inference setup, including the datasets used and the training pipeline. However, the reproducibility could be enhanced with more explicit descriptions of hyperparameters and model configurations.
The paper acknowledges limitations in its training focus, particularly regarding conversational and alignment-style supervision, which affects performance on dialogue-centric benchmarks. Additionally, the model's performance on mixed-modality tasks is not as strong, indicating areas for improvement.
The advancements in audio understanding through DIFFA-2 have significant implications for applications in interactive voice assistants, audio analysis, and multimedia content understanding. The open-sourcing of the code and training pipeline also promotes further research in this area.
We present a deep neural network approach for encoding microphone array signals into Ambisonics that generalizes to arbitrary microphone array configurations with fixed microphone count but varying locations and frequency-dependent directional characteristics. Unlike previous methods that rely only on array geometry as metadata, our approach uses directional array transfer functions, enabling accurate characterization of real-world arrays. The proposed architecture employs separate encoders for audio and directional responses, combining them through cross-attention mechanisms to generate array-independent spatial audio representations. We evaluate the method on simulated data in two settings: a mobile phone with complex body scattering, and a free-field condition, both with varying numbers of sound sources in reverberant environments. Evaluations demonstrate that our approach outperforms both conventional digital signal processing-based methods and existing deep neural network solutions. Furthermore, using array transfer functions instead of geometry as metadata input improves accuracy on realistic arrays.
Primary: unknown
All Institutions: unknown
This paper presents a significant advancement in the encoding of spatial audio through a novel neural architecture that leverages cross-attention mechanisms and directional ATFs, demonstrating strong performance in challenging acoustic environments. The methodology and results contribute meaningfully to the field of audio processing and spatial audio technologies.
The paper introduces a novel deep neural network architecture that effectively encodes microphone array signals into Ambisonics using directional array transfer functions (ATFs) and cross-attention mechanisms. The separation of encoders for audio and directional responses is a significant methodological advancement, allowing for the generation of array-independent spatial audio representations. The use of cross-attention to combine features from different modalities is well-justified and aligns with contemporary trends in multi-modal learning. However, the paper could benefit from a clearer explanation of the architecture's design choices and the rationale behind specific hyperparameter selections.
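The fusion step can be illustrated with a short, hedged sketch (not the authors' code): audio frames attend over per-direction ATF embeddings, so the mapping to Ambisonics is conditioned on the specific array's directional responses.

```python
# Illustrative sketch of cross-attention fusion between audio features and
# encoded array transfer functions; dimensions and layout are assumptions.
import torch
import torch.nn as nn

class ATFCrossAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio_tokens, atf_tokens):
        # audio_tokens: (batch, T, d_model)  per-frame multichannel audio features
        # atf_tokens:   (batch, D, d_model)  one token per measured direction/response
        fused, _ = self.attn(query=audio_tokens, key=atf_tokens, value=atf_tokens)
        return self.norm(audio_tokens + fused)    # residual keeps the audio path intact
```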
The evaluation of the proposed method is thorough, utilizing simulated data across two distinct environments: a mobile phone scenario with body scattering and a free-field condition. The comparative analysis against traditional DSP methods and existing neural solutions is robust, demonstrating clear performance improvements in terms of scale-invariant signal-to-distortion ratio (SI-SDR) and other Ambisonics metrics. The results are well-presented, though additional qualitative assessments, such as listening tests, would strengthen the findings.
The paper provides a detailed description of the experimental setup, including data generation, training procedures, and evaluation metrics. However, the absence of a publicly accessible code repository or demo limits reproducibility. Future work should include sharing the implementation to facilitate validation and further exploration by the community.
One limitation is the reliance on simulated data, which may not fully capture the complexities of real-world scenarios. Additionally, while the model shows promising results, its generalization capabilities to various real-world microphone configurations and environments remain to be thoroughly tested. The paper also mentions that the model's performance could be enhanced by increasing the learning capacity of the encoders and decoder, indicating potential avenues for future research.
The proposed method has significant implications for spatial audio applications, particularly in immersive communication and virtual/extended reality environments. By improving the encoding of microphone array signals, this work could enhance user experiences in various consumer devices, making it relevant for industries focused on audio technology and immersive media. The ability to generalize across different microphone configurations also opens up possibilities for broader adoption in diverse applications.
To advance immersive communication, the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge recently introduced Task 4 on Spatial Semantic Segmentation of Sound Scenes (S5). An S5 system takes a multi-channel audio mixture as input and outputs single-channel dry sources along with their corresponding class labels. Although the DCASE 2025 Challenge simplifies the task by constraining class labels in each mixture to be mutually exclusive, real-world mixtures frequently contain multiple sources from the same class. The presence of duplicated labels can significantly degrade the performance of the label-queried source separation (LQSS) model, which is the key component of many existing S5 systems, and can also limit the validity of the official evaluation metric of DCASE 2025 Task 4. To address these issues, we propose a class-aware permutation-invariant loss function that enables the LQSS model to handle queries involving duplicated labels. In addition, we redesign the S5 evaluation metric to eliminate ambiguities caused by these same-class sources. To evaluate the proposed method within the S5 system, we extend the label prediction model to support same-class labels. Experimental results demonstrate the effectiveness of the proposed methods and the robustness of the new metric on mixtures both with and without same-class sources.
Primary: unknown
All Institutions: JST Strategic International Collaborative Research Program (SICORP)
This paper presents a novel approach to handling duplicated labels in sound source separation, significantly improving the performance of systems designed for complex audio environments. The technical contributions are well-articulated, and the proposed methodologies could set a new standard in the field of audio processing and immersive communication.
The paper proposes a class-aware permutation-invariant loss function that effectively addresses the challenges posed by duplicated labels in sound source separation tasks. The methodology is well-structured, introducing modifications to existing models and metrics to enhance performance in real-world scenarios where multiple sources from the same class are present. The approach is innovative in its use of permutation-invariant training tailored to the specific context of audio segmentation, which is a significant advancement over traditional methods that do not account for label duplication.
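A compact sketch of the class-aware permutation-invariant idea, under the assumption that permutations are searched only among estimates sharing the same query label, since sources of different classes are already disambiguated by their labels.

```python
# Sketch of a class-aware permutation-invariant loss under stated assumptions.
import itertools
import torch

def class_aware_pit_loss(est, ref, labels, loss_fn=torch.nn.functional.l1_loss):
    # est, ref: (num_sources, samples); labels: list of class ids, one per source
    total = 0.0
    for cls in set(labels):
        idx = [i for i, c in enumerate(labels) if c == cls]
        best = None
        # Try every assignment of estimates to references within this class only.
        for perm in itertools.permutations(idx):
            loss = sum(loss_fn(est[p], ref[i]) for p, i in zip(perm, idx))
            best = loss if best is None else torch.minimum(best, loss)
        total = total + best
    return total / len(labels)
```

For the small per-class source counts typical of S5 mixtures, the brute-force permutation search stays cheap.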
The experiments are comprehensive, utilizing a well-defined dataset that simulates real-world conditions. The authors provide a detailed analysis of the performance of their proposed system compared to existing methods, demonstrating significant improvements in handling same-class sources. However, the paper could benefit from additional comparisons with more diverse models and datasets to further validate the robustness of the proposed approach.
The paper mentions that the source code will be released as part of the baseline system for the DCASE 2026 Challenge, which is a positive step towards reproducibility. However, the lack of specific URLs for the code repository and demo limits the immediate accessibility of the implementation details.
The paper acknowledges that the performance of the audio tagging model is still limited when estimating the number of sources and their labels simultaneously, particularly in the presence of multiple sources from the same class. Additionally, the reliance on oracle labels during training may not fully reflect real-world applications where such labels are not available.
The proposed methods have significant implications for immersive communication technologies and audio processing applications, particularly in environments where multiple sound sources coexist. The advancements in sound source separation could enhance user experiences in virtual and augmented reality applications, as well as improve accessibility in audio-based communication systems.
Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Codes and checkpoints will be released soon at https://github.com/realzhouchushu/CAT.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of the Convolutional Audio Transformer (CAT), which effectively addresses the limitations of existing self-supervised learning methods in audio understanding by incorporating a multi-resolution approach and representation regularization. This work represents a meaningful step forward in the field, combining innovative methodology with rigorous experimental validation to enhance audio representation learning.
The proposed Convolutional Audio Transformer (CAT) introduces a Multi-resolution Block to capture hierarchical audio features, which is a significant advancement over existing methods that typically operate at a single level of granularity. The incorporation of a Representation Regularization objective is innovative, as it aligns the student model's predictions with high-quality semantic representations from pre-trained external encoders. This approach not only enhances the model's training efficiency but also bridges the gap between audio and language representations, which is a novel contribution to the field of audio understanding.
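The representation-regularization term can be pictured as a simple cosine alignment between student features and embeddings from a frozen external encoder; the projection heads and loss weighting are assumptions rather than the CAT implementation.

```python
# Hedged sketch of a representation-regularization objective: align student
# features with a frozen external encoder's embeddings via cosine similarity.
import torch
import torch.nn.functional as F

def repr_regularization(student_feats, teacher_feats):
    # student_feats, teacher_feats: (batch, tokens, dim); the teacher is frozen.
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)
    # Maximize cosine similarity token-by-token, i.e. minimize 1 - cos.
    return (1.0 - (s * t).sum(dim=-1)).mean()

# In training, this auxiliary term would be added with some weight to the usual
# bootstrapped prediction loss between the student and its EMA teacher.
```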
The experiments conducted on multiple audio understanding benchmarks, including AudioSet, ESC-50, and Speech Commands V2, demonstrate the effectiveness of CAT. The reported results show significant improvements over baseline models, particularly in terms of convergence speed and performance metrics. The use of various datasets and the comparison against state-of-the-art models strengthen the credibility of the findings. However, more details on the experimental setup and statistical significance of the results would enhance the evaluation.
The paper mentions that codes and checkpoints will be released, which is a positive aspect for reproducibility. However, the detailed hyperparameter settings and training configurations provided in the tables are essential for others to replicate the experiments accurately. The clarity of these details is crucial for ensuring that the research can be reproduced by the community.
One limitation is the reliance on pre-trained external encoders, which may limit the model's applicability in scenarios where such resources are not available. Additionally, while the model shows improved performance, the computational efficiency and scalability of the approach in real-world applications need further exploration. The paper could also benefit from a more thorough discussion on the potential biases in the datasets used.
The advancements made in audio understanding through the CAT framework have significant implications for various applications, including automated audio captioning, sound event detection, and human-computer interaction. By improving the efficiency and effectiveness of audio representation learning, this research could lead to more robust audio processing systems in diverse domains such as entertainment, surveillance, and accessibility technologies.
In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: https://semanticaudio1.github.io/
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Shanghai Jiao Tong University
The main contribution of this work is the introduction of the SemanticAudio framework, which decouples semantic planning from acoustic synthesis, achieving superior semantic alignment and enabling training-free audio editing. This innovative approach addresses critical limitations in existing text-to-audio generation models and has the potential to significantly impact the field of audio synthesis and editing.
The proposed SemanticAudio framework introduces a two-stage Flow Matching architecture that effectively separates the semantic planning of audio content from the acoustic synthesis process. This decoupling allows for improved semantic alignment with textual prompts, addressing a significant limitation in existing models that operate directly in acoustic latent spaces. The methodology is well-structured, leveraging pre-trained models for both semantic and acoustic representations, and introduces a novel training-free editing mechanism that enhances user control over audio attributes. The use of velocity fields for guiding the generation process is particularly innovative and demonstrates a solid understanding of the underlying principles of generative modeling.
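The editing mechanism can be pictured with a conceptual sketch of flow-matching sampling steered by the difference of velocity fields under the source and target prompts; the `planner.velocity` interface, the Euler integrator, and the guidance scale are placeholders, not the paper's implementation.

```python
# Conceptual sketch of training-free, text-guided semantic editing with flow
# matching; the model interface and guidance scale are assumptions.
import torch

@torch.no_grad()
def edit_semantic_trajectory(planner, x, timesteps, src_emb, tgt_emb, scale=1.0):
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        v_src = planner.velocity(x, t_cur, cond=src_emb)
        v_tgt = planner.velocity(x, t_cur, cond=tgt_emb)
        # Follow the source flow, nudged along the direction that separates
        # the target prompt from the source prompt.
        v = v_src + scale * (v_tgt - v_src)
        x = x + (t_next - t_cur) * v        # simple Euler step
    return x
```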
The experiments conducted are extensive and rigorously designed, utilizing the AudioCaps dataset to evaluate both text-to-audio generation and training-free editing capabilities. The paper provides clear metrics for assessing performance, including CLAP scores for semantic alignment, Fréchet Distance for fidelity, and Inception Score for diversity. The results indicate that SemanticAudio outperforms existing state-of-the-art methods, validating the proposed approach. However, the reliance on a single dataset for training and evaluation may limit the generalizability of the findings.
The paper includes detailed implementation specifics, including architecture choices, training protocols, and evaluation metrics, which facilitate reproducibility. The use of established frameworks and pre-trained models further aids in replicating the results. However, the absence of a public code repository may hinder full reproducibility for some researchers.
The paper acknowledges limitations related to the dataset size and the potential challenges in generalizing the model to longer audio sequences or more complex acoustic scenarios. Additionally, the evaluation of editing capabilities relies on proxy metrics, which may not fully capture the subjective quality of the audio modifications. Future work is needed to address these limitations and explore broader datasets.
The SemanticAudio framework has significant implications for various applications in creative industries, such as film, gaming, and virtual reality, where high-quality audio generation and editing are crucial. The ability to manipulate audio attributes without retraining models can streamline workflows for sound designers and enhance user experiences in interactive environments. The research contributes to the growing field of generative audio models, pushing the boundaries of what is possible in text-to-audio synthesis.
Scaling spoken language modeling requires speech tokens that are both efficient and universal. Recent work has proposed syllables as promising speech tokens at low temporal resolution, but existing models are constrained to English and fail to capture sufficient acoustic detail. To address this gap, we present Sylber 2.0, a self-supervised framework for coding speech at the syllable level that enables efficient temporal compression and high-fidelity reconstruction. Sylber 2.0 achieves a very low token frequency of around 5 Hz, while retaining both linguistic and acoustic detail across multiple languages and expressive styles. Experiments show that it performs on par with previous models operating at much higher token frequencies. Furthermore, Sylber 2.0 enables efficient TTS modeling that generates speech with intelligibility and quality competitive with SOTA models while using only 72M parameters. Moreover, the universality of Sylber 2.0 provides more effective features for low-resource ASR than previous speech coding frameworks. In sum, we establish an effective syllable-level abstraction for general spoken language.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Carnegie Mellon University
Sylber 2.0 presents a significant advancement in speech modeling by introducing a universal syllable embedding framework that efficiently captures linguistic and acoustic details across multiple languages. The comprehensive methodology, rigorous experimental evaluation, and potential for broad applications underscore its importance in the field of machine learning and audio processing.
The methodology presented in Sylber 2.0 is robust and innovative, leveraging self-supervised learning to create syllable embeddings that effectively capture both linguistic and acoustic details across multiple languages. The introduction of a boundary detector and an auxiliary acoustic encoder enhances the model's ability to generate high-fidelity speech while maintaining a low token frequency. The multi-stage training process and the careful design of the encoding-decoding framework demonstrate a thorough understanding of the challenges in speech modeling.
The experiments conducted are comprehensive and well-structured, covering a wide range of languages and styles. The results indicate that Sylber 2.0 achieves competitive performance in terms of intelligibility and quality compared to state-of-the-art models, even with a significantly reduced parameter count. The evaluation metrics used, such as WER and STOI, provide a clear picture of the model's effectiveness in real-world applications.
The paper provides detailed implementation details, including training data sources and hyperparameter settings, which enhance reproducibility. However, the absence of a publicly available code repository or demo limits the ability for other researchers to reproduce the results independently.
One limitation is the reliance on the quality of the training data, as the model's performance may vary significantly with different datasets. Additionally, while the model is designed for multilingual applications, the performance in low-resource languages could be further explored to assess its generalizability. The potential for misuse in generating misleading audio also raises ethical concerns that need to be addressed.
The implications of this research are significant, particularly in the fields of text-to-speech (TTS) and automatic speech recognition (ASR). By providing a more efficient and universal method for speech tokenization, Sylber 2.0 could enhance accessibility and usability in various applications, including language learning, assistive technologies, and multilingual communication. However, ethical considerations regarding the misuse of synthesized speech must be taken into account.
The performance of speaker verification systems degrades significantly under language mismatch, a critical challenge exacerbated by the field's reliance on English-centric data. To address this, we propose the TidyVoice Challenge for cross-lingual speaker verification. The challenge leverages the TidyVoiceX dataset from the novel TidyVoice benchmark, a large-scale, multilingual corpus derived from Mozilla Common Voice, and specifically curated to isolate the effect of language switching across approximately 40 languages. Participants will be tasked with building systems robust to this mismatch, with performance primarily evaluated using the Equal Error Rate on cross-language trials. By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards fairer, more inclusive, and language-independent speaker recognition technologies, directly aligning with the Interspeech 2026 theme, "Speaking Together."
Primary: University of Zurich
All Institutions: University of Zurich, Indiana University, Mozilla Foundation, Otto-von-Guericke-University Magdeburg
The TidyVoice Challenge aims to advance cross-lingual speaker verification research by providing a structured evaluation framework and a curated multilingual dataset. This comprehensive analysis highlights the challenge's innovative approach, rigorous methodology, and potential implications for the field of machine learning.
The methodology proposed in the TidyVoice Challenge is well-structured and addresses a significant gap in speaker verification research, particularly focusing on cross-lingual scenarios. The challenge is designed to evaluate systems under controlled conditions with a clear definition of tasks, training, and test conditions. The use of the TidyVoiceX dataset, which is specifically curated to isolate language switching effects, adds robustness to the methodology. The evaluation metrics, including Equal Error Rate (EER) and Minimum Detection Cost Function (minDCF), are appropriate for the task and provide a comprehensive assessment of system performance.
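Since EER is the challenge's primary metric, a self-contained reference computation may be useful; this is a standard ROC-based estimate, not code from the challenge baseline.

```python
# Standard Equal Error Rate (EER) computation from verification trial scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    # labels: 1 for target (same-speaker) trials, 0 for non-target trials
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    # EER is the operating point where false-acceptance and false-rejection cross.
    idx = np.nanargmin(np.abs(fnr - fpr))
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Example: equal_error_rate(np.array([0.9, 0.2, 0.7, 0.1]), np.array([1, 0, 1, 0]))
```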
The paper outlines a rigorous evaluation plan that includes a clear delineation of training and evaluation phases, as well as the use of a baseline system for comparison. The challenge's design ensures that participants are tested on their ability to generalize to unseen languages, which is critical for assessing the robustness of speaker verification systems. However, the paper does not provide empirical results or preliminary findings, which could have strengthened the evaluation of the proposed challenge.
The challenge emphasizes reproducibility by requiring participants to submit detailed system descriptions and trained models. This is a positive aspect, as it encourages transparency and allows for independent verification of results. The provision of a baseline system and evaluation scripts further enhances reproducibility, although the actual implementation details of the baseline system are not fully elaborated in the paper.
One limitation of the challenge is the reliance on the Mozilla Common Voice dataset, which may have inherent biases or limitations in terms of speaker diversity and language representation. Additionally, the challenge does not address potential issues related to the quality of the audio recordings, which could impact the performance of the systems developed by participants.
The TidyVoice Challenge has the potential to significantly impact the field of speaker verification by promoting research that is more inclusive and representative of diverse languages. By focusing on cross-lingual verification, the challenge aligns with broader goals of fairness and accessibility in machine learning technologies. The outcomes of this challenge could lead to advancements in language-independent speaker recognition systems, benefiting various applications in security, telecommunications, and human-computer interaction.
While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real-world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause severe semantic distortions. To bridge this gap, we introduce a Large Language Model (LLM)-based agent for post-ASR correction: a Judge-Editor over the top-k ASR hypotheses that keeps high-confidence spans, rewrites uncertain segments, and operates in both zero-shot and fine-tuned modes. In parallel, we release SAP-Hypo5, the largest benchmark for dysarthric speech correction, to enable reproducibility and future exploration. Under multi-perspective evaluation, our agent achieves a 14.51% WER reduction alongside substantial semantic gains, including a +7.59 pp improvement in MENLI and +7.66 pp in Slot Micro F1 on challenging samples. Our analysis further reveals that WER is highly sensitive to domain shift, whereas semantic metrics correlate more closely with downstream task performance.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign
This paper presents a significant advancement in the field of dysarthric speech recognition by proposing a robust LLM-based post-ASR correction method that prioritizes semantic fidelity over traditional metrics. The combination of innovative methodology and comprehensive evaluation positions this work as a valuable contribution to both the machine learning and speech recognition communities.
The paper introduces a novel approach to post-ASR correction for dysarthric speech using a Large Language Model (LLM) as a Judge-Editor. This method is significant as it operates on the top-k ASR hypotheses, allowing for the retention of high-confidence segments while rewriting uncertain parts. The dual operational modes (zero-shot and fine-tuned) enhance its applicability across various scenarios. The integration of semantic fidelity metrics alongside traditional WER represents a meaningful shift in how ASR systems are evaluated, particularly for populations with unique speech characteristics.
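One simple, hedged way to picture the "keep confident spans, rewrite uncertain ones" routing is to score per-position agreement across the top-k hypotheses; the paper's actual confidence estimation and editing prompts are not reproduced here.

```python
# Sketch: flag uncertain segments via word-level agreement across top-k hypotheses.
from collections import Counter

def word_agreement(hypotheses):
    # hypotheses: list of k ASR hypothesis strings; aligned naively by word position.
    tokenized = [h.split() for h in hypotheses]
    length = max(len(t) for t in tokenized)
    scores = []
    for i in range(length):
        words = [t[i] for t in tokenized if i < len(t)]
        top_word, count = Counter(words).most_common(1)[0]
        scores.append((top_word, count / len(hypotheses)))
    return scores  # [(majority word, agreement in [0, 1]), ...]

# Spans whose agreement falls below a threshold (e.g. 0.6) would be candidates
# for the "edit" action, while high-agreement spans are kept verbatim.
```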
The authors provide a comprehensive evaluation of their method using the newly released SAP-Hypo5 dataset, which is the largest benchmark for dysarthric speech correction. The reported 14.51% reduction in WER, alongside improvements in semantic metrics (MENLI and Slot Micro F1), indicates robust experimental design and results. The multi-perspective evaluation approach adds depth to the analysis, showing that traditional metrics can be misleading in specific contexts, particularly for dysarthric speech.
The authors emphasize the importance of reproducibility by releasing the SAP-Hypo5 dataset, which is crucial for future research in this area. However, the paper lacks specific details regarding the implementation of the LLM-agent and whether the code or models will be made publicly available, which could hinder full reproducibility.
While the paper presents a strong methodology and results, it does not address potential limitations in the generalizability of the LLM-agent across different dialects or languages of dysarthric speech. Additionally, the reliance on top-k hypotheses may introduce biases based on the ASR system used, which could affect the outcomes.
The implications of this research are significant, particularly for improving communication aids for individuals with dysarthria. By enhancing the accuracy of ASR systems in this context, the work could lead to better accessibility tools, ultimately improving the quality of life for affected individuals. The focus on semantic fidelity also sets a precedent for future research in ASR applications beyond dysarthric speech.
Speech editing achieves semantic inversion by performing fine-grained segment-level manipulation on original utterances, while preserving global perceptual naturalness. Existing detection studies mainly focus on manually edited speech with explicit splicing artifacts, and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap of high-quality speech editing datasets. Building upon this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating them as an audio question answering task. To mitigate the inherent forgery bias and semantic-priority bias observed in existing audio large models, PELM incorporates word-level probability priors to provide explicit acoustic cues, and further designs a centroid-aggregation-based acoustic consistency perception loss to explicitly enforce the modeling of subtle local distribution anomalies. Extensive experimental results demonstrate that PELM significantly outperforms state-of-the-art methods on both the HumanEdit and AiEdit datasets, achieving equal error rates (EER) of 0.57% and 9.28% (localization), respectively.
Primary: Wuhan University
All Institutions: Wuhan University, Anhui University, Communication University of China, Beihang University, Independent Researcher
The paper presents a comprehensive approach to speech editing detection and content localization through the development of the PELM framework, significantly advancing the field of audio processing and detection. The innovative methodology, combined with robust experimental validation, positions this work as a valuable contribution to combating the challenges posed by advanced audio manipulation techniques.
The paper introduces a novel framework, PELM, that combines speech editing detection and content localization by treating them as an audio question answering task. The incorporation of a word-level probabilistic prior and an acoustic consistency-aware loss is innovative, addressing biases in existing audio large language models. The methodology is well-structured, leveraging a large-scale bilingual dataset (AiEdit) that enhances the robustness of the model against advanced speech editing techniques.
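A loose sketch, built entirely on assumptions, of what a centroid-aggregation consistency loss could look like: genuine frames are pulled toward the centroid of genuine-frame embeddings while edited frames are pushed away, encouraging sensitivity to local distribution anomalies.

```python
# Loose sketch (assumptions only) of a centroid-based acoustic consistency loss.
import torch
import torch.nn.functional as F

def consistency_loss(frame_emb, edit_mask, margin: float = 0.5):
    # frame_emb: (frames, dim); edit_mask: (frames,) with 1 on edited frames
    genuine = frame_emb[edit_mask == 0]
    if genuine.numel() == 0:
        return frame_emb.sum() * 0.0                       # no genuine frames: no-op
    centroid = F.normalize(genuine.mean(dim=0, keepdim=True), dim=-1)
    sim = (F.normalize(frame_emb, dim=-1) @ centroid.T).squeeze(-1)  # cosine to centroid
    pull = (1.0 - sim[edit_mask == 0]).mean()              # keep genuine frames close
    push = F.relu(sim[edit_mask == 1] - margin).mean() if (edit_mask == 1).any() else 0.0
    return pull + push
```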
The experiments are thorough, comparing PELM against several state-of-the-art methods on both the HumanEdit and AiEdit datasets. The results demonstrate significant improvements in detection and localization tasks, with detailed metrics such as Equal Error Rate (EER) showing PELM's superiority. The ablation studies provide insights into the contributions of each component of the framework, reinforcing the validity of the proposed methods.
The paper provides sufficient implementation details, including model architectures, training configurations, and hyperparameters, which support reproducibility. The authors have also made the dataset publicly available, facilitating further research in this area.
One limitation is the reliance on the quality of the underlying large language models, which may affect the performance of PELM. Additionally, while the dataset is extensive, it may not cover all possible speech editing scenarios, potentially limiting the generalizability of the findings.
The research addresses critical issues related to audio deepfakes and misinformation, making it highly relevant in today's digital landscape. The ability to detect and localize speech edits has significant implications for security, privacy, and the integrity of information dissemination.
We propose a brain-informed speech separation method for cochlear implants (CIs) that uses electroencephalography (EEG)-derived attention cues to guide enhancement toward the attended speaker. An attention-guided network fuses audio mixtures with EEG features through a lightweight fusion layer, producing attended-source electrodograms for CI stimulation while resolving the label-permutation ambiguity of audio-only separators. Robustness to degraded attention cues is improved with a mixed curriculum that varies cue quality during training, yielding stable gains even when EEG-speech correlation is moderate. In multi-talker conditions, the model achieves higher signal-to-interference ratio improvements than an audio-only electrodogram baseline while remaining slightly smaller (167k vs. 171k parameters). With 2 ms algorithmic latency and comparable cost, the approach highlights the promise of coupling auditory and neural cues for cognitively adaptive CI processing.
Primary: unknown
All Institutions: unknown
The paper presents a novel brain-informed speech separation method for cochlear implants, demonstrating significant improvements over traditional audio-only approaches. The integration of EEG-derived attention cues and a robust training methodology highlights its potential to enhance speech intelligibility in complex auditory environments, marking a meaningful contribution to the field of machine learning and auditory processing.
The proposed methodology integrates EEG-derived attention cues with audio processing in a lightweight neural network architecture, addressing a significant challenge in cochlear implant (CI) technology. The attention-guided network effectively resolves label-permutation ambiguity by producing a single attended electrodogram, which is a notable advancement over traditional audio-only approaches. The use of curriculum learning to enhance robustness against degraded cues is a clever strategy that reflects a deep understanding of the practical challenges in real-world applications. However, the reliance on a proxy attention cue rather than real EEG data is a limitation that could affect the generalizability of the results.
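The lightweight fusion step can be illustrated with a minimal FiLM-style sketch, assuming the EEG-derived cue modulates the audio features channel-wise; the paper's exact fusion layer may differ.

```python
# Minimal sketch of a lightweight EEG-audio fusion layer: the attention cue is
# projected into a feature-wise scale and shift applied to the audio embedding.
import torch
import torch.nn as nn

class AttentionCueFusion(nn.Module):
    def __init__(self, audio_dim: int, eeg_dim: int):
        super().__init__()
        self.to_scale = nn.Linear(eeg_dim, audio_dim)
        self.to_shift = nn.Linear(eeg_dim, audio_dim)

    def forward(self, audio_feats, eeg_feats):
        # audio_feats: (batch, frames, audio_dim); eeg_feats: (batch, eeg_dim)
        scale = torch.sigmoid(self.to_scale(eeg_feats)).unsqueeze(1)
        shift = self.to_shift(eeg_feats).unsqueeze(1)
        return audio_feats * scale + shift   # steer features toward the attended speaker
```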
The experimental evaluation is thorough, comparing the proposed model against a strong baseline in various conditions. The results demonstrate significant improvements in signal-to-interference ratio (SIR) across different input conditions, indicating the effectiveness of the proposed method. The analysis of cue correlation and its impact on performance provides valuable insights into the robustness of the model. However, the experiments could benefit from additional real-world testing with actual CI users to validate the findings further.
The paper provides a clear description of the model architecture, training procedures, and evaluation metrics, along with a link to the open-source implementation. This transparency enhances reproducibility, allowing other researchers to replicate the study and build upon the findings. However, the absence of real EEG data in the training and evaluation phases may limit the reproducibility of results in practical scenarios.
Key limitations include the use of a proxy attention cue instead of real EEG data, which may not fully capture the complexities of actual neural signals. Additionally, while the mixed curriculum learning approach shows promise, the model's performance in highly variable real-world environments remains untested. Future work should address these limitations by incorporating real EEG data and evaluating the model's performance in more complex auditory scenes.
The research has significant implications for improving speech perception in cochlear implant users, particularly in challenging listening environments such as multi-talker scenarios. By leveraging brain-computer interface techniques, this work opens avenues for more cognitively adaptive auditory processing systems, potentially enhancing the quality of life for individuals with hearing impairments. The findings could also inspire further research into multimodal integration in various applications beyond cochlear implants.
Diffusion-based speech enhancement on discrete audio codec features has gained immense attention due to its improved speech component reconstruction capability. However, such models usually suffer from high inference computational complexity due to multiple reverse process iterations. Furthermore, they generally achieve promising results on non-intrusive metrics but show poor performance on intrusive metrics, as they may struggle to reconstruct the correct phones. In this paper, we propose DisContSE, an efficient diffusion-based speech enhancement model operating on joint discrete codec tokens and continuous embeddings. Our contributions are three-fold. First, we formulate both a discrete and a continuous enhancement module operating on discrete audio codec tokens and continuous embeddings, respectively, to achieve improved fidelity and intelligibility simultaneously. Second, a semantic enhancement module is further adopted to achieve optimal phonetic accuracy. Third, we achieve a single-step efficient reverse process in inference with a novel quantization error mask initialization strategy, which, to our knowledge, is the first successful single-step diffusion speech enhancement based on an audio codec. Trained and evaluated on the URGENT 2024 Speech Enhancement Challenge data splits, the proposed DisContSE outperforms top-reported time- and frequency-domain diffusion baselines in PESQ, POLQA, UTMOS, and a subjective ITU-T P.808 listening test, clearly achieving an overall top rank.
Primary: unknown
All Institutions: unknown
The paper presents DisContSE, a novel diffusion-based speech enhancement model that effectively integrates discrete codec tokens and continuous embeddings, achieving state-of-the-art results while significantly reducing inference complexity. This contribution is poised to advance the field of speech processing, particularly in enhancing audio quality in real-time applications.
The methodology is well-structured, combining discrete and continuous embeddings to enhance speech quality while reducing computational complexity. The introduction of a single-step reverse process is innovative and addresses a significant limitation in existing diffusion models. The use of quantization error mask initialization is a novel approach that enhances the model's efficiency and effectiveness.
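The quantization error mask initialization is only summarized here; as a loose, hypothetical illustration of the idea (not the authors' implementation — `denoiser`, the mask rule, and all shapes are assumptions), a single-step reverse call might look like:

```python
import torch

def single_step_enhance(denoiser, cont_emb, quant_emb, t_final=999):
    """One-shot reverse step: the denoiser predicts the clean continuous embedding from
    an initialization built from the quantized embedding and a mask flagging positions
    with large quantization error (illustrative only)."""
    quant_err = (cont_emb - quant_emb).abs()                 # per-element quantization error
    mask = (quant_err > quant_err.mean()).float()            # crude error mask (assumption)
    x_init = quant_emb + mask * torch.randn_like(quant_emb)  # re-noise only the unreliable parts
    t = torch.full((cont_emb.shape[0],), t_final, device=cont_emb.device)
    return denoiser(x_init, t, cond=quant_emb)               # single forward pass, no iteration

# Usage sketch with dummy shapes: batch of 4, 100 frames, 256-dim embeddings.
dummy_denoiser = lambda x, t, cond: x   # stand-in for a trained denoising network
cont = torch.randn(4, 100, 256)         # continuous codec-encoder embedding (noisy speech)
quant = torch.round(cont * 4) / 4       # toy "quantized" version standing in for codec tokens
enhanced = single_step_enhance(dummy_denoiser, cont, quant)
```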
The experiments are thorough, utilizing a large-scale dataset and comparing against multiple state-of-the-art methods. The results demonstrate significant improvements across various metrics, indicating the robustness of the proposed model. The subjective listening tests add credibility to the findings.
The paper provides sufficient implementation details, including training configurations and metrics used, which aids in reproducibility. However, the lack of access to the actual code or model weights limits full reproducibility.
The paper does not address potential limitations in terms of generalizability across different languages or accents, nor does it discuss the computational requirements for real-time applications. Additionally, the subjective nature of some evaluation metrics may introduce bias.
The proposed model has significant implications for real-time speech enhancement applications, particularly in scenarios with low-quality audio inputs. Its efficiency could facilitate broader adoption in consumer electronics and assistive technologies. The paper presents DisContSE, a novel diffusion-based speech enhancement model that effectively integrates discrete codec tokens and continuous embeddings, achieving state-of-the-art results while significantly reducing inference complexity. This contribution is poised to advance the field of speech processing, particularly in enhancing audio quality in real-time applications.
We introduce Mix2Morph, a text-to-audio diffusion model fine-tuned to perform sound morphing without a dedicated dataset of morphs. By finetuning on noisy surrogate mixes at higher diffusion timesteps, Mix2Morph yields stable, perceptually coherent morphs that convincingly integrate qualities of both sources. We specifically target sound infusions, a practically and perceptually motivated subclass of morphing in which one sound acts as the dominant primary source, providing overall temporal and structural behavior, while a secondary sound is infused throughout, enriching its timbral and textural qualities. Objective evaluations and listening tests show that Mix2Morph outperforms prior baselines and produces high-quality sound infusions across diverse categories, representing a step toward more controllable and concept-driven tools for sound design. Sound examples are available at https://anniejchu.github.io/mix2morph .
Primary: Northwestern University
All Institutions: Northwestern University
Mix2Morph represents a substantial advancement in sound morphing techniques, leveraging innovative training strategies and augmentation methods to produce high-quality audio infusions. The comprehensive evaluation of its performance against existing models highlights its potential to enhance sound design practices significantly.
The paper presents a novel approach to sound morphing through the Mix2Morph model, which utilizes a finetuning strategy on noisy surrogate mixes. This method allows the model to learn morphing without the need for a dedicated morph dataset, addressing a significant limitation in the field. The use of higher diffusion timesteps to focus on capturing high-level morphing concepts while suppressing low-level artifacts is particularly innovative. The augmentation techniques, including temporal and spectral alignment, are well-justified and enhance the model's ability to produce coherent morphs. However, the methodology could benefit from a more detailed discussion on the choice of augmentation modes and their impact on the results.
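The surrogate-mix construction at high diffusion timesteps can be sketched under standard DDPM assumptions (the mixing weights, noise schedule, and timestep range below are assumptions, not the paper's values):

```python
import torch

def noisy_surrogate_mix(x_a, x_b, alphas_cumprod, t_min=700, t_max=999):
    """Mix two source latents/waveforms and apply DDPM forward noising at a high timestep,
    so only coarse, high-level structure of the mix survives as a training target."""
    mix = 0.5 * (x_a + x_b)                                  # naive surrogate mix of the two sources
    t = torch.randint(t_min, t_max + 1, (mix.shape[0],))     # restrict to high timesteps (assumed range)
    a_bar = alphas_cumprod[t].view(-1, *([1] * (mix.dim() - 1)))
    noise = torch.randn_like(mix)
    x_t = a_bar.sqrt() * mix + (1.0 - a_bar).sqrt() * noise  # q(x_t | x_0) forward process
    return x_t, t, noise

# Toy usage: a linear beta schedule with 1000 steps and two batches of 1-D "latents".
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x_a, x_b = torch.randn(8, 128), torch.randn(8, 128)
x_t, t, noise = noisy_surrogate_mix(x_a, x_b, alphas_cumprod)
```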
The experiments are comprehensive, evaluating Mix2Morph against several baselines through both objective metrics and subjective listening tests. The paper provides a clear rationale for the selection of sound pairs and the design of the evaluation metrics, including the Latent Compressibility Score (LCS) and directionality measures. The results demonstrate that Mix2Morph consistently outperforms existing methods, showcasing its effectiveness in generating high-quality sound infusions. The statistical analysis of the subjective evaluations adds robustness to the findings.
The paper includes sufficient details regarding the model architecture and training procedures, allowing for reproducibility. However, the lack of a publicly available code repository limits the ease with which other researchers can replicate the results. Providing access to the training data or a similar dataset would further enhance reproducibility.
One limitation is the reliance on noisy surrogate mixes, which may not fully capture the complexity of high-quality morphs. Additionally, while the model shows improvements over baselines, there may still be cases where perceptual coherence is not fully achieved, particularly with more complex sound pairs. The subjective evaluation is limited to a small sample size, which may not represent the broader community's perceptions.
The Mix2Morph model has significant implications for sound design, particularly in fields such as film, gaming, and virtual reality, where high-quality sound morphing is essential for creating immersive experiences. The ability to generate sound infusions without extensive datasets opens new avenues for creativity and exploration in audio production. Mix2Morph represents a substantial advancement in sound morphing techniques, leveraging innovative training strategies and augmentation methods to produce high-quality audio infusions. The comprehensive evaluation of its performance against existing models highlights its potential to enhance sound design practices significantly.
Audio watermarking embeds auxiliary information into speech while maintaining speaker identity, linguistic content, and perceptual quality. Although recent advances in neural and digital signal processing-based watermarking methods have improved imperceptibility and embedding capacity, robustness is still primarily assessed against conventional distortions such as compression, additive noise, and resampling. However, the rise of deep learning-based attacks introduces novel and significant threats to watermark security. In this work, we investigate self voice conversion as a universal, content-preserving attack against audio watermarking systems. Self voice conversion remaps a speaker's voice to the same identity while altering acoustic characteristics through a voice conversion model. We demonstrate that this attack severely degrades the reliability of state-of-the-art watermarking approaches and highlight its implications for the security of modern audio watermarking techniques.
Primary: National Institute of Informatics
All Institutions: National Institute of Informatics
This paper presents a significant advancement in understanding the vulnerabilities of audio watermarking systems against modern voice conversion techniques. The research effectively combines theoretical insights with practical evaluations, making it a valuable contribution to the fields of audio processing and security.
The paper introduces a novel attack method, self voice conversion (VC), which effectively preserves speaker identity and linguistic content while degrading the performance of audio watermarking systems. The methodology is well-structured, detailing the attack framework and the specific voice conversion models employed (kNN-VC and RVC). The authors provide a clear rationale for the choice of methods and their relevance to the threat model, demonstrating a deep understanding of both the watermarking and voice conversion domains. However, the methodology could benefit from more extensive comparisons with other potential attack strategies to further validate its effectiveness.
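Since the attack code is not released, the pipeline can only be sketched with hypothetical placeholders (`embed_watermark`, `detect_watermark`, and `voice_convert` stand in for a real watermarking system and a VC model such as kNN-VC or RVC):

```python
import numpy as np

def self_vc_attack(wave: np.ndarray, sr: int, embed_watermark, detect_watermark, voice_convert):
    """Embed a watermark, remap the speaker's voice onto the *same* identity with a VC model,
    and check whether the watermark survives the conversion (illustrative pipeline only)."""
    payload = np.random.randint(0, 2, size=16)                  # toy 16-bit message
    marked = embed_watermark(wave, sr, payload)                 # hypothetical watermark embedder
    attacked = voice_convert(marked, sr, target_speaker=wave)   # self-VC: target = the same speaker
    recovered = detect_watermark(attacked, sr)                  # hypothetical watermark detector
    bit_acc = float((recovered == payload).mean())
    return bit_acc  # values near 0.5 indicate the watermark is effectively destroyed
```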
The experimental evaluation is robust, employing a variety of watermarking systems and assessing their performance under the proposed self VC attack. The results are clearly presented in tables, showing the degradation of watermark extraction accuracy across different methods. The use of standard datasets (LibriTTS) and metrics (WER, UTMOS) adds credibility to the findings. However, the paper could improve by including more detailed statistical analyses to support the significance of the results.
The paper lacks specific implementation details or access to code and models, which limits reproducibility. While the authors mention that source code and model checkpoints will not be publicly released due to potential security implications, providing at least some implementation details or pseudo-code would enhance reproducibility and allow other researchers to validate the findings.
One limitation is the lack of real-world testing scenarios; the experiments are conducted under controlled conditions that may not fully capture the complexities of real-world audio processing and watermarking. Additionally, the reliance on specific voice conversion models may limit the generalizability of the findings to other models or methods not considered in the study.
The implications of this research are significant, as it highlights vulnerabilities in current audio watermarking techniques in the face of advanced voice conversion technologies. This work could inform the development of more robust watermarking methods and raise awareness about the potential misuse of voice conversion technologies in various applications, including copyright infringement and misinformation. This paper presents a significant advancement in understanding the vulnerabilities of audio watermarking systems against modern voice conversion techniques. The research effectively combines theoretical insights with practical evaluations, making it a valuable contribution to the fields of audio processing and security.
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
Primary: Idiap Research Institute
All Institutions: Idiap Research Institute, Uniphore
The main contribution of this paper is the introduction of a novel text-only adaptation method for LLM-based ASR, which effectively preserves cross-modal alignment while improving performance in various adaptation scenarios. This work represents a meaningful advancement in the field, addressing a critical challenge in ASR systems and providing a practical solution for domain adaptation.
The proposed methodology introduces a novel approach to text-only adaptation in LLM-based ASR by framing the adaptation as a text denoising task. This reframing is innovative as it allows the model to learn from noisy text inputs without requiring additional parameters or architectural changes. The use of a multi-view noise-driven batching strategy is particularly effective in maintaining the alignment between speech and text modalities, which is a critical aspect of ASR systems. The authors provide a clear explanation of how the noise function is constructed and how it facilitates the training process, making the methodology both sound and theoretically grounded.
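The exact noise function is not reproduced in this review; a minimal sketch of the general idea (word drops and adjacent swaps standing in for projector-like corruption, with assumed noise rates and batch proportions) is:

```python
import random

def corrupt_transcript(text: str, p_drop: float = 0.1, p_swap: float = 0.1) -> str:
    """Produce a noisy version of a clean transcript by randomly dropping words and
    swapping adjacent words, emulating imperfect acoustic evidence (rates are assumptions)."""
    words = text.split()
    kept = [w for w in words if random.random() > p_drop]       # word deletions
    i = 0
    while i < len(kept) - 1:                                    # local adjacent swaps
        if random.random() < p_swap:
            kept[i], kept[i + 1] = kept[i + 1], kept[i]
            i += 2
        else:
            i += 1
    return " ".join(kept) if kept else text

def make_denoising_batch(target_texts, source_texts, n_target=6, n_source=2):
    """Mix noisy target-domain pairs with clean source-domain pairs in one batch,
    mirroring a multi-view, noise-driven batching strategy (proportions are assumptions)."""
    batch = [(corrupt_transcript(t), t) for t in random.sample(target_texts, n_target)]
    batch += [(s, s) for s in random.sample(source_texts, n_source)]
    random.shuffle(batch)
    return batch  # list of (noisy_input, clean_target) pairs for LLM fine-tuning
```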
The experimental evaluation is thorough, utilizing two distinct datasets that represent realistic conversational scenarios. The results demonstrate significant improvements in WER across various adaptation scenarios, showcasing the effectiveness of the proposed method compared to existing techniques. The inclusion of ablation studies further strengthens the findings by isolating the contributions of key components in the training process. However, the paper could benefit from additional details on the experimental setup, such as hyperparameter tuning and validation strategies.
The paper provides a solid foundation for reproducibility with a detailed description of the experimental setup, including the models used and the training process. However, the lack of publicly available code or datasets limits the ability for other researchers to fully replicate the results. Including a link to a GitHub repository or similar would enhance the reproducibility of the findings.
One limitation of the proposed approach is its reliance on the quality of the noise function, which may not perfectly emulate the outputs of a speech projector. Additionally, while the method shows promise in various adaptation scenarios, performance still lags behind audio-based adaptation, indicating that further improvements are needed. The paper also does not address the potential computational costs associated with the proposed training strategy.
This research has significant implications for the field of ASR, particularly in scenarios where audio data is scarce or expensive to obtain. By enabling effective adaptation using only text data, the proposed method could facilitate the deployment of ASR systems in diverse domains, enhancing accessibility and usability in real-world applications. The approach could also inspire further research into alternative adaptation strategies that leverage text data in innovative ways. The main contribution of this paper is the introduction of a novel text-only adaptation method for LLM-based ASR, which effectively preserves cross-modal alignment while improving performance in various adaptation scenarios. This work represents a meaningful advancement in the field, addressing a critical challenge in ASR systems and providing a practical solution for domain adaptation.
Recent advances in Text-to-Speech (TTS) systems have substantially increased the realism of synthetic speech, raising new challenges for audio deepfake detection. This work presents a comparative evaluation of three state-of-the-art TTS models--Dia2, Maya1, and MeloTTS--representing streaming, LLM-based, and non-autoregressive architectures. A corpus of 12,000 synthetic audio samples was generated using the Daily-Dialog dataset and evaluated against four detection frameworks, including semantic, structural, and signal-level approaches. The results reveal significant variability in detector performance across generative mechanisms: models effective against one TTS architecture may fail against others, particularly LLM-based synthesis. In contrast, a multi-view detection approach combining complementary analysis levels demonstrates robust performance across all evaluated models. These findings highlight the limitations of single-paradigm detectors and emphasize the necessity of integrated detection strategies to address the evolving landscape of audio deepfake threats.
Primary: UncovAI
All Institutions: UncovAI, GENCI-IDRIS
The main contribution of this paper is a systematic evaluation of advanced TTS models against various detection frameworks, revealing significant challenges in audio deepfake detection. This work is crucial in addressing the evolving landscape of synthetic speech technologies and highlights the necessity for integrated detection strategies to combat emerging threats in audio forensics.
The paper employs a systematic approach to evaluate the performance of three advanced TTS models (Dia2, Maya1, and MeloTTS) against multiple detection frameworks. The authors construct a novel dataset of 12,000 synthetic audio samples, ensuring a diverse representation of modern TTS architectures. The methodology is well-structured, utilizing a multi-faceted detection strategy that combines semantic, structural, and signal-level analyses. However, the reliance on specific models and the absence of a broader range of TTS systems may limit the generalizability of the findings.
The experiments are comprehensive, with a clear focus on evaluating the performance of different detection models against the generated audio samples. The use of various metrics (EER, AUC, F1-Score) provides a robust framework for assessing detection capabilities. The results indicate significant variability in detector performance, particularly highlighting the challenges posed by LLM-based synthesis. The paper successfully demonstrates the limitations of single-paradigm detectors, emphasizing the need for integrated detection strategies.
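For reference, the reported metrics can be recomputed from raw detector scores; a minimal sketch using scikit-learn (the score and label arrays are toy placeholders) is:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def compute_eer_auc(labels: np.ndarray, scores: np.ndarray):
    """EER is the operating point where false-positive and false-negative rates are equal;
    labels use 1 = synthetic (positive), 0 = bona fide, scores are 'synthetic-ness'."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    eer_idx = np.nanargmin(np.abs(fnr - fpr))
    eer = float((fpr[eer_idx] + fnr[eer_idx]) / 2.0)
    auc = float(roc_auc_score(labels, scores))
    return eer, auc

# Toy usage: 100 bona fide and 100 synthetic trials with overlapping score distributions.
rng = np.random.default_rng(0)
labels = np.concatenate([np.zeros(100), np.ones(100)])
scores = np.concatenate([rng.normal(0.3, 0.15, 100), rng.normal(0.7, 0.15, 100)])
print(compute_eer_auc(labels, scores))
```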
The paper lacks detailed implementation specifics, such as hyperparameters, training protocols, and model architectures, which may hinder reproducibility. While the methodology is described, the absence of code or supplementary materials limits the ability for other researchers to replicate the experiments fully.
One limitation is the focus on only three TTS models, which may not encompass the full spectrum of current TTS technologies. Additionally, the dataset is derived from a single source (DailyDialog), potentially introducing biases that could affect the generalizability of the results. The paper also does not address the potential for adversarial attacks on detection models, which is a critical aspect in real-world applications.
This research has significant implications for the fields of audio forensics and security, particularly as TTS technologies continue to evolve. The findings underscore the importance of developing robust detection frameworks that can adapt to new generative mechanisms. The work could inform future research directions and the development of more resilient audio deepfake detection systems. The main contribution of this paper is a systematic evaluation of advanced TTS models against various detection frameworks, revealing significant challenges in audio deepfake detection. This work is crucial in addressing the evolving landscape of synthetic speech technologies and highlights the necessity for integrated detection strategies to combat emerging threats in audio forensics.
Modern zero-shot text-to-speech (TTS) models offer unprecedented expressivity but also pose serious crime risks, as they can synthesize voices of individuals who never consented. In this context, speaker unlearning aims to prevent the generation of specific speaker identities upon request. Existing approaches, reliant on retraining, are costly and limited to speakers seen in the training set. We present TruS, a training-free speaker unlearning framework that shifts the paradigm from data deletion to inference-time control. TruS steers identity-specific hidden activations to suppress target speakers while preserving other attributes (e.g., prosody and emotion). Experimental results show that TruS effectively prevents voice generation on both seen and unseen opt-out speakers, establishing a scalable safeguard for speech synthesis. The demo and code are available on http://mmai.ewha.ac.kr/trus.
Primary: Ewha Womans University
All Institutions: Ewha Womans University
The main contribution of this paper is the introduction of TruS, a training-free speaker unlearning framework that effectively prevents the generation of specific speaker identities in zero-shot TTS models. This work represents a meaningful advancement in the field of audio machine learning, addressing critical privacy concerns while maintaining the expressivity of synthesized speech.
The proposed TruS framework introduces a novel approach to speaker unlearning by focusing on inference-time control rather than traditional data deletion methods. This is a significant shift in perspective, as it allows for the suppression of specific speaker identities without the need for retraining the model. The methodology is well-structured, leveraging hidden activations to maintain other speech attributes, which is a clever way to balance the need for privacy with the expressivity of TTS systems. However, the paper could benefit from a more detailed explanation of the underlying mechanisms that allow for this control, particularly in terms of how it identifies and manipulates the relevant activations.
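How TruS derives its identity-specific directions is only summarized above; a generic sketch of inference-time activation steering with a PyTorch forward hook (the steering direction, layer choice, and strength are assumptions) is:

```python
import torch

def add_identity_suppression_hook(layer: torch.nn.Module, identity_dir: torch.Tensor, strength: float = 1.0):
    """Project the layer's hidden states away from a pre-computed identity direction
    at inference time, leaving other components of the activation untouched."""
    unit = identity_dir / identity_dir.norm()

    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        proj = (h @ unit).unsqueeze(-1) * unit        # component along the identity direction
        steered = h - strength * proj                 # suppress that component
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return layer.register_forward_hook(hook)

# Usage sketch on a dummy linear layer with 512-dim hidden states.
layer = torch.nn.Linear(512, 512)
identity_dir = torch.randn(512)                       # would come from opt-out speaker statistics
handle = add_identity_suppression_hook(layer, identity_dir)
_ = layer(torch.randn(2, 10, 512))                    # forward passes are now steered
handle.remove()                                       # detach the hook when done
```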
The experimental results are compelling, demonstrating the effectiveness of TruS in preventing voice generation for both seen and unseen opt-out speakers. The evaluation metrics used appear to be appropriate for assessing the framework's performance, and the results are presented clearly. However, the paper would be strengthened by including more extensive comparisons with existing methods, particularly those that involve retraining, to better contextualize the advantages of the proposed approach.
The paper mentions that the demo and code are available online, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics that would help other researchers replicate the results. Providing a clearer description of the datasets used, the experimental setup, and the parameters would enhance reproducibility.
One limitation is the reliance on the model's ability to suppress specific identities without retraining, which may not be universally applicable across all TTS architectures. Additionally, the paper does not address potential edge cases where the suppression might fail or lead to unintended consequences in voice synthesis. The scalability of the approach in real-world applications is also not thoroughly discussed.
The implications of this research are significant, particularly in the context of privacy and ethical concerns surrounding voice synthesis technologies. By providing a method for speaker unlearning, this work could help mitigate risks associated with unauthorized voice generation, thereby enhancing user trust in TTS systems. The framework has potential applications in various fields, including entertainment, security, and personal privacy. The main contribution of this paper is the introduction of TruS, a training-free speaker unlearning framework that effectively prevents the generation of specific speaker identities in zero-shot TTS models. This work represents a meaningful advancement in the field of audio machine learning, addressing critical privacy concerns while maintaining the expressivity of synthesized speech.
Recent neural audio compression models often rely on residual vector quantization for high-fidelity coding, but using a fixed number of per-frame codebooks is suboptimal for the wide variability of audio content, especially for signals that are either very simple or highly complex. To address this limitation, we propose SwitchCodec, a neural audio codec based on Residual Experts Vector Quantization (REVQ). REVQ combines a shared quantizer with dynamically routed expert quantizers that are activated according to the input audio, decoupling bitrate from codebook capacity and improving compression efficiency. This design ensures full training and utilization of each quantizer. In addition, a variable-bitrate mechanism adjusts the number of active expert quantizers at inference, enabling multi-bitrate operation without retraining. Experiments demonstrate that SwitchCodec surpasses existing baselines on both objective metrics and subjective listening tests.
Primary: Hangzhou Dianzi University
All Institutions: Hangzhou Dianzi University
The main contribution of this work is the introduction of SwitchCodec, a high-fidelity neural audio codec that utilizes Residual Experts Vector Quantization (REVQ) to improve audio compression efficiency and adaptability. This innovative approach addresses the limitations of existing codecs, demonstrating substantial improvements in audio quality across a range of bitrates, thereby advancing the field of neural audio coding.
The paper introduces SwitchCodec, which innovatively employs Residual Experts Vector Quantization (REVQ) to enhance audio compression. The methodology effectively combines a shared quantizer with dynamically routed expert quantizers, addressing the limitations of fixed quantization structures. The dual-path design allows for improved compression efficiency by decoupling bitrate from codebook capacity, which is a significant advancement over traditional methods. The use of a gating network to select quantizers based on audio content is a notable feature that enhances adaptability and performance.
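The routing mechanism is described only qualitatively here; a loose sketch of content-dependent expert selection (codebook sizes, top-k routing, and the reduction of each quantizer to plain nearest-neighbour VQ are all assumptions) is:

```python
import torch

class ToyREVQ(torch.nn.Module):
    """Shared quantizer plus a gate that routes each utterance to top-k expert quantizers."""

    def __init__(self, dim=128, codebook=256, n_experts=4, top_k=2):
        super().__init__()
        self.shared = torch.nn.Embedding(codebook, dim)
        self.experts = torch.nn.ModuleList(torch.nn.Embedding(codebook, dim) for _ in range(n_experts))
        self.gate = torch.nn.Linear(dim, n_experts)
        self.top_k = top_k

    @staticmethod
    def quantize(x, codebook):
        # Nearest-neighbour lookup: flatten (B, T, D) frames against the (K, D) codebook.
        flat = x.reshape(-1, x.shape[-1])
        idx = torch.cdist(flat, codebook.weight).argmin(dim=-1)
        return codebook(idx).reshape(x.shape)

    def forward(self, z):                                   # z: (B, T, D) encoder features
        q_shared = self.quantize(z, self.shared)            # shared first-stage quantization
        scores = self.gate(z.mean(dim=1))                   # utterance-level routing logits
        topk = torch.topk(scores, self.top_k, dim=-1).indices
        outputs = []
        for b in range(z.shape[0]):                         # refine each item's residual with its experts
            q_b = q_shared[b:b + 1]
            r = z[b:b + 1] - q_b
            for e in topk[b].tolist():
                q_e = self.quantize(r, self.experts[e])
                q_b = q_b + q_e
                r = r - q_e
            outputs.append(q_b)
        return torch.cat(outputs, dim=0)                    # quantized representation for the decoder

# Usage: quantize a batch of 2 utterances, 50 frames, 128-dim features.
codes = ToyREVQ()(torch.randn(2, 50, 128))
```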
The experiments are robust, utilizing multiple datasets (VCTK, LibriTTS, FMA, Common Voice) and employing both objective metrics (ViSQOL, Mel-spectrogram distance, STFT distance, PESQ) and subjective listening tests (MUSHRA). The results demonstrate that SwitchCodec outperforms existing baselines (DAC and EnCodec) across various bitrates, indicating its effectiveness in maintaining audio quality while optimizing bitrate. The thorough evaluation across different audio types strengthens the paper's claims.
The implementation details are adequately described, including training parameters, dataset preparation, and model architecture. The authors provide a demo URL with audio samples, which enhances reproducibility. However, the absence of a public code repository limits the ease of full reproducibility.
While the proposed method shows significant improvements, the paper does not extensively discuss potential limitations, such as the computational overhead introduced by the routing mechanism or the need for careful tuning of the gating network. Additionally, the scalability of the approach to more complex audio types or real-time applications is not addressed.
The advancements presented in SwitchCodec have the potential to significantly impact audio streaming and storage solutions, particularly in bandwidth-constrained environments. The ability to adaptively allocate bitrate based on content complexity could lead to more efficient use of resources in various applications, including music streaming, telecommunications, and multimedia content delivery. The main contribution of this work is the introduction of SwitchCodec, a high-fidelity neural audio codec that utilizes Residual Experts Vector Quantization (REVQ) to improve audio compression efficiency and adaptability. This innovative approach addresses the limitations of existing codecs, demonstrating substantial improvements in audio quality across a range of bitrates, thereby advancing the field of neural audio coding.
Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. Surprisingly, optimizing for acoustic, speaker, or linguistic diversity yields no clear improvements over random sampling. Instead, we find that prioritizing the longest utterances achieves superior ASR results while using only half the original dataset, reducing pre-training time by 24% on a large corpus. These findings suggest that for pre-training speech SSL models, data length is a more critical factor than either data diversity or overall data quantity for performance and efficiency, offering a new perspective for data selection strategies in SSL speech processing.
Primary: University of Cambridge
All Institutions: University of Cambridge, Laboratoire Informatique d'Avignon, Laboratoire d'Informatique de Grenoble
This paper provides valuable insights into the importance of data selection strategies in self-supervised speech models, highlighting the critical role of utterance length over diversity. The findings challenge existing assumptions in the field and offer a new perspective that could reshape future research and applications in speech processing.
The paper presents a systematic exploration of data selection strategies for pre-training self-supervised speech models, specifically focusing on the impact of utterance length versus diversity in acoustic, speaker, and linguistic features. The methodology is robust, employing a large-scale dataset (Loquacious) and a well-defined experimental setup that includes various sampling strategies. However, the reliance on simple unsupervised data selection methods, while insightful, may not fully leverage more complex data selection techniques that could yield even more significant results.
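The winning selection strategy amounts to keeping the longest utterances; a minimal sketch over a manifest of (utterance_id, duration) pairs is:

```python
def select_longest(manifest, keep_fraction=0.5):
    """Keep the longest `keep_fraction` of utterances by duration (seconds),
    mirroring the length-based selection strategy described above."""
    ranked = sorted(manifest, key=lambda item: item[1], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Toy usage: four utterances with durations in seconds.
manifest = [("utt1", 2.1), ("utt2", 14.8), ("utt3", 7.3), ("utt4", 0.9)]
print(select_longest(manifest))  # -> [("utt2", 14.8), ("utt3", 7.3)]
```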
The experimental design is comprehensive, utilizing a substantial dataset and comparing multiple data selection strategies against baselines. The results clearly indicate that prioritizing longer utterances leads to better ASR performance, which is a significant finding. However, the paper could benefit from additional statistical analysis to further validate the robustness of the results across different datasets or conditions.
The authors commit to making their code public for reproducibility, which is a positive aspect. However, the paper lacks specific URLs for the code repository, which would facilitate easier access for other researchers. Detailed descriptions of the training settings and hyperparameters are provided, enhancing reproducibility.
The study's limitations include a narrow focus on data length without exploring more sophisticated data selection techniques. Additionally, the findings may not generalize across different languages or domains, as the study is primarily based on English speech data. The paper also does not address the potential trade-offs between data length and other factors that might influence performance.
The findings have significant implications for the development of self-supervised learning models in speech processing, particularly in optimizing pre-training datasets for efficiency. By demonstrating that longer utterances can yield better performance, this research could influence future practices in data collection and model training, potentially leading to more efficient use of computational resources in the field. This paper provides valuable insights into the importance of data selection strategies in self-supervised speech models, highlighting the critical role of utterance length over diversity. The findings challenge existing assumptions in the field and offer a new perspective that could reshape future research and applications in speech processing.
This paper focuses on audio deepfake detection under real-world communication degradations, with an emphasis on ultra-short inputs (0.5-2.0s), targeting the capability to detect synthetic speech at a conversation opening, e.g., when a scammer says "Hi." We propose Short-MGAA (S-MGAA), a novel lightweight extension of Multi-Granularity Adaptive Time-Frequency Attention, designed to enhance discriminative representation learning for short, degraded inputs subjected to communication processing and perturbations. The S-MGAA integrates two tailored modules: a Pixel-Channel Enhanced Module (PCEM) that amplifies fine-grained time-frequency saliency, and a Frequency Compensation Enhanced Module (FCEM) to supplement limited temporal evidence via multi-scale frequency modeling and adaptive frequency-temporal interaction. Extensive experiments demonstrate that S-MGAA consistently surpasses nine state-of-the-art baselines while achieving strong robustness to degradations and favorable efficiency-accuracy trade-offs, including low RTF, competitive GFLOPs, compact parameters, and reduced training cost, highlighting its strong potential for real-time deployment in communication systems and edge devices.
Primary: Institute for Digital Technologies
All Institutions: Institute for Digital Technologies, Loughborough University London, University of Exeter
This paper presents a novel lightweight framework for audio deepfake detection that excels in ultra-short inputs, demonstrating significant improvements over existing methods while maintaining efficiency for real-time applications. The technical contributions are substantial, addressing a critical gap in the field of audio deepfake detection.
The proposed Short-MGAA (S-MGAA) framework innovatively addresses the challenge of audio deepfake detection in ultra-short inputs (0.5-2.0s) by integrating two specialized modules: the Pixel-Channel Enhanced Module (PCEM) and the Frequency Compensation Enhanced Module (FCEM). This dual-module approach enhances the model's ability to capture fine-grained time-frequency saliency and compensates for limited temporal evidence, which is crucial for real-time applications. The architecture's lightweight design is commendable, making it suitable for deployment on resource-constrained devices. However, the paper could benefit from a more detailed theoretical justification for the specific choices of module designs and their interactions.
The experiments are robust, utilizing a comprehensive dataset constructed from multiple sources, which enhances the generalizability of the findings. The paper effectively compares S-MGAA against nine state-of-the-art baselines under various degradation conditions, demonstrating significant performance improvements, particularly in ultra-short audio scenarios. The use of Equal Error Rate (EER) as a metric is appropriate for the task, and the results convincingly illustrate the advantages of the proposed method. However, the paper could provide more insight into the statistical significance of the results to strengthen claims of superiority.
The implementation details are well-documented, including the dataset preparation and the training process. The authors provide a GitHub repository for code access, which is a positive aspect for reproducibility. However, the paper lacks detailed hyperparameter settings and specific configurations used during training, which could hinder full reproducibility by other researchers.
While the proposed method shows promising results, it is primarily tested on a specific set of audio inputs and conditions. The performance in more diverse real-world scenarios remains to be evaluated. Additionally, the lightweight nature of the model may come at the cost of performance in more complex or longer audio inputs, which could limit its applicability in broader contexts.
The implications of this research are significant, particularly in enhancing security measures against audio deepfakes in real-time communication systems. The ability to detect synthetic speech at the onset of a conversation could prevent scams and misinformation, contributing to safer digital interactions. As deepfake technology continues to evolve, advancements in detection methods like S-MGAA will be crucial in maintaining trust in audio communications. This paper presents a novel lightweight framework for audio deepfake detection that excels in ultra-short inputs, demonstrating significant improvements over existing methods while maintaining efficiency for real-time applications. The technical contributions are substantial, addressing a critical gap in the field of audio deepfake detection.
Most audio-visual speaker extraction methods rely on synchronized lip recording to isolate the speech of a target speaker from a multi-talker mixture. However, in natural human communication, co-speech gestures are also temporally aligned with speech, often emphasizing specific words or syllables. These gestures provide complementary visual cues that can be especially valuable when facial or lip regions are occluded or distant. In this work, we move beyond lip-centric approaches and propose SeLG, a model that integrates both lip and upper-body gesture information for robust speaker extraction. SeLG features a cross-attention-based fusion mechanism that enables each visual modality to query and selectively attend to relevant speech features in the mixture. To improve the alignment of gesture representations with speech dynamics, SeLG also employs a contrastive InfoNCE loss that encourages gesture embeddings to align more closely with corresponding lip embeddings, which are more strongly correlated with speech. Experimental results on the YGD dataset, containing TED talks, demonstrate that the proposed contrastive learning strategy significantly improves gesture-based speaker extraction, and that our proposed SeLG model, by effectively fusing lip and gesture cues with an attention mechanism and InfoNCE loss, achieves superior performance compared to baselines, across both complete and partial (i.e., missing-modality) conditions.
Primary: University of Science and Technology Beijing
All Institutions: University of Science and Technology Beijing, Alibaba Group, Tongyi Lab
The main contribution of this paper is the SeLG model, which effectively integrates lip and gesture cues for robust audio-visual speaker extraction, significantly improving performance in scenarios with missing modalities. This work represents a meaningful advancement in the field, addressing limitations of existing lip-centric approaches and providing a framework that could be applied in various practical applications.
The proposed SeLG model introduces a novel approach to audio-visual speaker extraction by integrating both lip and upper-body gesture cues. The use of a cross-attention mechanism for fusing these modalities is innovative, allowing for more nuanced interactions between visual inputs and speech features. The incorporation of a contrastive InfoNCE loss to align gesture embeddings with lip movements is a significant methodological advancement that enhances the robustness of the model, particularly in scenarios where one modality is missing or occluded. The architecture is well-structured, with clear delineation of components such as gesture and lip encoders, and a speech decoder, which collectively contribute to the model's efficacy.
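The contrastive alignment objective is standard InfoNCE; a minimal sketch pairing pooled gesture and lip embeddings within a batch (the temperature is an assumed hyperparameter) is:

```python
import torch
import torch.nn.functional as F

def infonce_align(gesture_emb: torch.Tensor, lip_emb: torch.Tensor, temperature: float = 0.07):
    """Pull each gesture embedding toward its own utterance's lip embedding and push it
    away from other utterances in the batch; inputs are (B, D) pooled embeddings."""
    g = F.normalize(gesture_emb, dim=-1)
    l = F.normalize(lip_emb, dim=-1)
    logits = g @ l.t() / temperature                     # (B, B) similarity matrix
    targets = torch.arange(g.shape[0], device=g.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Usage sketch: batch of 8 utterances with 256-dim pooled embeddings.
loss = infonce_align(torch.randn(8, 256), torch.randn(8, 256))
```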
The experimental setup is robust, utilizing the YGD dataset with a substantial number of samples and a clear division between training, validation, and testing. The results demonstrate significant improvements over baseline models, particularly in conditions with missing modalities, which underscores the effectiveness of the proposed methods. The use of SI-SNR as a performance metric is appropriate for evaluating the quality of the extracted speech signals. The detailed analysis of results across different scenarios (complete vs. missing modalities) adds depth to the evaluation, showcasing the model's versatility.
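SI-SNR, the metric used throughout the evaluation, can be computed directly; a minimal sketch of the zero-mean, scale-invariant formulation is:

```python
import torch

def si_snr(estimate: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for 1-D signals (or batches along the last dim)."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    s_target = (estimate * target).sum(dim=-1, keepdim=True) * target / (target.pow(2).sum(dim=-1, keepdim=True) + eps)
    e_noise = estimate - s_target
    return 10 * torch.log10(s_target.pow(2).sum(dim=-1) / (e_noise.pow(2).sum(dim=-1) + eps))

# Usage: a clean reference and a slightly noisy estimate.
ref = torch.randn(16000)
print(si_snr(ref + 0.1 * torch.randn(16000), ref))  # roughly 20 dB for this noise level
```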
While the paper provides a comprehensive description of the model architecture and training procedures, it lacks specific implementation details such as code availability or links to a repository. This absence may hinder reproducibility, as other researchers would need to implement the model from scratch based on the provided descriptions.
One limitation of the study is the reliance on the YGD dataset, which may not fully represent the diversity of real-world scenarios encountered in audio-visual speaker extraction. Additionally, the model's performance in extremely noisy environments or with highly occluded gestures remains untested, which could limit its applicability in practical situations. The complexity of the model may also pose challenges in terms of computational efficiency and real-time processing.
The integration of gesture cues into audio-visual speaker extraction has significant implications for fields such as human-robot interaction, assistive technologies, and communication aids for the hearing impaired. By enhancing the robustness of speaker extraction in challenging conditions, this work could lead to improved systems for automatic speech recognition and other applications that rely on clear audio signals in multi-talker environments. The main contribution of this paper is the SeLG model, which effectively integrates lip and gesture cues for robust audio-visual speaker extraction, significantly improving performance in scenarios with missing modalities. This work represents a meaningful advancement in the field, addressing limitations of existing lip-centric approaches and providing a framework that could be applied in various practical applications.
Conformer and Mamba have achieved strong performance in speech modeling but face limitations in speaker diarization. Mamba is efficient but struggles with local details and nonlinear patterns. Conformer's self-attention incurs high memory overhead for long speech sequences and may cause instability in long-range dependency modeling. These limitations are critical for diarization, which requires both precise modeling of local variations and robust speaker consistency over extended spans. To address these challenges, we first apply ConBiMamba to speaker diarization. We follow the Pyannote pipeline and propose the Dual-Strategy-Enhanced ConBiMamba neural speaker diarization system. ConBiMamba integrates the strengths of Conformer and Mamba, where Conformer's convolutional and feed-forward structures are utilized to improve local feature extraction. By replacing Conformer's self-attention with ExtBiMamba, ConBiMamba efficiently handles long audio sequences while alleviating the high memory cost of self-attention. Furthermore, to address the problem of higher DER around speaker change points, we introduce the Boundary-Enhanced Transition Loss to enhance the detection of speaker change points. We also propose Layer-wise Feature Aggregation to enhance the utilization of multi-layer representations. The system is evaluated on six diarization datasets and achieves state-of-the-art performance on four of them. The source code of our study is available at https://github.com/lz-hust/DSE-CBM.
Primary: Huazhong University of Science and Technology
All Institutions: Huazhong University of Science and Technology, Hubei Provincial Key Laboratory of Smart Internet Technology, School of Electronic Information and Communications
The main contribution of this paper is the development of the Dual-Strategy-Enhanced ConBiMamba for speaker diarization, which effectively combines local detail modeling and long-range dependency capture, achieving state-of-the-art performance across multiple datasets. This comprehensive analysis highlights the technical contributions and innovative methodologies that advance the field of speaker diarization.
The paper introduces a novel approach to speaker diarization by integrating Conformer and Mamba architectures into a new framework called ConBiMamba. The methodology is well-structured, addressing the limitations of existing models by proposing the Boundary-Enhanced Transition Loss and Layer-wise Feature Aggregation strategies. The auxiliary task of speaker change point detection is particularly innovative, as it enhances the model's ability to identify speaker transitions more accurately. The combination of these strategies demonstrates a thoughtful approach to tackling the complexities of diarization tasks.
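The Boundary-Enhanced Transition Loss is only named here; as a rough illustration of the underlying idea (not the authors' formulation — the window size and weights are assumptions), one can upweight the frame-level loss near speaker change points:

```python
import torch
import torch.nn.functional as F

def boundary_weighted_bce(logits, labels, window=5, boundary_weight=3.0):
    """Frame-level BCE for speaker activity (B, T, S) that upweights frames within
    `window` frames of any change in the label sequence (weights are assumptions)."""
    change = (labels[:, 1:, :] != labels[:, :-1, :]).any(dim=-1).float()   # (B, T-1) change indicators
    change = F.pad(change, (1, 0))                                         # align to frame index
    near_boundary = F.max_pool1d(change.unsqueeze(1), kernel_size=2 * window + 1,
                                 stride=1, padding=window).squeeze(1)      # dilate around changes
    weights = 1.0 + (boundary_weight - 1.0) * near_boundary                # (B, T) frame weights
    per_frame = F.binary_cross_entropy_with_logits(logits, labels.float(), reduction="none").mean(dim=-1)
    return (weights * per_frame).mean()

# Toy usage: batch of 2, 200 frames, 3 speakers.
logits = torch.randn(2, 200, 3)
labels = (torch.rand(2, 200, 3) > 0.7).long()
print(boundary_weighted_bce(logits, labels))
```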
The experiments are comprehensive, utilizing six diverse datasets to validate the proposed system's effectiveness. The results show that the proposed method achieves state-of-the-art performance on four out of six datasets, which is a significant contribution to the field. The detailed comparison with existing methods, including ablation studies, provides strong evidence for the effectiveness of the proposed enhancements. However, the paper could benefit from a more extensive discussion of the datasets' characteristics and their implications on the results.
The paper provides a clear description of the experimental setup, including training and inference procedures, which aids in reproducibility. The availability of the source code on GitHub further enhances the potential for other researchers to replicate the findings. However, the paper could improve by including more details on hyperparameter tuning and the specific configurations used for each dataset.
One limitation is the reliance on a fixed pre-trained WavLM model, which may not fully exploit the potential of the proposed system in all scenarios. Additionally, the paper acknowledges that the model does not explicitly handle overlapping speech, which could limit its performance in real-world applications where overlapping dialogue is common. Future work could explore these areas to enhance the model's robustness.
The proposed system has significant implications for applications in automated transcription, meeting summarization, and other areas where speaker diarization is critical. By improving the accuracy of speaker identification and change point detection, the research could enhance the usability of audio data in various fields, including education, business, and accessibility technologies. The main contribution of this paper is the development of the Dual-Strategy-Enhanced ConBiMamba for speaker diarization, which effectively combines local detail modeling and long-range dependency capture, achieving state-of-the-art performance across multiple datasets. This comprehensive analysis highlights the technical contributions and innovative methodologies that advance the field of speaker diarization.
Most universal sound extraction algorithms focus on isolating a target sound event from single-channel audio mixtures. However, the real world is three-dimensional, and binaural audio, which mimics human hearing, can capture richer spatial information, including sound source location. This spatial context is crucial for understanding and modeling complex auditory scenes, as it inherently informs sound detection and extraction. In this work, we propose a language-driven universal sound extraction network that isolates text-described sound events from binaural mixtures by effectively leveraging the spatial cues present in binaural signals. Additionally, we jointly predict the direction of arrival (DoA) of the target sound using spatial features from the extraction network. This dual-task approach exploits complementary location information to improve extraction performance while enabling accurate DoA estimation. Experimental results on the in-the-wild AudioCaps dataset show that our proposed LuSeeL model significantly outperforms single-channel and uni-task baselines.
Primary: Alibaba Group
All Institutions: Alibaba Group, Tongyi Lab
The main contribution of this work is the introduction of LuSeeL, a language-driven framework for joint binaural sound extraction and localization, which significantly improves performance by leveraging spatial information and multimodal inputs. This research addresses critical challenges in audio scene understanding and paves the way for advanced applications in various domains.
The proposed methodology introduces a dual-task framework that integrates sound event extraction and localization using binaural audio, leveraging a hybrid transformer architecture. The use of a T5 text encoder for processing language queries is innovative, allowing for flexible and context-aware sound extraction. The architecture effectively combines temporal and spectral information, enhancing the model's ability to handle complex auditory scenes. The integration of spatial features through the GCC-PHAT encoder further strengthens the model's performance in localization tasks. Overall, the methodology is well-structured and addresses critical challenges in audio scene analysis.
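The GCC-PHAT features mentioned above can be sketched in a few lines of NumPy (framing, FFT size, and lag range are assumptions):

```python
import numpy as np

def gcc_phat(sig: np.ndarray, refsig: np.ndarray, n_fft: int = 1024, max_tau: int = 32):
    """PHAT-weighted generalized cross-correlation; the peak location estimates the
    delay (in samples) of `sig` relative to `refsig`."""
    S = np.fft.rfft(sig, n=n_fft)
    R = np.fft.rfft(refsig, n=n_fft)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12                           # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n_fft)
    cc = np.concatenate([cc[-max_tau:], cc[:max_tau + 1]])   # lags from -max_tau to +max_tau
    return cc

# Usage: a right channel delayed by 8 samples relative to the left should peak near lag +8.
left = np.random.randn(1024)
right = np.roll(left, 8)
cc = gcc_phat(right, left)
print(np.argmax(cc) - 32)  # estimated inter-channel delay in samples
```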
The experimental evaluation is robust, utilizing the AudioCaps dataset, which is appropriate for the tasks at hand. The results demonstrate significant improvements over baseline models in both sound extraction and localization tasks, with clear quantitative metrics (SI-SNRi, SDRi, accuracy, MAE) reported. The experiments also include ablation studies that effectively illustrate the contributions of each component of the model, validating the proposed dual-task approach. However, the paper could benefit from additional qualitative analysis of the results to provide deeper insights into model performance across different scenarios.
The paper provides a detailed description of the model architecture, training procedures, and loss functions, which aids in reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to replicate the results independently. Including such resources would enhance the paper's impact and facilitate further research in this area.
One limitation of the study is the reliance on the AudioCaps dataset, which, while large, may not encompass all possible real-world scenarios encountered in diverse auditory environments. Additionally, the model's performance in extremely noisy or highly overlapping sound conditions remains to be evaluated. The paper also does not address the computational complexity of the proposed model, which may limit its applicability in real-time applications.
The implications of this research are significant, particularly in fields such as augmented reality, robotics, and assistive technologies, where accurate sound localization and extraction can enhance user experience and system performance. The ability to process binaural audio with language queries opens new avenues for intuitive human-machine interaction, potentially leading to more immersive and responsive systems. The main contribution of this work is the introduction of LuSeeL, a language-driven framework for joint binaural sound extraction and localization, which significantly improves performance by leveraging spatial information and multimodal inputs. This research addresses critical challenges in audio scene understanding and paves the way for advanced applications in various domains.
Spatial audio is essential for immersive experiences, yet novel-view acoustic synthesis (NVAS) remains challenging due to complex physical phenomena such as reflection, diffraction, and material absorption. Existing methods based on single-view or panoramic inputs improve spatial fidelity but fail to capture global geometry and semantic cues such as object layout and material properties. To address this, we propose Phys-NVAS, the first physics-aware NVAS framework that integrates spatial geometry modeling with vision-language semantic priors. A global 3D acoustic environment is reconstructed from multi-view images and depth maps to estimate room size and shape, enhancing spatial awareness of sound propagation. Meanwhile, a vision-language model extracts physics-aware priors of objects, layouts, and materials, capturing absorption and reflection beyond geometry. An acoustic feature fusion adapter unifies these cues into a physics-aware representation for binaural generation. Experiments on RWAVS demonstrate that Phys-NVAS yields binaural audio with improved realism and physical consistency.
Primary: University of Technology Sydney
All Institutions: University of Technology Sydney, Harbin Engineering University, Nanjing University, University of Surrey
This paper presents a pioneering approach to novel-view acoustic synthesis by combining physics-aware modeling with vision-language priors, significantly advancing the state-of-the-art in spatial audio generation. The methodology is robust, and the experimental results demonstrate its effectiveness, marking a meaningful contribution to the field of machine learning and audio processing.
The proposed Phys-NVAS framework innovatively integrates 3D acoustic environment modeling with vision-language semantic priors, addressing the limitations of existing NVAS methods that rely solely on visual inputs. The use of multi-view images and depth maps to reconstruct the acoustic environment is a significant advancement, as it enhances spatial awareness and captures complex acoustic phenomena such as reflection and absorption. The introduction of a physics-aware vision-language model to extract semantic cues is particularly noteworthy, as it allows for a more nuanced understanding of how different materials and layouts affect sound propagation. The methodology is well-structured, with clear steps for feature extraction and fusion, culminating in a robust binaural audio generation process.
The experiments conducted on the RWAVS dataset are comprehensive and demonstrate the effectiveness of the Phys-NVAS framework. The authors provide a thorough comparison with baseline methods, showcasing significant improvements in both magnitude distance (MAG) and envelope distance (ENV) metrics. The use of ablation studies to evaluate the contribution of different feature sources adds rigor to the experimental evaluation, confirming the complementary nature of geometric and semantic features in enhancing audio synthesis. However, the paper could benefit from additional qualitative assessments or user studies to further validate the perceptual improvements claimed.
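The magnitude (MAG) and envelope (ENV) distances cited above are typically computed as spectral and Hilbert-envelope errors between predicted and reference binaural audio; the sketch below follows those common definitions, with the STFT settings and function names chosen as assumptions rather than the paper's exact protocol.

```python
import numpy as np
from scipy.signal import stft, hilbert

def mag_distance(pred: np.ndarray, ref: np.ndarray, fs: int = 16000) -> float:
    """Mean squared distance between STFT magnitudes, averaged over channels.

    pred, ref: arrays of shape (channels, samples) with matching lengths.
    """
    dist = 0.0
    for c in range(pred.shape[0]):
        _, _, P = stft(pred[c], fs=fs, nperseg=512)
        _, _, R = stft(ref[c], fs=fs, nperseg=512)
        dist += np.mean((np.abs(P) - np.abs(R)) ** 2)
    return float(dist / pred.shape[0])

def env_distance(pred: np.ndarray, ref: np.ndarray) -> float:
    """Mean squared distance between Hilbert amplitude envelopes per channel."""
    env_p = np.abs(hilbert(pred, axis=-1))
    env_r = np.abs(hilbert(ref, axis=-1))
    return float(np.mean((env_p - env_r) ** 2))
```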
The paper provides a clear description of the methodologies and models used, including specific architectures (e.g., ResNet-18, CLIP, BERT) and the RWAVS dataset. However, the lack of detailed implementation specifics, such as hyperparameters and training configurations, may hinder full reproducibility. Providing access to the code or a supplementary material section with these details would greatly enhance reproducibility.
One limitation of the study is the reliance on the RWAVS dataset, which may not encompass all possible acoustic environments or scenarios. Additionally, while the integration of vision-language priors is innovative, the performance may vary with different types of scenes or materials not represented in the dataset. The framework's complexity could also pose challenges in real-time applications, which are critical for immersive experiences in AR/VR.
The Phys-NVAS framework has significant potential applications in immersive audio experiences, particularly in AR/VR, gaming, and interactive media. By improving the realism and physical consistency of synthesized audio, it can enhance user engagement and presence in virtual environments. The integration of physics-aware models could also inspire further research in multi-modal learning and acoustic modeling, potentially leading to advancements in related fields such as robotics and autonomous systems. This paper presents a pioneering approach to novel-view acoustic synthesis by combining physics-aware modeling with vision-language priors, significantly advancing the state-of-the-art in spatial audio generation. The methodology is robust, and the experimental results demonstrate its effectiveness, marking a meaningful contribution to the field of machine learning and audio processing.
Performance evaluation remains a complex challenge in audio separation: existing evaluation metrics are often misaligned with human perception, coarse-grained, and reliant on ground-truth signals. Subjective listening tests remain the gold standard for real-world evaluation, but they are expensive, time-consuming, and difficult to scale. This paper addresses the growing need for automated systems capable of evaluating audio separation without human intervention. The proposed evaluation metric, SAM Audio Judge (SAJ), is a multimodal, fine-grained, reference-free objective metric that shows high alignment with human perception. SAJ supports three audio domains (speech, music, and general sound events) and three prompt inputs (text, visual, and span), covering four evaluation dimensions (recall, precision, faithfulness, and overall). SAM Audio Judge also shows potential applications in data filtering, pseudo-labeling large datasets, and reranking in audio separation models. We release our code and pre-trained models at: https://github.com/facebookresearch/sam-audio.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Meta
The main contribution of this paper is the introduction of SAM Audio Judge, a unified multimodal framework that offers a reference-free evaluation metric for audio separation, promising to enhance the alignment of automated evaluations with human perception. The proposed methodology and its potential applications signify a meaningful advancement in the field of audio processing, although further empirical validation and detailed methodological exposition are needed to fully realize its impact.
The paper introduces the SAM Audio Judge (SAJ), a novel multimodal framework for evaluating audio separation that does not rely on ground truth signals. This approach is significant as it aligns more closely with human perception, addressing a critical gap in existing evaluation metrics. The methodology is well-structured, incorporating multiple audio domains and prompt inputs, which enhances its applicability. However, the specifics of the algorithmic implementation and the underlying architecture could be elaborated further to provide deeper insights into its effectiveness.
The authors claim that SAJ demonstrates a high alignment with human perceptions through extensive evaluations across three audio domains and various input prompts. However, the paper lacks detailed experimental results, including quantitative metrics and comparisons with baseline methods, which would strengthen the claims made. The absence of a comprehensive evaluation section limits the ability to fully assess the performance of the proposed metric.
The authors have made their code and pre-trained models publicly available, which is a positive aspect for reproducibility. However, the paper would benefit from a more detailed description of the experimental setup, including data preprocessing steps, model training parameters, and evaluation protocols, to ensure that other researchers can replicate the results effectively.
One notable limitation is the reliance on subjective human perception as a benchmark, which, while valuable, can introduce variability and bias. Additionally, the paper does not address potential challenges in scaling the proposed metric for larger datasets or real-world applications. Furthermore, the generalizability of the metric across diverse audio contexts remains untested.
The SAM Audio Judge has significant potential applications in the field of audio processing, particularly in automating the evaluation of audio separation systems. This could lead to more efficient development cycles for audio technologies and improve the quality of audio content in various domains, such as music production, speech recognition, and sound event detection. The framework could also facilitate advancements in machine learning by providing a robust evaluation tool that aligns with human auditory perception. The main contribution of this paper is the introduction of SAM Audio Judge, a unified multimodal framework that offers a reference-free evaluation metric for audio separation, promising to enhance the alignment of automated evaluations with human perception. The proposed methodology and its potential applications signify a meaningful advancement in the field of audio processing, although further empirical validation and detailed methodological exposition are needed to fully realize its impact.
Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a major challenge. While some approaches achieve strong performance when fine-tuned on specific domains, few systems generalize well across out-of-domain datasets. Our prior work, Diarization-Conditioned Whisper (DiCoW), leverages speaker diarization outputs as conditioning information and, with minimal fine-tuning, demonstrated strong multilingual and multi-domain performance. In this paper, we address a key limitation of DiCoW: ambiguity in Silence-Target-Non-target-Overlap (STNO) masks, where two or more fully overlapping speakers may have nearly identical conditioning despite differing transcriptions. We introduce SE-DiCoW (Self-Enrolled Diarization-Conditioned Whisper), which uses diarization output to locate an enrollment segment anywhere in the conversation where the target speaker is most active. This enrollment segment is used as fixed conditioning via cross-attention at each encoder layer. We further refine DiCoW with improved data segmentation, model initialization, and augmentation. Together, these advances yield substantial gains: SE-DiCoW reduces macro-averaged tcpWER by 52.4% relative to the original DiCoW on the EMMA MT-ASR benchmark.
Primary: BUT FIT
All Institutions: BUT FIT, Johns Hopkins University
The main contribution of this paper is the introduction of SE-DiCoW, which effectively resolves speaker disambiguation issues in multi-speaker ASR through a novel self-enrollment mechanism. This work represents a meaningful advancement in the field of automatic speech recognition, particularly in challenging environments with overlapping speech.
The proposed SE-DiCoW framework builds on the DiCoW architecture by introducing a self-enrollment mechanism that effectively addresses the ambiguity in STNO masks during overlapping speech. The methodology is well-structured, leveraging cross-attention to incorporate speaker-specific segments dynamically, which is a significant advancement over previous models. The enhancements in data segmentation, model initialization, and data augmentation further strengthen the approach, making it robust for real-world applications.
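As an illustration of what "fixed conditioning via cross-attention at each encoder layer" might look like, the sketch below adds a cross-attention over a precomputed enrollment-segment embedding inside one encoder block; the dimensions, normalization placement, and class name are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class EnrollmentConditionedLayer(nn.Module):
    """One encoder layer with an extra cross-attention over a fixed enrollment segment.

    The enrollment embeddings act as keys/values, so every frame can attend to the
    region of the conversation where the target speaker is most active.
    """
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x: torch.Tensor, enroll: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model); enroll: (batch, enroll_frames, d_model)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), enroll, enroll)[0]
        return x + self.ffn(self.norm3(x))
```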
The experiments are comprehensive, utilizing multiple datasets and benchmarks to evaluate the performance of SE-DiCoW. The reported results demonstrate substantial improvements in tcpWER across various configurations, particularly in challenging multi-speaker scenarios. The comparison with oracle diarization and state-of-the-art systems provides a clear context for the effectiveness of the proposed method.
The paper provides sufficient implementation details, including model architecture, training protocols, and data augmentation strategies, which facilitate reproducibility. The availability of code repositories enhances the likelihood that other researchers can replicate the results.
While the SE-DiCoW shows significant improvements, the performance degradation observed with real diarization suggests that the model's effectiveness is still contingent on the quality of the diarization output. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other domains or languages.
The advancements in speaker-attributed ASR have significant implications for applications in meetings, interviews, and other multi-party conversations, where accurate transcription and speaker identification are critical. The potential for improved performance in real-world scenarios can enhance accessibility and usability of ASR technologies. The main contribution of this paper is the introduction of SE-DiCoW, which effectively resolves speaker disambiguation issues in multi-speaker ASR through a novel self-enrollment mechanism. This work represents a meaningful advancement in the field of automatic speech recognition, particularly in challenging environments with overlapping speech.
Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: https://vcbsl.github.io/VoxMorph/
Primary: University of North Texas
All Institutions: University of North Texas
The main contribution of this paper is the introduction of VoxMorph, a zero-shot voice identity morphing framework that significantly enhances the quality and scalability of voice morphing techniques. This work not only advances the state-of-the-art in voice morphing but also raises important considerations for biometric security and the potential for misuse in generating deepfakes. The methodology is innovative, the experimental validation is thorough, and the implications for the field are profound.
The proposed VoxMorph framework introduces a novel approach to voice identity morphing by disentangling vocal features into prosody and timbre embeddings. This separation allows for independent manipulation of these features, which is a significant advancement over previous methods that relied on monolithic embeddings. The use of Spherical Linear Interpolation (Slerp) for embedding fusion is a clever choice that preserves the geometric structure of the embeddings, leading to more natural morphs. The three-stage synthesis pipeline, which includes an autoregressive language model and a Conditional Flow Matching network, is well-structured and effectively leverages modern TTS architectures. Overall, the methodology is sound and innovative, addressing critical limitations in existing voice morphing techniques.
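Spherical linear interpolation itself is standard and easy to make concrete; the sketch below interpolates two embedding vectors along the great circle between their directions. The embedding dimensionality and the 50/50 morph are illustrative assumptions.

```python
import numpy as np

def slerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two embedding vectors.

    Unlike linear interpolation, slerp moves along the great circle defined by the
    normalized embeddings, preserving their geometric (angular) structure.
    """
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if omega < 1e-6:                       # nearly parallel: fall back to lerp
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

# A 50/50 morph of two speakers' timbre embeddings (shapes are illustrative).
spk_a, spk_b = np.random.randn(256), np.random.randn(256)
morph = slerp(spk_a, spk_b, t=0.5)
```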
The experiments conducted are rigorous and comprehensive, utilizing a well-defined dataset (Librispeech) and a variety of evaluation metrics including Fréchet Audio Distance (FAD), Kullback-Leibler Divergence (KLD), and Word Error Rate (WER). The results demonstrate substantial improvements over baseline methods, with clear metrics supporting the claims of enhanced audio quality, intelligibility, and morphing effectiveness. The use of both single and multi-clip variants for evaluation adds robustness to the findings.
The paper provides sufficient detail regarding the implementation, including the datasets used, the architecture of the models, and the evaluation metrics. The availability of the code and dataset on the project page enhances reproducibility, allowing other researchers to validate the findings and build upon this work.
While the framework shows significant advancements, it is still limited to morphing between two identities at a time. The paper does not address the potential challenges of morphing more than two identities, which could be a valuable extension. Additionally, the reliance on a specific dataset may limit generalizability to other languages or dialects, and the performance in real-world scenarios outside the controlled experimental setup remains to be evaluated.
The implications of this work are substantial, particularly in the context of biometric security. By establishing a scalable method for voice morphing, VoxMorph poses a new threat to Automatic Speaker Verification systems, necessitating the development of more robust morphing attack detection systems. Furthermore, the framework could have applications in personalized voice synthesis, entertainment, and accessibility technologies, making it a versatile contribution to the field. The main contribution of this paper is the introduction of VoxMorph, a zero-shot voice identity morphing framework that significantly enhances the quality and scalability of voice morphing techniques. This work not only advances the state-of-the-art in voice morphing but also raises important considerations for biometric security and the potential for misuse in generating deepfakes. The methodology is innovative, the experimental validation is thorough, and the implications for the field are profound.
Speaker embedding learning based on Euclidean space has achieved significant progress, but it remains insufficient for modeling hierarchical information within speaker features. Hyperbolic space, with its negative-curvature geometry, can efficiently represent hierarchical information within a finite volume, making it more suitable for the feature distribution of speaker embeddings. In this paper, we propose Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax) based on hyperbolic space. H-Softmax incorporates hierarchical information into speaker embeddings by projecting embeddings and speaker centers into hyperbolic space and computing hyperbolic distances. HAM-Softmax further enhances inter-class separability by additionally introducing a margin constraint. Experimental results show that H-Softmax and HAM-Softmax achieve average relative EER reductions of 27.84% and 14.23% compared with standard Softmax and AM-Softmax, respectively, demonstrating that the proposed methods effectively improve speaker verification performance while preserving the ability to model hierarchical structure. The code will be released at https://github.com/PunkMale/HAM-Softmax.
Primary: Xinjiang University
All Institutions: Xinjiang University, School of Computer Science and Technology, School of Intelligence Science and Technology, Xinjiang Multimodal Information Technology Engineering Research Center
The main contribution of this paper is the introduction of H-Softmax and HAM-Softmax, which effectively utilize hyperbolic space to enhance the modeling of hierarchical information in speaker embeddings, leading to improved performance in speaker verification tasks. This work presents a significant advancement in the field, combining innovative methodology with rigorous experimental validation.
The paper introduces two novel loss functions, Hyperbolic Softmax (H-Softmax) and Hyperbolic Additive Margin Softmax (HAM-Softmax), which leverage hyperbolic geometry to enhance speaker embedding learning. The methodology is well-structured, utilizing hyperbolic space to model hierarchical information effectively. The use of the Poincaré ball model is appropriate for the task, and the introduction of margin constraints in HAM-Softmax is a thoughtful extension that addresses inter-class separability. The mathematical formulations are clearly presented, and the rationale for using hyperbolic space is well-justified.
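For readers unfamiliar with hyperbolic distance, the sketch below computes the Poincaré-ball geodesic distance between embeddings and class centers and uses its negative as softmax logits, which is the generic shape such a loss takes; the fixed curvature of -1, the clamping, and the toy shapes are assumptions, not the authors' exact formulation.

```python
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Geodesic distance in the Poincare ball (curvature -1).

    x: (batch, dim) embeddings, y: (classes, dim) speaker centers, both assumed to
    lie strictly inside the unit ball. Returns a (batch, classes) distance matrix.
    """
    x2 = x.pow(2).sum(-1, keepdim=True)            # (batch, 1)
    y2 = y.pow(2).sum(-1, keepdim=True).T          # (1, classes)
    sq = torch.cdist(x, y).pow(2)                  # squared Euclidean distances
    den = (1.0 - x2).clamp_min(eps) * (1.0 - y2).clamp_min(eps)
    return torch.acosh(1.0 + 2.0 * sq / den + eps)

# Hyperbolic-softmax-style logits: negative distances to class centers.
emb = torch.rand(4, 8) * 0.3                       # toy embeddings inside the ball
centers = torch.rand(10, 8) * 0.3
logits = -poincare_distance(emb, centers)
loss = torch.nn.functional.cross_entropy(logits, torch.randint(0, 10, (4,)))
```

A margin variant in the spirit of HAM-Softmax would add a constant to the distance of the target class before the softmax, enlarging inter-class gaps.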
The experimental setup is robust, utilizing large datasets (VoxCeleb1, VoxCeleb2, and CNCeleb) that are relevant for speaker verification tasks. The results demonstrate significant improvements in performance metrics (EER and minDCF) compared to standard Softmax and other margin-based methods. The paper includes comparisons with multiple baselines, which strengthens the validity of the findings. The ablation studies provide valuable insights into the effects of curvature and margin settings on performance.
The paper provides sufficient details about the experimental setup, including dataset descriptions, model architecture, training parameters, and evaluation metrics. However, the absence of a demo or interactive visualization limits immediate reproducibility. The authors mention that code will be made available, which is crucial for facilitating further research.
While the proposed methods show promise, the paper does not extensively discuss potential limitations, such as the computational complexity of hyperbolic operations compared to their Euclidean counterparts. Additionally, the impact of hyperparameter tuning (e.g., margin and curvature) on different datasets could be explored further.
The findings have significant implications for speaker verification systems, particularly in applications requiring high accuracy and robustness against variations in speaker characteristics. The ability to model hierarchical information could also extend to other domains where similar structures exist, such as natural language processing and image classification. The main contribution of this paper is the introduction of H-Softmax and HAM-Softmax, which effectively utilize hyperbolic space to enhance the modeling of hierarchical information in speaker embeddings, leading to improved performance in speaker verification tasks. This work presents a significant advancement in the field, combining innovative methodology with rigorous experimental validation.
Neural audio codecs provide promising acoustic features for speech synthesis, with representative streaming codecs like Mimi delivering high-quality features for real-time Text-to-Speech (TTS) applications. However, Mimi's decoder, which employs a hybrid transformer and convolution architecture, introduces significant latency bottlenecks on edge devices due to the compute-intensive nature of its deconvolution layers, which are poorly suited to mobile-CPU inference frameworks such as XNNPACK. This paper introduces T-Mimi, a novel modification of the Mimi codec decoder that replaces its convolutional components with a purely transformer-based decoder, inspired by the TS3-Codec architecture. This change dramatically reduces on-device TTS latency from 42.1ms to just 4.4ms. Furthermore, we conduct quantization-aware training and derive a crucial finding: the final two transformer layers and the concluding linear layers of the decoder, which are close to the waveform, are highly sensitive to quantization and must be preserved at full precision to maintain audio quality.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of T-Mimi, a transformer-based decoder that significantly reduces latency in on-device TTS applications. This work represents a meaningful advancement in the field of speech synthesis, particularly for mobile platforms, and addresses critical challenges related to latency and model efficiency.
The methodology presented in T-Mimi is innovative as it replaces the convolutional components of the existing Mimi codec with a purely transformer-based architecture. This approach is timely, given the increasing demand for low-latency TTS systems on mobile devices. The authors effectively draw inspiration from the TS3-Codec architecture, which adds to the credibility of their design choices. The paper also incorporates quantization-aware training, which is a relevant consideration for deploying models on edge devices. However, the paper could benefit from a more detailed description of the transformer architecture used and how it was specifically adapted for the TTS task.
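One way to realize "quantize everything except the waveform-adjacent layers" is to fake-quantize weights during training while exempting a named set of modules; the sketch below is a generic straight-through-estimator version of that idea, with the int8 scheme, the module-name matching, and all names being assumptions rather than the paper's setup.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized to int8 in the forward pass,
    using a straight-through estimator so gradients still flow to the weights."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.weight.abs().max() / 127.0 + 1e-12
        q = torch.clamp(torch.round(self.weight / scale), -128, 127) * scale
        w = self.weight + (q - self.weight).detach()   # forward: quantized, backward: identity
        return nn.functional.linear(x, w, self.bias)

def quantize_all_but(model: nn.Module, keep_fp: set) -> None:
    """Swap nn.Linear -> FakeQuantLinear except for child names listed in keep_fp
    (e.g. the last transformer blocks and the final projection). Names here are
    local child names, which is a simplification for the sketch."""
    for name, module in model.named_children():
        if isinstance(module, nn.Linear) and name not in keep_fp:
            fq = FakeQuantLinear(module.in_features, module.out_features,
                                 bias=module.bias is not None)
            fq.load_state_dict(module.state_dict())
            setattr(model, name, fq)
        else:
            quantize_all_but(module, keep_fp)
```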
The experiments conducted demonstrate a significant reduction in latency, from 42.1ms to 4.4ms, which is a substantial improvement for real-time applications. The evaluation metrics used to assess audio quality and latency are appropriate, and the results are clearly presented. However, additional comparisons with other state-of-the-art methods would strengthen the findings and provide a clearer context for the improvements achieved. The sensitivity analysis regarding quantization is a valuable addition, although more quantitative data on audio quality degradation with quantization could enhance the robustness of the claims.
The paper lacks sufficient implementation details that would facilitate reproducibility. There are no links to code repositories or datasets used for training and evaluation, which is a significant limitation for the research community. Providing these resources would allow others to validate the findings and build upon the work.
One limitation is the lack of detailed comparisons with other existing TTS systems, which would help contextualize the performance gains of T-Mimi. Additionally, while the paper mentions the sensitivity of the final layers to quantization, it does not explore the implications of this finding in depth or provide a comprehensive analysis of how this affects overall model performance in various scenarios. The absence of a user study or subjective listening tests is also a notable gap.
The development of T-Mimi has the potential to significantly impact the field of real-time speech synthesis, particularly for mobile applications where latency is critical. By improving the efficiency of TTS systems, this work could enhance user experiences in various applications, including virtual assistants, accessibility tools, and gaming. The findings regarding quantization also contribute to the broader discourse on model deployment in resource-constrained environments. The main contribution of this paper is the introduction of T-Mimi, a transformer-based decoder that significantly reduces latency in on-device TTS applications. This work represents a meaningful advancement in the field of speech synthesis, particularly for mobile platforms, and addresses critical challenges related to latency and model efficiency.
As Speech Language Models (SLMs) transition from personal devices to shared, multi-user environments such as smart homes, a new challenge emerges: the model is expected to distinguish between users to manage information flow appropriately. Without this capability, an SLM could reveal one user's confidential schedule to another, a privacy failure we term interactional privacy. Thus, the ability to generate speaker-aware responses becomes essential for SLM safe deployment. Current SLM benchmarks test dialogue ability but overlook speaker identity. Multi-speaker benchmarks check who said what without assessing whether SLMs adapt their responses. Privacy benchmarks focus on globally sensitive data (e.g., bank passwords) while neglecting contextual privacy-sensitive information (e.g., a user's private appointment). To address this gap, we introduce VoxPrivacy, the first benchmark designed to evaluate interactional privacy in SLMs. VoxPrivacy spans three tiers of increasing difficulty, from following direct secrecy commands to proactively protecting privacy. Our evaluation of nine SLMs on a 32-hour bilingual dataset reveals a widespread vulnerability: most open-source models perform close to random chance (around 50% accuracy) on conditional privacy decisions, while even strong closed-source systems fall short on proactive privacy inference. We further validate these findings on Real-VoxPrivacy, a human-recorded subset, confirming that failures observed on synthetic data persist in real speech. Finally, we demonstrate a viable path forward: by fine-tuning on a new 4,000-hour training set, we improve privacy-preserving abilities while maintaining robustness. To support future work, we release the VoxPrivacy benchmark, the large-scale training set, and the fine-tuned model to foster the development of safer and more context-aware SLMs.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the VoxPrivacy benchmark, which evaluates interactional privacy in Speech Language Models, highlighting significant vulnerabilities in existing models and providing a pathway for future improvements. This work is significant as it addresses a pressing issue in the deployment of AI technologies in shared environments, ensuring that user privacy is prioritized.
The methodology is robust, introducing the VoxPrivacy benchmark with a clear structure that spans three tiers of increasing complexity. The authors effectively identify the gap in existing benchmarks by focusing on interactional privacy, which is a novel and critical aspect of SLMs. The use of both synthetic and real data for validation strengthens the methodology, although the reliance on synthetic data for the primary benchmark may raise questions about real-world applicability.
The experimental evaluation is thorough, involving nine different SLMs and a substantial bilingual dataset. The results reveal significant vulnerabilities in existing models, which is a crucial finding for the field. The authors provide a clear analysis of performance metrics, demonstrating the inadequacies of current models in handling privacy-sensitive information. The fine-tuning on a larger dataset to improve model performance is a commendable step that showcases practical implications.
The paper emphasizes reproducibility by making all key assets, methodologies, and datasets publicly available. The detailed descriptions of the dataset construction process and experimental configurations enhance the reproducibility of the research. However, the effectiveness of the fine-tuning process could benefit from more detailed reporting on hyperparameters and training conditions.
The primary limitation is the potential over-reliance on synthetic data, which may not fully capture the complexities of real-world interactions and privacy concerns. Additionally, while the benchmark is a significant step forward, the authors acknowledge that it may not cover all aspects of interactional privacy, suggesting that further work is needed in this area.
The work has significant implications for the deployment of SLMs in shared environments, addressing a critical gap in privacy management. By providing a benchmark and resources for improving interactional privacy, the authors contribute to the development of safer AI systems. This research could influence future designs of SLMs, promoting user trust and broader acceptance in multi-user contexts. The main contribution of this paper is the introduction of the VoxPrivacy benchmark, which evaluates interactional privacy in Speech Language Models, highlighting significant vulnerabilities in existing models and providing a pathway for future improvements. This work is significant as it addresses a pressing issue in the deployment of AI technologies in shared environments, ensuring that user privacy is prioritized.
Speech separation (SS) has advanced significantly with neural network-based methods, showing improved performance on signal-level metrics. However, these methods often struggle to maintain speech intelligibility in the separated signals, which can negatively affect the performance of downstream tasks such as speech recognition. In this work, we propose SLM-SS, a novel approach that applies speech language models to SS, aiming to enhance the intelligibility and coherence of the separated signals. We frame SS as discrete multi-codebook sequence generation, using Encoder-Decoder models to map quantized speech mixtures to target tokens. In addition to the autoregressive modeling strategy, we introduce a non-autoregressive model to improve decoding efficiency for residual tokens. Experimental results on the LibriMix dataset demonstrate that our approach shows significantly better preservation of speech intelligibility, leading to improved linguistic consistency in a variety of downstream tasks compared to existing approaches.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, VUI Labs, iFLYTEK Company Limited
The main contribution of this paper is the introduction of SLM-SS, a novel speech separation framework that leverages speech language models to enhance the intelligibility and coherence of separated speech signals. This work represents a meaningful advancement in the field of speech processing, combining innovative methodologies with robust experimental validation to address a critical challenge in speech separation.
The proposed SLM-SS framework innovatively integrates speech language models with a discrete multi-codebook sequence generation approach. The use of both autoregressive and non-autoregressive models for decoding is a notable advancement, as it addresses efficiency while maintaining speech intelligibility. The methodology is well-structured, employing established techniques like Encodec for quantization and Serialized Output Training (SOT) for sequence concatenation, which enhances the overall robustness of the approach. However, the paper could benefit from a more detailed explanation of the training process and hyperparameter tuning.
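To illustrate the serialized-output idea on discrete codec tokens, the sketch below concatenates each speaker's token sequence with a speaker-change token and an end token, which is the typical SOT-style target construction; the token ids and the restriction to a single codebook are simplifying assumptions, not the paper's configuration.

```python
import torch

SPK_CHANGE = 1024   # assumed id just past a 1024-entry codebook (illustrative)
EOS = 1025          # assumed end-of-sequence id (illustrative)

def serialize_targets(per_speaker_tokens: list) -> torch.Tensor:
    """SOT-style target: concatenate each speaker's codec token sequence,
    separated by a speaker-change token and terminated by EOS."""
    pieces = []
    for i, toks in enumerate(per_speaker_tokens):
        if i > 0:
            pieces.append(torch.tensor([SPK_CHANGE]))
        pieces.append(toks)
    pieces.append(torch.tensor([EOS]))
    return torch.cat(pieces)

# Two speakers' tokens (random stand-ins for first-codebook indices from a codec).
spk1 = torch.randint(0, 1024, (75,))
spk2 = torch.randint(0, 1024, (60,))
target = serialize_targets([spk1, spk2])   # length 75 + 60 + 2
```

Residual codebooks, as described, would then be filled in non-autoregressively rather than appended to this sequence.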
The experiments conducted on the LibriMix dataset are comprehensive, comparing SLM-SS against established baselines such as BSRNN and Sepformer. The metrics used, including subjective listening tests and various objective measures (WER, LPS, SBS), provide a well-rounded evaluation of the model's performance. The results indicate a significant improvement in speech intelligibility and coherence, supporting the authors' claims. However, the paper lacks a deeper discussion on the statistical significance of the results and potential variability across different datasets.
While the paper outlines the model architecture and training setup, it lacks specific implementation details that would facilitate reproducibility. Key aspects such as the exact training data splits, preprocessing steps, and the code for the models are not provided, which are critical for other researchers to replicate the findings. Including a link to a code repository would enhance reproducibility.
The paper acknowledges the limitations related to the number of codebooks and the associated computational costs. It also notes that the model's performance can degrade with fewer codebooks, which may limit its applicability in real-world scenarios where computational resources are constrained. Furthermore, the subjective nature of some evaluation metrics introduces variability that could affect the reliability of the results.
The proposed method has significant implications for various applications in speech processing, including automatic speech recognition, speaker identification, and assistive technologies for hearing-impaired individuals. By improving speech intelligibility in separated signals, SLM-SS could enhance user experience in real-time communication systems and contribute to advancements in AI-driven speech technologies. The main contribution of this paper is the introduction of SLM-SS, a novel speech separation framework that leverages speech language models to enhance the intelligibility and coherence of separated speech signals. This work represents a meaningful advancement in the field of speech processing, combining innovative methodologies with robust experimental validation to address a critical challenge in speech separation.
Universal speech enhancement aims to handle inputs with various speech distortions and recording conditions. In this work, we propose a novel hybrid architecture that synergizes the signal fidelity of discriminative modeling with the reconstruction capabilities of generative modeling. Our system utilizes the discriminative TF-GridNet model with the Sampling-Frequency-Independent strategy to handle variable sampling rates universally. In parallel, an autoregressive model combined with spectral mapping modeling generates detail-rich speech while effectively suppressing generative artifacts. Finally, a fusion network learns adaptive weights of the two outputs under the optimization of signal-level losses and the comprehensive Speech Quality Assessment (SQA) loss. Our proposed system is evaluated in the ICASSP 2026 URGENT Challenge (Track 1), where it ranked third.
Primary: unknown
All Institutions: unknown
This paper presents a novel hybrid architecture for universal speech enhancement that combines the strengths of discriminative and generative models. The methodology is innovative, yet the limitations in reproducibility and real-time applicability suggest areas for further development.
The paper introduces a hybrid architecture that effectively combines discriminative and generative modeling approaches for universal speech enhancement. The use of the TF-GridNet model for the discriminative branch ensures high signal fidelity, while the autoregressive model with spectral mapping enhances the quality of generated speech. The fusion network that adaptively combines outputs from both branches is a notable innovation, allowing for the strengths of both methodologies to be leveraged. However, the paper lacks detailed descriptions of the architecture's complexity and the specific contributions of each component, which could be beneficial for understanding the overall effectiveness.
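The adaptive fusion of the two branch outputs can be sketched as a small network that predicts a per-time-frequency mixing weight; this is a generic illustration of output fusion, and the convolutional weight predictor, magnitude-domain mixing, and class name are assumptions rather than the authors' design.

```python
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    """Predicts a per-time-frequency weight for blending the discriminative and
    generative branch spectrograms (a generic sketch of learned output fusion)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),
        )

    def forward(self, disc_mag: torch.Tensor, gen_mag: torch.Tensor) -> torch.Tensor:
        # disc_mag, gen_mag: (batch, freq, time) magnitude spectrograms
        w = self.net(torch.stack([disc_mag, gen_mag], dim=1)).squeeze(1)
        return w * disc_mag + (1.0 - w) * gen_mag
```

Training such a module against both signal-level and quality-assessment losses, as the paper describes, lets it favor the generative branch where the discriminative output is over-suppressed and vice versa.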
The evaluation of the proposed system in the ICASSP 2026 URGENT Challenge, where it achieved third place, indicates a strong performance. The experiments are well-structured, utilizing a substantial dataset of 1.3 million clean speech utterances across multiple languages. The results presented in terms of various intrusive and non-intrusive metrics demonstrate the hybrid model's effectiveness, particularly in maintaining competitive fidelity while enhancing perceptual quality. However, the absence of comparisons with more baseline models could limit the contextual understanding of its performance.
The paper provides some implementation details, such as the architecture specifications and loss functions used for training. However, it lacks sufficient information about the training process, hyperparameter tuning, and the specific datasets used, which are crucial for reproducibility. The absence of code or a project URL further complicates the ability to replicate the results.
The generative branch is limited to processing speech at 16 kHz, which may restrict its applicability in scenarios requiring full-band processing. Additionally, the high inference latency mentioned could hinder real-time applications, which is a significant limitation for practical deployment. The paper also does not address potential issues related to the generalization of the model across diverse speech conditions beyond those encountered in the challenge.
The proposed hybrid system has the potential to significantly improve applications in speech recognition, telecommunications, and assistive technologies for hearing-impaired individuals. By enhancing the quality of degraded speech, it could lead to better communication experiences in various environments. However, the limitations regarding processing speed and frequency range may restrict its immediate applicability in real-time systems. This paper presents a novel hybrid architecture for universal speech enhancement that combines the strengths of discriminative and generative models. The methodology is innovative, yet the limitations in reproducibility and real-time applicability suggest areas for further development.
Explainable AI (XAI) is commonly applied to anomalous sound detection (ASD) models to identify which time-frequency regions of an audio signal contribute to an anomaly decision. However, most audio explanations rely on qualitative inspection of saliency maps, leaving open the question of whether these attributions accurately reflect the spectral cues the model uses. In this work, we introduce a new quantitative framework for evaluating XAI faithfulness in machine-sound analysis by directly linking attribution relevance to model behaviour through systematic frequency-band removal. This approach provides an objective measure of whether an XAI method for machine ASD correctly identifies frequency regions that influence an ASD model's predictions. By using four widely adopted methods, namely Integrated Gradients, Occlusion, Grad-CAM and SmoothGrad, we show that XAI techniques differ in reliability, with Occlusion demonstrating the strongest alignment with true model sensitivity and gradient-based methods often failing to accurately capture spectral dependencies. The proposed framework offers a reproducible way to benchmark audio explanations and enables more trustworthy interpretation of spectrogram-based ASD systems.
Primary: Loughborough University
All Institutions: Loughborough University, Royal Air Force Rapid Capabilities Office, Defence Science and Technology Laboratory
This paper presents a comprehensive framework for evaluating the faithfulness of XAI methods in machine anomalous sound detection, significantly advancing the interpretability of audio-based machine learning models. The rigorous methodology and detailed experimental evaluation contribute to a deeper understanding of model behavior, with potential applications in safety-critical industrial settings.
The paper introduces a novel quantitative framework for evaluating the faithfulness of explainable AI (XAI) methods in the context of machine anomalous sound detection (ASD). The methodology effectively links attribution relevance to model behavior through systematic frequency-band removal, allowing for a robust assessment of various XAI techniques. By employing four widely adopted methods (Integrated Gradients, Occlusion, Grad-CAM, and SmoothGrad), the authors provide a comparative analysis that highlights the strengths and weaknesses of each method in capturing model sensitivity to frequency bands. This rigorous approach is a significant advancement over the qualitative inspection methods that dominate the field.
The experiments are well-structured, utilizing the DCASE2023 Task 2 ASD dataset to evaluate the proposed framework. The paper conducts three main experiments: a qualitative comparison of XAI methods, a frequency-band importance analysis, and a faithfulness evaluation of the XAI methods. The results demonstrate clear differences in the performance of the XAI techniques, with Occlusion showing the strongest correlation with model behavior. The use of statistical measures such as Spearman's rank correlation coefficient enhances the rigor of the evaluation.
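The core of the faithfulness test can be sketched compactly: remove one frequency band at a time, measure how much the anomaly score changes, and rank-correlate those changes with the attribution mass assigned to each band. The band count, linear band edges, and function signature below are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np
from scipy.stats import spearmanr

def band_faithfulness(model_score, spec: np.ndarray, attribution: np.ndarray,
                      n_bands: int = 8) -> float:
    """Spearman correlation between per-band attribution mass and the score change
    caused by zeroing each frequency band; higher means a more faithful explanation.

    model_score: callable mapping a (freq, time) spectrogram to a scalar anomaly score.
    spec, attribution: (freq, time) arrays of equal shape.
    """
    base = model_score(spec)
    edges = np.linspace(0, spec.shape[0], n_bands + 1, dtype=int)
    relevance, impact = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        relevance.append(attribution[lo:hi].sum())
        ablated = spec.copy()
        ablated[lo:hi] = 0.0                      # remove this frequency band
        impact.append(abs(base - model_score(ablated)))
    rho, _ = spearmanr(relevance, impact)
    return float(rho)
```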
The paper provides sufficient details regarding the experimental setup, including the dataset, model architecture, and spectrogram extraction process. However, the absence of a publicly available code repository limits the reproducibility of the results. Future work should include sharing the implementation to facilitate validation by other researchers.
The study primarily focuses on frequency-based perturbations, neglecting temporal aspects that may also influence model sensitivity. Additionally, the use of linear frequency bands rather than perceptually informed scales could limit the applicability of the findings in practical settings. The evaluation framework is also tailored to a specific convolutional architecture, which may not generalize to other model types.
The findings have significant implications for the development of interpretable machine learning systems in safety-critical applications, such as industrial monitoring and maintenance. By providing a reliable method for assessing the faithfulness of XAI techniques, this work can help practitioners better understand model behavior and improve decision-making processes based on model predictions. The proposed framework could also inspire further research into XAI methods tailored for audio and other domains. This paper presents a comprehensive framework for evaluating the faithfulness of XAI methods in machine anomalous sound detection, significantly advancing the interpretability of audio-based machine learning models. The rigorous methodology and detailed experimental evaluation contribute to a deeper understanding of model behavior, with potential applications in safety-critical industrial settings.
Continual Learning (CL) in Automatic Speech Recognition (ASR) suffers from catastrophic forgetting when adapting to new tasks, domains, or speakers. A common strategy to mitigate this is to store a subset of past data in memory for rehearsal. However, rehearsal-based methods face key limitations: storing data is often costly, infeasible with pre-trained models, or restricted by privacy regulations. Running existing rehearsal-based methods with smaller memory sizes to alleviate these issues usually leads to degraded performance. We propose a rehearsal-based CL method that remains effective even with minimal memory. It operates in two stages: first, fine-tuning on the new task; second, applying Singular Value Decomposition (SVD) to the changes in linear layers and, in a parameter-efficient manner, retraining only gating vectors on the singular values, which control the extent to which updates from the first stage are accepted, using rehearsal. We extensively test and analyze our method on two monolingual and two multilingual benchmarks. Our method reduces forgetting and outperforms state-of-the-art CL approaches for ASR, even when limited to a single utterance per previous task.
Primary: KU Leuven
All Institutions: KU Leuven, IEEE Publication Technology Group
This paper presents a significant advancement in continual learning for ASR by introducing a memory-efficient rehearsal method that effectively balances learning new tasks with retaining knowledge from previous tasks. The methodology and experimental results collectively contribute to the ongoing efforts to improve the robustness and adaptability of machine learning models in dynamic settings.
The proposed methodology introduces a novel two-stage rehearsal-based continual learning approach that leverages Singular Value Decomposition (SVD) to manage updates in a parameter-efficient manner. The first stage involves fine-tuning on new tasks, while the second stage selectively retains updates based on their contribution to performance on both new and previous tasks. This method addresses the limitations of traditional rehearsal strategies, particularly in terms of memory efficiency, by allowing effective learning with minimal stored data. The introduction of gating vectors to control the influence of updates is a significant methodological innovation that enhances the model's adaptability.
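The second-stage reparameterization can be illustrated directly: take the SVD of the fine-tuning update to a linear layer and keep only a learnable gate per singular value. The class below is a minimal sketch of that idea, with the forward pass, initialization, and naming being assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class GatedSVDUpdate(nn.Module):
    """Reparameterize a fine-tuning update W_new - W_old through its SVD and learn
    only one gate per singular value, controlling how much of the new-task update
    is accepted during rehearsal (a sketch of the described second stage)."""
    def __init__(self, w_old: torch.Tensor, w_new: torch.Tensor):
        super().__init__()
        u, s, vh = torch.linalg.svd(w_new - w_old, full_matrices=False)
        self.register_buffer("w_old", w_old)
        self.register_buffer("u", u)
        self.register_buffer("s", s)
        self.register_buffer("vh", vh)
        self.gate = nn.Parameter(torch.ones_like(s))    # the only trainable part

    def weight(self) -> torch.Tensor:
        return self.w_old + self.u @ torch.diag(self.gate * self.s) @ self.vh

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weight().T
```

Because only the gate vector is optimized, even a single rehearsal utterance per past task provides a usable signal for deciding which directions of the update to keep.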
The experiments conducted on diverse datasets, including monolingual and multilingual benchmarks, demonstrate the robustness of the proposed method. The results show that the method significantly reduces forgetting and outperforms state-of-the-art approaches, even with minimal memory sizes. The thorough evaluation across different scenarios and the use of statistical significance testing strengthen the credibility of the findings. However, the paper could benefit from more detailed comparisons with a broader range of methods.
The paper provides a GitHub repository link for code and implementation details, which is a positive aspect for reproducibility. However, the paper could enhance reproducibility by including more specific hyperparameter settings and training configurations used in the experiments.
One limitation is the reliance on a memory buffer, which, while minimized, still requires careful management to ensure effective learning. Additionally, the method's performance in scenarios with highly variable data distributions could be further explored. The paper does not discuss the potential computational overhead introduced by the SVD process, which could affect real-time applications.
The proposed method has significant implications for the deployment of ASR systems in dynamic environments, where continual learning is essential. By mitigating catastrophic forgetting with minimal memory usage, this approach can enhance the adaptability of ASR systems in real-world applications, such as voice assistants and transcription services, while also addressing privacy concerns related to data storage. This paper presents a significant advancement in continual learning for ASR by introducing a memory-efficient rehearsal method that effectively balances learning new tasks with retaining knowledge from previous tasks. The methodology and experimental results collectively contribute to the ongoing efforts to improve the robustness and adaptability of machine learning models in dynamic settings.
Real-world audio recordings often contain multiple speakers and various degradations, which limit both the quantity and quality of speech data available for building state-of-the-art speech processing models. Although end-to-end approaches that concatenate speech enhancement (SE) and speech separation (SS) to obtain a clean speech signal for each speaker are promising, conventional SE-SS methods struggle with complex degradations beyond additive noise. To this end, we propose Geneses, a generative framework to achieve unified, high-quality SE-SS. Geneses leverages latent flow matching to estimate each speaker's clean speech features using a multi-modal diffusion Transformer conditioned on self-supervised learning representations of the noisy mixture. We conduct experimental evaluation using two-speaker mixtures from LibriTTS-R under two conditions: additive-noise-only and complex degradations. The results demonstrate that Geneses significantly outperforms a conventional mask-based SE-SS method across various objective metrics, with high robustness against complex degradations. Audio samples are available on our demo page.
Primary: The University of Tokyo
All Institutions: The University of Tokyo, National Institute of Advanced Industrial Science and Technology (AIST)
Geneses presents a novel generative framework for unified speech enhancement and separation, leveraging advanced methodologies to significantly improve performance in challenging audio conditions. This work is a meaningful contribution to the field of speech processing, addressing critical limitations of existing approaches and demonstrating potential for real-world applications.
The proposed methodology of Geneses integrates latent flow matching with a multi-modal diffusion Transformer, which is innovative in addressing the challenges of speech enhancement and separation in complex audio environments. By conditioning on self-supervised learning representations, the approach effectively utilizes the strengths of generative models to enhance and separate speech signals. This methodology shows promise in overcoming the limitations of traditional mask-based methods, particularly in handling non-additive noise and other complex degradations.
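The latent flow-matching objective the framework builds on has a standard generic form, sketched below: interpolate between noise and the clean-feature target along a straight path and regress the constant velocity, conditioned on features of the noisy mixture. The network signature, the straight interpolation path, and the tensor shapes are placeholder assumptions.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module, clean_feats: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    """Generic (rectified) flow matching loss in a latent feature space.

    clean_feats: (batch, frames, dim) target clean-speech features.
    cond: conditioning features, e.g. SSL representations of the noisy mixture.
    velocity_net: predicts the velocity field v(x_t, t, cond).
    """
    noise = torch.randn_like(clean_feats)
    t = torch.rand(clean_feats.shape[0], 1, 1, device=clean_feats.device)
    x_t = (1.0 - t) * noise + t * clean_feats          # straight interpolation path
    target_v = clean_feats - noise                     # constant target velocity
    pred_v = velocity_net(x_t, t.squeeze(-1).squeeze(-1), cond)
    return nn.functional.mse_loss(pred_v, target_v)
```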
The experimental evaluation is robust, utilizing the LibriTTS-R dataset under two distinct conditions, which allows for a thorough comparison of Geneses against conventional methods. The metrics used for evaluation are not specified in the abstract but are likely comprehensive given the context. The results indicate a significant performance improvement, which is a strong indicator of the method's effectiveness.
The paper does not provide detailed implementation specifics or code availability, which raises concerns about reproducibility. However, the mention of audio samples on a demo page suggests some level of practical demonstration, but further details on the model architecture and training process would enhance reproducibility.
While the paper presents promising results, it does not address potential limitations such as the scalability of the model to larger datasets, the computational cost of the generative approach, or the generalizability of the results to different types of audio environments beyond the tested conditions.
The implications of this research are significant for fields such as telecommunications, hearing aids, and voice recognition systems, where clear speech signals are crucial. The ability to enhance and separate speech in noisy environments could lead to advancements in user experience and accessibility in various audio processing applications. Geneses presents a novel generative framework for unified speech enhancement and separation, leveraging advanced methodologies to significantly improve performance in challenging audio conditions. This work is a meaningful contribution to the field of speech processing, addressing critical limitations of existing approaches and demonstrating potential for real-world applications.
Raga identification in Indian Art Music (IAM) remains challenging due to the presence of numerous rarely performed Ragas that are not represented in available training datasets. Traditional classification models struggle in this setting, as they assume a closed set of known categories and therefore fail to recognise or meaningfully group previously unseen Ragas. Recent works have tried categorizing unseen Ragas, but they run into a problem of catastrophic forgetting, where the knowledge of previously seen Ragas is diminished. To address this problem, we adopt a unified learning framework that leverages both labeled and unlabeled audio, enabling the model to discover coherent categories corresponding to the unseen Ragas, while retaining the knowledge of previously known ones. We test our model on benchmark Raga Identification datasets and demonstrate its performance in categorizing previously seen, unseen, and all Raga classes. The proposed approach surpasses the previous NCD-based pipeline even in discovering the unseen Raga categories, offering new insights into representation learning for IAM tasks.
Primary: Indian Institute of Technology Kanpur
All Institutions: Indian Institute of Technology Kanpur, Katholieke Universiteit Leuven
The main contribution of this work is the introduction of a unified framework for Raga identification that effectively balances the discovery of unseen Ragas with the retention of knowledge about known Ragas, addressing the critical issue of catastrophic forgetting in machine learning models. This paper significantly advances the field of music information retrieval by providing a robust solution to a complex problem, showcasing the potential for machine learning to enhance the understanding and classification of intricate musical structures.
The proposed methodology introduces a novel framework that integrates supervised and unsupervised contrastive learning within a shared embedding space, effectively addressing the challenges of catastrophic forgetting and enabling the discovery of unseen Ragas. The use of a CNN-LSTM feature extractor followed by a self-attention encoder is a thoughtful choice, leveraging the strengths of both architectures to capture the intricate melodic structures of Indian Art Music. The methodology is well-structured, with clear delineation of the training phases and loss functions, although it could benefit from more detailed descriptions of hyperparameter tuning and the selection process for the temperature parameter in contrastive losses.
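To make the contrastive objectives concrete, the sketch below shows temperature-scaled unsupervised (NT-Xent) and supervised contrastive losses over a shared embedding space, which is the general family the review describes; the exact loss formulations, encoder outputs, and temperature values used in the paper are not specified here, so everything below is illustrative.

```python
# Illustrative temperature-scaled contrastive losses of the kind the review
# describes (shared embedding space, labeled + unlabeled audio). The actual
# loss formulation, encoder, and temperature in the paper may differ.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """Unsupervised contrastive loss between two augmented views (InfoNCE / NT-Xent)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                                   # (2n, d)
    sim = z @ z.t() / temperature                                    # (2n, 2n)
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))                  # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def sup_con_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: pulls together embeddings sharing a Raga label."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True
    )
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_counts).mean()
```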
The experiments are robust, utilizing two distinct datasets (PIM and Saraga) to evaluate the proposed method's performance across known and unseen classes. The results demonstrate significant improvements over baseline methods, particularly in maintaining accuracy for known classes while effectively discovering new Ragas. The evaluation metrics used (ACC, NMI, ARI, and Silhouette Score) are appropriate for assessing clustering performance, although additional qualitative analyses, such as visualizations of the learned embeddings, could further strengthen the findings.
While the paper mentions that the code will be shared on GitHub, specific implementation details, such as the exact architecture of the CNN-LSTM model and the training configurations, are somewhat lacking. Providing access to the code and detailed instructions would enhance reproducibility. The absence of a demo URL limits the ability for others to interact with the model directly.
A notable limitation is the reliance on the assumption that the unlabelled dataset is strictly disjoint from the labelled one, which may not hold in all real-world scenarios. Additionally, the performance on certain Ragas, such as Bhopali and Shuddha-Kalyan, indicates that the model may still struggle with certain tonal similarities, suggesting room for improvement in distinguishing closely related classes.
The proposed framework has significant implications for music information retrieval and could enhance the accessibility of Indian Art Music by enabling better categorization and discovery of Ragas. The approach could also be adapted for other musical traditions, potentially fostering cross-cultural understanding and appreciation of diverse musical forms.
Forced alignment (FA) predicts start and end timestamps for words or characters in speech, but existing methods are language-specific and prone to cumulative temporal shifts. The multilingual speech understanding and long-sequence processing abilities of speech large language models (SLLMs) make them promising for FA in multilingual, crosslingual, and long-form speech settings. However, directly applying the next-token prediction paradigm of SLLMs to FA results in hallucinations and slow inference. To bridge the gap, we propose LLM-ForcedAligner, reformulating FA as a slot-filling paradigm: timestamps are treated as discrete indices, and special timestamp tokens are inserted as slots into the transcript. Conditioned on the speech embeddings and the transcript with slots, the SLLM directly predicts the time indices at slots. During training, causal attention masking with non-shifted input and label sequences allows each slot to predict its own timestamp index based on itself and preceding context, with loss computed only at slot positions. Dynamic slot insertion enables FA at arbitrary positions. Moreover, non-autoregressive inference is supported, avoiding hallucinations and improving speed. Experiments across multilingual, crosslingual, and long-form speech scenarios show that LLM-ForcedAligner achieves a 69%~78% relative reduction in accumulated averaging shift compared with prior methods. The checkpoint and inference code will be released later.
Primary: Northwestern Polytechnical University
All Institutions: Nanyang Technological University, Northwestern Polytechnical University, College of Computing and Data Science
The paper presents LLM-ForcedAligner, a novel forced alignment method that leverages large language models for accurate timestamp prediction in multilingual and long-form speech scenarios, significantly improving upon traditional approaches.
The proposed LLM-ForcedAligner introduces a novel approach to forced alignment by reformulating the task as a slot-filling paradigm, which significantly deviates from traditional methods that rely on acoustic similarities and language-specific structures. The use of special timestamp tokens as slots allows for a more flexible and efficient prediction of timestamps, leveraging the capabilities of speech large language models (SLLMs). The methodology is well-structured, employing causal attention masking and dynamic slot insertion to enhance prediction accuracy and speed. This approach addresses the limitations of existing methods, particularly in multilingual and long-form speech scenarios.
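The slot-filling loss can be illustrated with a minimal sketch: with non-shifted inputs and labels under a causal mask, cross-entropy is computed only at the inserted slot positions. The slot token id, tensor layout, and model interface below are assumptions rather than the paper's actual code.

```python
# Minimal sketch of a slot-filling alignment loss: each timestamp-slot position
# predicts its own discrete time index, and the cross-entropy is computed only
# at slot positions. Token ids and shapes here are illustrative.
import torch
import torch.nn.functional as F

def slot_filling_loss(logits, input_ids, time_labels, slot_token_id):
    """
    logits:      (B, T, V) decoder outputs under a causal mask, NOT shifted by one.
    input_ids:   (B, T) transcript tokens with special slot tokens inserted.
    time_labels: (B, T) discrete time indices; only meaningful where a slot sits.
    """
    slot_mask = input_ids.eq(slot_token_id)               # (B, T) True at slot positions
    slot_logits = logits[slot_mask]                        # (N_slots, V)
    slot_targets = time_labels[slot_mask]                  # (N_slots,)
    return F.cross_entropy(slot_logits, slot_targets)
```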
The experiments conducted across various multilingual and crosslingual datasets demonstrate the effectiveness of LLM-ForcedAligner, achieving substantial reductions in accumulated averaging shift (AAS) compared to existing methods. The evaluation metrics are clearly defined, and the results are presented comprehensively, showcasing the model's performance across different languages and scenarios. The use of both pseudo-timestamp labels and human-labeled datasets adds robustness to the experimental validation.
The paper provides sufficient implementation details, including the architecture of the AuT encoder and the LLM, as well as the training strategy. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. Future work should prioritize making the implementation accessible to facilitate further research and validation.
The primary limitation identified is the reliance on pseudo-timestamp labels generated by the MFA method, which may introduce noise and systematic shifts. Additionally, the evaluation is primarily focused on Chinese, limiting the generalizability of findings across other languages. The uneven language distribution in the training dataset may also affect performance consistency.
The LLM-ForcedAligner has significant potential applications in multilingual speech processing, automatic subtitling, and enhancing accessibility in various domains. Its ability to accurately align speech with transcripts can improve user experiences in educational and entertainment contexts, making it a valuable contribution to the field.
In audiovisual automatic speech recognition (AV-ASR) systems, information fusion of visual features in a pre-trained ASR has been proven as a promising method to improve noise robustness. In this work, based on the prominent Whisper ASR, first, we propose a simple and effective visual fusion method -- use of visual features both in encoder and decoder (dual-use) -- to learn the audiovisual interactions in the encoder and to weigh modalities in the decoder. Second, we compare visual fusion methods in Whisper models of various sizes. Our proposed dual-use method shows consistent noise robustness improvement, e.g., a 35% relative improvement (WER: 4.41% vs. 6.83%) based on Whisper small, and a 57% relative improvement (WER: 4.07% vs. 9.53%) based on Whisper medium, compared to typical reference middle fusion in babble noise with a signal-to-noise ratio (SNR) of 0dB. Third, we conduct ablation studies examining the impact of various module designs and fusion options. Fine-tuned on 1929 hours of audiovisual data, our dual-use method using Whisper medium achieves 4.08% (MUSAN babble noise) and 4.43% (NoiseX babble noise) average WER across various SNRs, thereby establishing a new state-of-the-art in noisy conditions on the LRS3 AV-ASR benchmark. Our code is at https://github.com/ifnspaml/Dual-Use-AVASR
Primary: Institute for Communications Technology
All Institutions: Institute for Communications Technology
The paper presents a novel dual-use method for integrating visual features in AV-ASR systems, significantly improving noise robustness and establishing new benchmarks in the field. The comprehensive evaluation of the methodology and results demonstrates its potential impact on future developments in automatic speech recognition and multimodal systems.
The paper introduces a dual-use method for visual feature integration in AV-ASR systems, leveraging both the encoder and decoder of the Whisper ASR model. This approach is innovative as it aims to enhance noise robustness by modeling audiovisual interactions more effectively than traditional fusion methods. The methodology is well-structured, with clear explanations of the dual-use mechanism and comparisons with existing methods, although further details on the implementation could enhance clarity.
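A hedged sketch of the dual-use idea is given below: the same projected visual features are cross-attended into the encoder states and re-used through a gate on the decoder side to weigh modalities. The projection sizes, attention configuration, and gating form are assumptions, not the paper's exact design.

```python
# Hedged sketch of a "dual-use" fusion module: visual features are (i) fused
# into encoder states to model audiovisual interactions and (ii) re-used in the
# decoder through a learned gate that weighs modalities. Placement and sizes
# are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class DualUseFusion(nn.Module):
    def __init__(self, d_audio=768, d_visual=512, d_model=768):
        super().__init__()
        self.vis_proj = nn.Linear(d_visual, d_model)
        self.enc_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)

    def fuse_encoder(self, audio_states, visual_feats):
        """Cross-attend audio encoder states to projected visual features."""
        v = self.vis_proj(visual_feats)                          # (B, Tv, d)
        attended, _ = self.enc_attn(audio_states, v, v)          # (B, Ta, d)
        return audio_states + attended

    def fuse_decoder(self, dec_states, visual_feats):
        """Gate decoder states with pooled visual evidence to weigh modalities."""
        v = self.vis_proj(visual_feats).mean(dim=1, keepdim=True)   # (B, 1, d)
        v = v.expand(-1, dec_states.size(1), -1)
        g = torch.sigmoid(self.gate(torch.cat([dec_states, v], dim=-1)))
        return g * dec_states + (1.0 - g) * v
```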
The experimental setup is robust, utilizing a significant amount of audiovisual data for fine-tuning and evaluating the proposed method against various baselines. The results demonstrate substantial improvements in word error rates (WER) under noisy conditions, establishing a new state-of-the-art performance. However, the paper could benefit from additional comparative analyses with more diverse models and noise types to strengthen the findings.
The authors provide a GitHub repository with code, which is a positive aspect for reproducibility. However, the paper lacks detailed descriptions of the training processes and hyperparameters, which could hinder full replication of the experiments by other researchers.
One limitation is the reliance on specific datasets (LRS3, LRS2, Voxceleb2) that may not generalize to all AV-ASR applications. Additionally, the paper does not address potential scalability issues when applying the dual-use method to larger models or different languages.
The proposed method has significant implications for real-world applications, particularly in environments with high background noise, such as automotive and smart devices. By improving noise robustness in AV-ASR systems, this research could enhance accessibility and usability in various consumer technologies.
Visual information, such as subtitles in a movie, often helps automatic speech recognition. In this paper, we propose Donut-Whisper, an audio-visual ASR model with dual encoder to leverage visual information to improve speech recognition performance in both English and Chinese. Donut-Whisper combines the advantage of the linear and the Q-Former-based modality alignment structures via a cross-attention module, generating more powerful audio-visual features. Meanwhile, we propose a lightweight knowledge distillation scheme showcasing the potential of using audio-visual models to teach audio-only models to achieve better performance. Moreover, we propose a new multilingual audio-visual speech recognition dataset based on movie clips containing both Chinese and English partitions. As a result, Donut-Whisper achieved significantly better performance on both English and Chinese partition of the dataset compared to both Donut and Whisper large V3 baselines. In particular, an absolute 5.75% WER reduction and a 16.5% absolute CER reduction were achieved on the English and Chinese sets respectively compared to the Whisper ASR baseline.
Primary: Tsinghua University
All Institutions: Tsinghua University
The main contribution of this paper is the introduction of Donut-Whisper, a novel audio-visual ASR model that effectively integrates visual information from subtitles to improve speech recognition accuracy in both English and Chinese. This work represents a meaningful advancement in the field of multimodal machine learning, particularly in enhancing the robustness of ASR systems.
The proposed Donut-Whisper model employs a dual-encoder architecture that effectively integrates audio and visual modalities through a cross-attention mechanism. This design allows the model to leverage both audio signals and visual cues (subtitles) to enhance speech recognition performance. The use of a sliding-window Q-Former for local aggregation of audio features is particularly innovative, as it addresses the temporal locality of audio data, which is crucial for accurate speech recognition. The lightweight knowledge distillation approach adds further value by enabling the transfer of knowledge from a multimodal teacher to an audio-only student model, showcasing the potential for improving performance in low-resource scenarios.
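The lightweight distillation scheme can be illustrated with a standard teacher-student objective in which an audio-only student matches the output distribution of the audio-visual teacher while also fitting the reference transcript; the temperature and loss weighting below are placeholder values, not those reported in the paper.

```python
# Illustrative knowledge-distillation objective of the kind the review mentions:
# an audio-only student matches the token distribution of an audio-visual
# teacher while still fitting the ground-truth transcript. Temperature and
# weighting are assumptions, not values from the paper.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, pad_id, T=2.0, alpha=0.5):
    """student_logits / teacher_logits: (B, L, V); targets: (B, L) token ids."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(
        student_logits.flatten(0, 1), targets.flatten(), ignore_index=pad_id
    )
    return alpha * kd + (1.0 - alpha) * ce
```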
The experiments are well-structured, utilizing a newly created multilingual audio-visual dataset that includes both English and Chinese subtitles. The comparative analysis against strong unimodal baselines (Donut and Whisper) demonstrates the effectiveness of the proposed model, with significant reductions in word and character error rates. The experiments also explore different fusion strategies, providing insights into the effectiveness of various configurations. However, the paper could benefit from more detailed statistical analysis of the results to further substantiate the claims of improvement.
The paper provides a clear description of the model architecture, training procedures, and evaluation metrics, which is essential for reproducibility. However, the absence of a publicly available code repository or demo limits the ability of other researchers to replicate the findings directly. Including such resources would enhance the paper's impact and facilitate further research in this area.
One limitation of the study is the reliance on a specific dataset of movie clips, which may not generalize well to other domains or real-world applications. Additionally, while the model shows improvements in performance, the paper does not address the computational efficiency or latency of the proposed approach, which are critical factors for deployment in real-time applications.
The advancements in audio-visual speech recognition have significant implications for various applications, including accessibility technologies, language learning tools, and multimedia content analysis. By improving ASR performance in noisy environments and with out-of-domain vocabulary, this research could enhance user experiences in diverse settings, from education to entertainment.
Automatic speech quality assessment has become increasingly important as modern speech generation systems continue to advance, while human listening tests remain costly, time-consuming, and difficult to scale. Most existing learning-based assessment models rely primarily on scarce human-annotated mean opinion score (MOS) data, which limits robustness and generalization, especially when training across heterogeneous datasets. In this work, we propose UrgentMOS, a unified speech quality assessment framework that jointly learns from diverse objective and perceptual quality metrics, while explicitly tolerating the absence of arbitrary subsets of metrics during training. By leveraging complementary quality facets under heterogeneous supervision, UrgentMOS enables effective utilization of partially annotated data and improves robustness when trained on large-scale, multi-source datasets. Beyond absolute score prediction, UrgentMOS explicitly models pairwise quality preferences by directly predicting comparative MOS (CMOS), making it well suited for preference-based evaluation scenarios commonly adopted in system benchmarking. Extensive experiments across a wide range of speech quality datasets, including simulated distortions, speech enhancement, and speech synthesis, demonstrate that UrgentMOS consistently achieves state-of-the-art performance in both absolute and comparative evaluation settings.
Primary: Waseda University
All Institutions: Carnegie Mellon University, Shanghai Jiao Tong University, VUI Labs, Waseda University
The main contribution of this paper is the introduction of UrgentMOS, a unified framework for speech quality assessment that effectively leverages diverse metrics and pairwise preferences to improve robustness and generalization in evaluation scenarios. This work represents a meaningful step forward in the field, addressing critical challenges in speech quality assessment and providing a foundation for future research and applications.
The proposed UrgentMOS framework introduces a novel architecture that integrates two main components: the Absolute Metric Prediction Module (AMPM) and the Naturalness-Conditioned Preference Module (NCPM). This dual approach allows for the simultaneous prediction of multiple quality metrics and the modeling of pairwise preferences, which is a significant advancement over traditional methods that often focus on single metrics. The methodology effectively addresses the challenge of training with heterogeneous datasets and partially annotated data, showcasing a robust learning paradigm that tolerates missing metrics during training. The use of cross-attention in the NCPM to model preferences is particularly innovative and indicates a thoughtful approach to leveraging existing data structures.
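The tolerance to missing metrics can be sketched as a masked multi-metric loss: each utterance predicts all quality scores, and unannotated metrics are simply excluded from the objective. The head layout and the L1 loss below are assumptions; the AMPM/NCPM internals are not reproduced here.

```python
# Minimal sketch of training with partially annotated metrics: each utterance
# predicts K quality scores, and a mask zeroes out the loss for metrics that
# are not annotated for that sample. The L1 loss is an illustrative choice.
import torch

def masked_metric_loss(preds, targets, observed_mask):
    """
    preds, targets: (B, K) predicted / reference scores for K quality metrics.
    observed_mask:  (B, K) 1 where the metric is annotated, 0 where missing.
    """
    per_term = torch.abs(preds - targets) * observed_mask
    return per_term.sum() / observed_mask.sum().clamp(min=1)
```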
The experiments conducted are extensive, covering various speech quality datasets, including those with simulated distortions and real-world applications like speech enhancement and synthesis. The results demonstrate that UrgentMOS achieves state-of-the-art performance in both absolute and comparative evaluations, which is a strong indicator of its effectiveness. The thoroughness of the experimental setup, including diverse datasets, enhances the credibility of the findings and showcases the framework's versatility.
While the paper outlines the architecture and methodology, it lacks detailed implementation specifics that would facilitate reproducibility. There is no mention of code availability or supplementary materials that could aid other researchers in replicating the study. This is a significant aspect that needs improvement, as reproducibility is crucial in machine learning research.
The paper acknowledges that UrgentMOS does not provide explicit natural language explanations for its quality judgments, which could limit its applicability in contexts where interpretability is essential. Additionally, while the framework improves robustness with multiple feature extractors, it also increases inference costs, which may hinder its deployment in latency-sensitive applications. The potential degradation of performance with a large number of heterogeneous metrics is another limitation that the authors recognize, suggesting a need for future work to optimize metric selection.
The advancements presented in UrgentMOS have the potential to significantly impact the field of speech quality assessment, particularly in applications where human evaluation is impractical. By improving the robustness and generalization of speech quality models, this research could enhance the development of speech generation systems, leading to better user experiences in various domains, including telecommunications, virtual assistants, and entertainment.
Underwater acoustic target recognition (UATR) plays a vital role in marine applications but remains challenging due to limited labeled data and the complexity of ocean environments. This paper explores a central question: can speech large models (SLMs), trained on massive human speech corpora, be effectively transferred to underwater acoustics? To investigate this, we propose UATR-SLM, a simple framework that reuses the speech feature pipeline, adapts the SLM as an acoustic encoder, and adds a lightweight classifier. Experiments on the DeepShip and ShipsEar benchmarks show that UATR-SLM achieves over 99% in-domain accuracy, maintains strong robustness across variable signal lengths, and reaches up to 96.67% accuracy in cross-domain evaluation. These results highlight the strong transferability of SLMs to UATR, establishing a promising paradigm for leveraging speech foundation models in underwater acoustics.
Primary: Harbin Engineering University
All Institutions: Harbin Engineering University
This work establishes a new paradigm for underwater acoustic target recognition by demonstrating the effective adaptation of speech large models to a physically distinct domain. The innovative approach and strong experimental results highlight the potential for significant advancements in the field of underwater acoustics.
The proposed UATR-SLM framework effectively adapts speech large models (SLMs) for underwater acoustic target recognition by reusing the speech feature extraction pipeline and employing a lightweight classifier. This approach is innovative as it leverages existing knowledge from SLMs, which have been trained on extensive human speech datasets, to address the challenges of data scarcity and domain variability in underwater environments. The methodology is well-structured, allowing for comprehensive fine-tuning of the encoder while maintaining a simple architecture that is suitable for practical applications.
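The framework's simplicity can be conveyed with a short sketch of the encoder-plus-lightweight-classifier pattern; the encoder interface, hidden size, and pooling below are placeholders rather than the paper's exact configuration.

```python
# Sketch of the "pretrained encoder + lightweight classifier" pattern the
# review describes. The encoder interface and hidden size are placeholders;
# the paper's exact SLM backbone and pooling strategy are not specified here.
import torch.nn as nn

class LightweightUATRHead(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder                      # e.g., a pretrained speech model
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, features):
        frames = self.encoder(features)             # (B, T, hidden_dim) frame embeddings
        pooled = frames.mean(dim=1)                 # temporal mean pooling
        return self.classifier(pooled)              # (B, num_classes) ship-class logits
```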
The experiments are robust, utilizing two well-established datasets (DeepShip and ShipsEar) to validate the performance of UATR-SLM. The results demonstrate state-of-the-art performance, with over 99% accuracy in in-domain evaluations and strong robustness across variable signal lengths. The cross-domain evaluation further highlights the model's generalization capabilities, achieving significant accuracy even in zero-shot conditions. The comprehensive evaluation metrics used (accuracy, precision, recall, F1-score) provide a thorough assessment of the model's performance.
While the paper outlines the experimental settings and model configurations, it lacks detailed implementation specifics that would aid in reproducibility, such as code availability or links to datasets. The absence of a demo or project URL further limits the ability for others to replicate the findings.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of underwater acoustic environments. Additionally, while the model shows promise, the paper does not address potential challenges in real-world deployment, such as varying noise conditions or the need for real-time processing.
The findings have significant implications for marine research, security, and environmental monitoring, as they suggest that advanced speech models can be repurposed for critical underwater applications. This could lead to more efficient and effective systems for recognizing underwater targets, ultimately benefiting various marine operations.
Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation where listeners often start thinking before the speaker finishes. Since cascaded architectures remain the dominant choice for complex tasks, existing cascaded streaming strategies attempt to reduce this latency via mechanical segmentation (e.g., fixed chunks, VAD-based splitting) or speculative generation, but they frequently either break semantic units or waste computation on predictions that must be rolled back. To address these challenges, we propose LTS-VoiceAgent, a Listen-Think-Speak framework that explicitly separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker (for state maintenance) and a foreground Speaker (for speculative solving). This parallel design enables "thinking while speaking" without blocking responses. We also introduce a Pause-and-Repair benchmark containing natural disfluencies to stress-test streaming robustness. Experiments across VERA, Spoken-MQA, BigBenchAudio, and our benchmark show that LTS-VoiceAgent achieves a stronger accuracy-latency-efficiency trade-off than serial cascaded baselines and existing streaming strategies.
Primary: Meituan
All Institutions: Meituan
The main contribution of this paper is the introduction of LTS-VoiceAgent, a novel framework that enhances real-time voice interaction by enabling simultaneous reasoning and speech generation. This advancement addresses critical challenges in latency and reasoning depth, making it a valuable contribution to the field of audio machine learning.
The proposed LTS-VoiceAgent framework introduces a novel Listen-Think-Speak architecture that effectively separates reasoning from speech generation, allowing for more natural and efficient voice interactions. The Dynamic Semantic Trigger and Dual-Role Stream Orchestrator are innovative components that address the latency and reasoning depth challenges in real-time voice agents. The methodology is well-structured, leveraging self-supervised data synthesis for training the trigger, and it employs a robust asynchronous coordination mechanism between the Thinker and Speaker roles. This design allows for incremental reasoning and efficient state management, which is a significant advancement over traditional cascaded architectures.
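The "when to think" decision can be illustrated with a toy trigger loop: a scorer over the growing ASR prefix hands semantically complete prefixes to a background thinker while the stream continues. The scorer, threshold, and queue-based hand-off below are illustrative assumptions, not the paper's orchestration logic.

```python
# Toy sketch of a prefix-level "when to think" trigger: as partial ASR text
# streams in, a scorer estimates whether the prefix already forms a meaningful
# semantic unit; once it crosses a threshold, background reasoning starts while
# audio keeps streaming. All interfaces here are illustrative.
import queue
import threading
from typing import Callable, Iterable

def run_trigger_loop(prefixes: Iterable[str],
                     score_prefix: Callable[[str], float],
                     think: Callable[[str], None],
                     threshold: float = 0.8):
    work = queue.Queue()

    def thinker():
        while True:
            prefix = work.get()
            if prefix is None:          # sentinel: stream finished
                break
            think(prefix)               # background reasoning on the triggered prefix

    worker = threading.Thread(target=thinker, daemon=True)
    worker.start()
    for prefix in prefixes:             # growing partial transcripts from streaming ASR
        if score_prefix(prefix) >= threshold:
            work.put(prefix)            # hand off a semantically complete prefix
    work.put(None)
    worker.join()
```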
The experiments conducted across multiple benchmarks (VERA, Spoken-MQA, BigBenchAudio, and the newly introduced Pause-and-Repair benchmark) demonstrate the effectiveness of LTS-VoiceAgent in achieving a superior accuracy-latency-efficiency trade-off compared to existing methods. The use of realistic ASR behavior in the evaluation enhances the credibility of the results. The ablation studies provide clear insights into the contributions of the Dynamic Semantic Trigger and the orchestration mechanism, reinforcing the robustness of the proposed approach.
The paper provides detailed implementation information, including the architecture of the Dynamic Semantic Trigger and the orchestration mechanism. However, the lack of publicly available code or datasets limits reproducibility. The authors mention using a unified internal streaming ASR API, but without access to the actual implementation, it may be challenging for others to replicate the results fully.
The paper acknowledges several limitations, including the narrow range of backbones and languages tested, the inability to capture the full diversity of real conversations, and the lack of human-subject studies. Additionally, the Pause-and-Repair benchmark, while innovative, may not encompass all real-world scenarios, such as multi-turn dialogues and various accents.
The LTS-VoiceAgent framework has significant implications for the development of more responsive and intelligent voice interaction systems. Its ability to handle interruptions and generate near-instant replies could enhance user experience in various applications, from customer service to personal assistants. However, the potential for misuse in creating more convincing automated systems for social engineering must be considered, along with the ethical implications of continuous audio processing.
Recent progress in voice conversion (VC) has achieved a new milestone in speaker cloning and linguistic preservation. However, the field remains fragmented, relying on specialized models for linguistic-preserving, expressive, and singing scenarios. We propose OneVoice, a unified zero-shot framework capable of handling all three scenarios within a single model. OneVoice is built upon a continuous language model trained with VAE-free next-patch diffusion, ensuring high fidelity and efficient sequence modeling. Its core design for unification lies in a Mixture-of-Experts (MoE) designed to explicitly model shared conversion knowledge and scenario-specific expressivity. Expert selection is coordinated by a dual-path routing mechanism, including shared expert isolation and scenario-aware domain expert assignment with global-local cues. For precise conditioning, scenario-specific prosodic features are fused into each layer via a gated mechanism, allowing adaptive usage of prosody information. Furthermore, to enable the core idea and alleviate the data-imbalance issue (abundant speech vs. scarce singing), we adopt a two-stage progressive training that includes foundational pre-training and scenario enhancement with LoRA-based domain experts. Experiments show that OneVoice matches or surpasses specialized models across all three scenarios, while verifying flexible control over scenarios and offering a fast decoding version with as few as 2 steps. Code and model will be released soon.
Primary: JIUTIAN Research
All Institutions: JIUTIAN Research
OneVoice represents a significant advancement in the field of voice conversion by proposing a unified framework that effectively integrates multiple scenarios into a single model. The innovative use of a Mixture-of-Experts architecture and a dual-path routing mechanism enhances the model's flexibility and performance, paving the way for future developments in voice synthesis technologies.
The proposed methodology in OneVoice is innovative, utilizing a Mixture-of-Experts (MoE) architecture to unify three distinct voice conversion scenarios (linguistic-preserving, expressive, and singing) within a single model. The dual-path routing mechanism and scenario-specific prosodic conditioning are particularly noteworthy, allowing the model to dynamically adapt to different input scenarios while maintaining high fidelity. The two-stage progressive training approach effectively addresses the challenge of imbalanced data between speech and singing, showcasing a thoughtful design that balances foundational knowledge with scenario-specific enhancements.
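The shared-expert-plus-domain-expert design can be sketched as follows: a shared expert is always applied, while a router mixes scenario-specific experts based on a global cue. Expert counts, router inputs, and the mixing rule are assumptions; the paper's dual-path routing and prosody gating are more elaborate.

```python
# Hedged sketch of "shared expert isolation + scenario-aware domain experts":
# a shared expert is always on, and a router softly mixes domain experts using
# a global scenario cue. This is an illustration, not the paper's exact layer.
import torch
import torch.nn as nn

def _ffn(d_model):
    return nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                         nn.Linear(4 * d_model, d_model))

class SharedPlusDomainMoE(nn.Module):
    def __init__(self, d_model=512, num_domain_experts=3):
        super().__init__()
        self.shared_expert = _ffn(d_model)
        self.domain_experts = nn.ModuleList([_ffn(d_model) for _ in range(num_domain_experts)])
        self.router = nn.Linear(d_model, num_domain_experts)

    def forward(self, x, scenario_embedding):
        """x: (B, T, d) hidden states; scenario_embedding: (B, d) global routing cue."""
        weights = torch.softmax(self.router(scenario_embedding), dim=-1)        # (B, E)
        domain_out = torch.stack([e(x) for e in self.domain_experts], dim=-1)   # (B, T, d, E)
        routed = (domain_out * weights[:, None, None, :]).sum(dim=-1)           # (B, T, d)
        return x + self.shared_expert(x) + routed                               # shared path always active
```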
The experiments conducted are extensive and demonstrate the efficacy of OneVoice across various scenarios. The evaluation metrics used, including content enjoyment, character error rate, and speaker similarity, provide a comprehensive view of the model's performance. Results indicate that OneVoice matches or surpasses specialized models, which is a significant achievement. However, the paper could benefit from more detailed comparisons with a broader range of existing models to further validate its claims.
The paper outlines the experimental setup, including the datasets used and the training parameters, which aids in reproducibility. However, the lack of a publicly available code repository at the time of review limits the ability for other researchers to replicate the results fully. The authors mention that the code and model will be released soon, which is a positive step towards enhancing reproducibility.
While OneVoice shows promising results, it still faces limitations, particularly in matching subjective preferences in voice conversion. The non-streaming architecture may also restrict its applicability in real-time scenarios. Future work is suggested to address these limitations, including optimizing for preference data and enhancing streaming capabilities.
The potential applications of OneVoice are significant, ranging from entertainment and gaming to assistive technologies and privacy protection in communication. By providing a unified framework for voice conversion, this research could lead to more adaptive and scalable systems that cater to diverse user needs.
Automated piano performance evaluation traditionally relies on symbolic (MIDI) representations, which capture note-level information but miss the acoustic nuances that characterize expressive playing. I propose using pre-trained audio foundation models, specifically MuQ and MERT, to predict 19 perceptual dimensions of piano performance quality. Using synthesized audio from PercePiano MIDI files (rendered via Pianoteq), I compare audio and symbolic approaches under controlled conditions where both derive from identical source data. The best model, MuQ layers 9-12 with Pianoteq soundfont augmentation, achieves R^2 = 0.537 (95% CI: [0.465, 0.575]), representing a 55% improvement over the symbolic baseline (R^2 = 0.347). Statistical analysis confirms significance (p < 10^-25) with audio outperforming symbolic on all 19 dimensions. I validate the approach through cross-soundfont generalization (R^2 = 0.534 +/- 0.075), difficulty correlation with an external dataset (rho = 0.623), and multi-performer consistency analysis. Analysis of audio-symbolic fusion reveals high error correlation (r = 0.738), explaining why fusion provides minimal benefit: audio representations alone are sufficient. I release the complete training pipeline, pretrained models, and inference code.
Primary: unknown
All Institutions: unknown
This paper demonstrates the superiority of audio foundation models over symbolic representations for evaluating piano performances, achieving significant improvements in predictive accuracy across multiple dimensions of performance quality. The comprehensive methodology, rigorous experimental validation, and potential for broader applications underscore its importance in advancing the field of automated music evaluation.
The methodology is robust, leveraging pre-trained audio foundation models (MuQ and MERT) to evaluate piano performance quality across 19 perceptual dimensions. The paper effectively addresses the limitations of traditional symbolic representations by utilizing audio features that capture acoustic nuances. The controlled experiments and comprehensive ablation studies demonstrate a thorough understanding of the task and provide clear insights into the model architecture and training process.
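The probing setup the review describes can be sketched as averaging a band of encoder layers (layers 9-12 here), pooling over time, and regressing the 19 perceptual dimensions; the encoder interface returning per-layer hidden states is an assumption.

```python
# Sketch of the "mid-layer features + linear head" setup the review describes:
# average hidden states from a band of encoder layers, mean-pool over time,
# and regress the 19 perceptual dimensions. The per-layer hidden-state list
# interface and pooling are assumptions, not the paper's exact pipeline.
import torch
import torch.nn as nn

class LayerBandRegressor(nn.Module):
    def __init__(self, hidden_dim: int, num_dims: int = 19, layers=(9, 10, 11, 12)):
        super().__init__()
        self.layers = layers
        self.head = nn.Linear(hidden_dim, num_dims)

    def forward(self, hidden_states):
        """hidden_states: list of (B, T, hidden_dim) tensors, one per encoder layer."""
        band = torch.stack([hidden_states[i] for i in self.layers], dim=0).mean(dim=0)
        pooled = band.mean(dim=1)                   # temporal mean pooling -> (B, hidden_dim)
        return self.head(pooled)                    # (B, 19) perceptual-dimension scores
```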
The experiments are well-structured, utilizing a large dataset (PercePiano) and employing rigorous statistical analysis to validate results. The paper reports significant improvements in performance metrics (R^2) when using audio representations compared to symbolic baselines. The validation through cross-soundfont generalization and external datasets adds credibility to the findings, although the reliance on synthesized audio is a point of concern.
The authors release the complete training pipeline, pretrained models, and inference code, which is commendable and enhances reproducibility. However, the paper could benefit from clearer documentation on the implementation details to facilitate easier replication by other researchers.
The primary limitation is the use of synthesized audio from MIDI files, which may not fully capture the complexities of real piano performances. Additionally, the results are validated mainly on the PercePiano dataset, and further validation on diverse datasets would strengthen the claims. The model's focus on piece-level characteristics may limit its applicability for fine-grained performer comparisons.
This research has significant implications for the field of music information retrieval (MIR) and automated performance evaluation. By demonstrating the effectiveness of audio foundation models, the findings could influence future research directions and applications in music education, performance assessment, and even in developing intelligent music systems.
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing for up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
Primary: Core contributors
All Institutions: Core contributors
The main contribution of this paper is the introduction of VibeVoice-ASR, a unified framework for long-form speech understanding that effectively addresses context fragmentation and multi-speaker challenges through innovative single-pass processing and context injection mechanisms. This work represents a meaningful advancement in the field of speech recognition, particularly for applications requiring high fidelity in complex audio environments.
The methodology presented in VibeVoice-ASR is innovative, particularly in its approach to unify ASR, speaker diarization, and timestamping into a single end-to-end task. The use of a single-pass processing mechanism for long-form audio is a significant departure from traditional chunk-based methods, addressing context fragmentation effectively. The prompt-based context injection mechanism is a noteworthy addition, allowing for customization that enhances accuracy in domain-specific scenarios. However, the paper lacks detailed descriptions of the model architecture and the specific algorithms employed, which would have strengthened the methodological rigor.
The experiments conducted are comprehensive, utilizing multiple datasets to evaluate the performance of the proposed model against state-of-the-art systems. The metrics chosen (DER, WER, cpWER, tcpWER) provide a well-rounded assessment of the system's capabilities. The results indicate a clear advantage in speaker modeling and transcription accuracy, which is a strong point of the paper. However, the absence of a detailed comparison with other recent models in the same domain limits the contextual understanding of the contributions.
The paper mentions plans for open-sourcing model weights and fine-tuning pipelines, which is a positive step towards reproducibility. However, the lack of specific implementation details, such as hyperparameters and training configurations, may hinder full reproducibility by other researchers. The reliance on several external models and tools without detailed integration steps also poses challenges for replication.
The authors acknowledge limitations, including potential multilingual forgetting during the supervised fine-tuning phase and challenges with overlapping speech. These limitations are significant, as they highlight areas where the model may not perform optimally, particularly in multilingual contexts or in scenarios with multiple speakers talking simultaneously.
The framework has the potential to significantly impact various applications, including real-time transcription services for meetings, podcasts, and educational lectures. By addressing the complexities of long-form audio processing and supporting multiple languages, VibeVoice-ASR could enhance accessibility and usability in diverse linguistic environments. The commitment to open-source development further encourages community engagement and adaptation of the technology.
Emotion recognition is inherently ambiguous, with uncertainty arising both from rater disagreement and from discrepancies across modalities such as speech and text. There is growing interest in modeling rater ambiguity using label distributions. However, modality ambiguity remains underexplored, and multimodal approaches often rely on simple feature fusion without explicitly addressing conflicts between modalities. In this work, we propose AmbER$^2$, a dual ambiguity-aware framework that simultaneously models rater-level and modality-level ambiguity through a teacher-student architecture with a distribution-wise training objective. Evaluations on IEMOCAP and MSP-Podcast show that AmbER$^2$ consistently improves distributional fidelity over conventional cross-entropy baselines and achieves performance competitive with, or superior to, recent state-of-the-art systems. For example, on IEMOCAP, AmbER$^2$ achieves relative improvements of 20.3% on Bhattacharyya coefficient (0.83 vs. 0.69), 13.6% on R$^2$ (0.67 vs. 0.59), 3.8% on accuracy (0.683 vs. 0.658), and 4.5% on F1 (0.675 vs. 0.646). Further analysis across ambiguity levels shows that explicitly modeling ambiguity is particularly beneficial for highly uncertain samples. These findings highlight the importance of jointly addressing rater and modality ambiguity when building robust emotion recognition systems.
Primary: Massachusetts Institute of Technology
All Institutions: Massachusetts Institute of Technology
The main contribution of this paper is the introduction of AmbER$^2$, a dual ambiguity-aware emotion recognition framework that effectively models both rater and modality ambiguity, significantly improving the fidelity of emotion predictions in multimodal contexts. This work represents a meaningful advancement in the field of emotion recognition, particularly in its approach to handling the inherent complexities and uncertainties associated with emotional expression across different modalities.
The proposed AmbER$^2$ framework introduces a dual ambiguity-aware approach to emotion recognition, effectively addressing both rater and modality ambiguity through a teacher-student architecture. This methodology is innovative as it combines distribution-wise training objectives with adaptive guidance from modality-specific heads, which is a significant advancement over traditional feature fusion methods. The use of a weighted consistency loss that adjusts the influence of modality experts based on their reliability is particularly noteworthy, contributing to a more nuanced understanding of emotional cues across different modalities.
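The distribution-wise objective can be illustrated with a sketch in which the fused head fits the soft rater label distribution via a KL term and a reliability-weighted consistency term ties it to the modality-specific heads; the reliability weights and exact loss forms below are assumptions.

```python
# Illustrative distribution-wise objective: the fused head fits the soft rater
# label distribution with a KL term, and a consistency term pulls it toward
# modality-specific heads in proportion to per-modality reliability weights.
# The paper's actual reliability estimate and weighting may differ.
import torch
import torch.nn.functional as F

def dual_ambiguity_loss(fused_logits, speech_logits, text_logits,
                        rater_distribution, reliability, lam=0.5):
    """
    fused/speech/text_logits: (B, C) emotion logits.
    rater_distribution:       (B, C) normalized rater vote distribution.
    reliability:              (B, 2) weights for the (speech, text) modality heads.
    """
    log_p = F.log_softmax(fused_logits, dim=-1)
    label_loss = F.kl_div(log_p, rater_distribution, reduction="batchmean")
    cons_speech = F.kl_div(log_p, F.softmax(speech_logits, dim=-1), reduction="none").sum(-1)
    cons_text = F.kl_div(log_p, F.softmax(text_logits, dim=-1), reduction="none").sum(-1)
    consistency = (reliability[:, 0] * cons_speech + reliability[:, 1] * cons_text).mean()
    return label_loss + lam * consistency
```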
The experiments conducted on the IEMOCAP and MSP-Podcast datasets are comprehensive and well-structured, demonstrating the effectiveness of the proposed framework against conventional baselines and state-of-the-art systems. The reported improvements in distributional metrics (e.g., Bhattacharyya coefficient, R²) and classification metrics (e.g., accuracy, F1 score) provide strong evidence of the framework's performance. The analysis across different ambiguity levels adds depth to the evaluation, showcasing the framework's robustness in handling varying degrees of uncertainty.
The paper provides sufficient implementation details, including the architecture of the models, training parameters, and the datasets used. However, the absence of a publicly available code repository or demo URL limits reproducibility. Future work should consider releasing the code to facilitate validation and further exploration by the research community.
One limitation is the reliance on specific datasets (IEMOCAP and MSP-Podcast), which may not fully represent the diversity of real-world emotional expressions across cultures and languages. Additionally, while the framework shows promise in handling ambiguity, the complexity of the model may pose challenges in real-time applications. The paper could also benefit from a more detailed discussion on the computational efficiency and scalability of the proposed approach.
The findings of this research have significant implications for the development of more robust emotion recognition systems, which can enhance human-machine interactions in various applications, including virtual assistants, mental health monitoring, and customer service automation. By addressing ambiguity in emotion recognition, the framework paves the way for more human-aligned affective computing systems that can better understand and respond to human emotions.
Spatial information is a critical clue for multi-channel multi-speaker target speech recognition. Most state-of-the-art multi-channel Automatic Speech Recognition (ASR) systems extract spatial features only during the speech separation stage, followed by standard single-channel ASR on the separated speech. This approach results in an inefficient, lengthy pipeline and sub-optimal ASR performance due to the accumulated errors from preprocessing modules. Furthermore, most spatial feature extraction methods depend on the knowledge of speaker positions and microphone topology, making the systems reliant on specific settings and challenging to adapt to new equipment. In this work, we propose a solution to these issues with a lightweight embedding module named SpatialEmb, which extracts and encodes spatial information directly for the ASR model, supporting both fixed and arbitrary microphone topology. We conduct comprehensive experiments on AliMeeting, a real meeting corpus, to determine the optimal model design for SpatialEmb in terms of both performance and efficiency. Our best model trained with 105 hours Train-Ali-far achieves 17.04% and 20.32% character error rates (CER) on the Eval and Test sets, establishing a new state-of-the-art result with the same training data.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Tencent AI Lab
The main contribution of this paper is the development of SpatialEmb, a novel embedding module that enhances multi-channel ASR performance by directly integrating spatial information, leading to improved efficiency and accuracy in speech recognition tasks. This work represents a meaningful step forward in the field of audio processing and ASR, addressing critical limitations of existing systems.
The paper introduces a novel embedding module, SpatialEmb, which directly extracts and encodes spatial information for ASR, bypassing traditional multi-stage systems that rely on preprocessing. The methodology is well-structured, employing a lightweight design that supports arbitrary microphone topologies. The use of various embedding structures (Conv2d, ConvNext, GRU-Conv2d) and the parameter-free divide-average-concatenate (DAC) method to enhance efficiency is particularly innovative. The integration of spatial features with spectral features in a 1-stage ASR system is a significant advancement over existing methods.
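One plausible reading of the parameter-free divide-average-concatenate step is sketched below: channels are split into groups, features are averaged within each group, and the group averages are concatenated so arbitrary channel counts map to a fixed-size representation. The actual grouping rule in the paper may differ.

```python
# A parameter-free divide-average-concatenate sketch, under the assumption that
# it groups microphone channels, averages features within each group, and
# concatenates the group averages so arbitrary channel counts map to a
# fixed-size spatial feature. The paper's exact grouping rule may differ.
import torch

def divide_average_concatenate(channel_feats, num_groups=4):
    """channel_feats: (B, C, T, F) per-channel features; assumes C >= num_groups."""
    groups = torch.chunk(channel_feats, num_groups, dim=1)        # split channels into groups
    averaged = [g.mean(dim=1) for g in groups]                    # (B, T, F) per group
    return torch.cat(averaged, dim=-1)                            # (B, T, F * num_groups)
```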
The authors conduct comprehensive experiments on the AliMeeting dataset, demonstrating the effectiveness of their proposed model. The results show a clear improvement in character error rates (CER) compared to previous state-of-the-art systems, establishing the proposed method as a competitive alternative. The evaluation metrics are robust, and the experiments are well-documented, providing a thorough comparison with existing techniques.
The paper references the Icefall framework for implementation, which aids reproducibility. However, the lack of a demo or direct access to the code repository limits the ease with which other researchers can replicate the results. Detailed descriptions of the experimental setup and parameters are provided, which is beneficial for reproducibility.
One limitation is the reliance on the AliMeeting dataset, which may not generalize well to other domains or languages. Additionally, while the proposed method supports arbitrary microphone topologies, the performance in real-world scenarios with varying conditions remains to be fully validated. The computational efficiency, while improved, still may not meet the demands of all real-time applications.
The advancements in multi-channel ASR systems have significant implications for applications in real-time communication, such as virtual meetings and automated transcription services. The ability to handle arbitrary microphone arrays enhances the adaptability of ASR systems in diverse environments, potentially leading to broader adoption in various industries.
Automatic speech recognition (ASR) systems based on large language models (LLMs) achieve superior performance by leveraging pretrained LLMs as decoders, but their token-by-token generation mechanism leads to inference latency that grows linearly with sequence length. Meanwhile, discrete diffusion large language models (dLLMs) offer a promising alternative, enabling high-quality parallel sequence generation with pretrained decoders. However, directly applying native text-oriented dLLMs to ASR leads to a fundamental mismatch between open-ended text generation and the acoustically conditioned transcription paradigm required by ASR. As a result, it introduces unnecessary difficulty and computational redundancy, such as denoising from pure noise, inflexible generation lengths, and fixed denoising steps. We propose dLLM-ASR, an efficient dLLM-based ASR framework that formulates dLLM's decoding as a prior-guided and adaptive denoising process. It leverages an ASR prior to initialize the denoising process and provide an anchor for sequence length. Building upon this prior, length-adaptive pruning dynamically removes redundant tokens, while confidence-based denoising allows converged tokens to exit the denoising loop early, enabling token-level adaptive computation. Experiments demonstrate that dLLM-ASR achieves recognition accuracy comparable to autoregressive LLM-based ASR systems and delivers a 4.44$\times$ inference speedup, establishing a practical and efficient paradigm for ASR.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
The paper presents dLLM-ASR, a novel framework that enhances speech recognition efficiency by integrating discrete diffusion large language models with adaptive denoising strategies, achieving a significant speedup while maintaining high accuracy. This work represents a meaningful contribution to the field of automatic speech recognition, addressing critical challenges and paving the way for future innovations in real-time applications.
The proposed dLLM-ASR framework innovatively reformulates the ASR process by leveraging discrete diffusion large language models (dLLMs) in a way that addresses the inherent challenges of applying text-oriented models to speech recognition. The introduction of a prior-guided and adaptive denoising process is a significant methodological advancement, allowing for efficient token-level computation and dynamic sequence length adjustment. The confidence-based denoising and length-adaptive pruning strategies are particularly noteworthy, as they effectively reduce computational overhead while maintaining accuracy.
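The token-level adaptive computation can be illustrated with a toy confidence-based loop: starting from an ASR prior, only unconverged positions are re-predicted at each step, and tokens are frozen once their confidence passes a threshold. The denoiser interface, threshold, and step budget below are illustrative.

```python
# Toy sketch of confidence-based adaptive denoising: at every step the model
# re-predicts only positions that have not yet converged, and a token is frozen
# once its confidence passes a threshold. The denoiser interface is a placeholder.
import torch

@torch.no_grad()
def confidence_adaptive_decode(denoiser, speech_emb, tokens, max_steps=8, tau=0.9):
    """tokens: (B, L) initial hypothesis (e.g., from an ASR prior); returns refined tokens."""
    frozen = torch.zeros_like(tokens, dtype=torch.bool)
    for _ in range(max_steps):
        if frozen.all():
            break                                            # every token has converged
        logits = denoiser(speech_emb, tokens)                # (B, L, V), hypothetical interface
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)                       # (B, L)
        update = ~frozen                                     # only revise unconverged positions
        tokens = torch.where(update, pred, tokens)
        frozen = frozen | (conf >= tau)
    return tokens
```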
The experiments are comprehensive, utilizing multiple well-known datasets (LibriSpeech, CommonVoice, GigaSpeech) and benchmarks to evaluate the performance of dLLM-ASR against various baselines. The results demonstrate that dLLM-ASR achieves competitive recognition accuracy while significantly improving inference speed, showcasing a 4.44x speedup compared to autoregressive models. The ablation studies further validate the contributions of individual components of the framework.
The paper provides detailed implementation specifics, including model architecture, training strategies, and hyperparameters, which facilitate reproducibility. However, the absence of a public code repository or demo limits the ease of replication by other researchers.
While the paper presents a strong framework, it does not address potential limitations in terms of the model's performance on diverse languages or accents, which could affect generalizability. Additionally, the reliance on a specific pretrained dLLM may limit adaptability to other contexts or domains.
The advancements made in dLLM-ASR have significant implications for real-time speech recognition applications, particularly in scenarios where low latency is critical, such as virtual assistants, transcription services, and accessibility technologies. The methodology could inspire further research into integrating diffusion models in other areas of machine learning.
Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and challenging to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can lead to error propagation. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model ASR and child-adult speaker role diarization. The proposed approach integrates: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Comprehensive evaluations on two datasets demonstrate consistent and substantial improvements over two cascaded baselines, achieving lower multi-talker word error rates and demonstrating competitive diarization accuracy across both Whisper-small and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child-adult interactions at scale. The code and model weights are publicly available.
Primary: University of Southern California
All Institutions: University of California, David Geffen School of Medicine, Weill Institute for Neurosciences, University of Southern California, Viterbi School of Engineering
The paper presents a unified framework for joint ASR and speaker role diarization, significantly improving the accuracy and efficiency of transcribing child-adult interactions. The methodology is innovative, addressing key challenges in the field and demonstrating substantial technical contributions through rigorous experimentation.
The proposed methodology integrates automatic speech recognition (ASR) and speaker role diarization into a unified end-to-end framework using the Whisper architecture. The approach is innovative in its serialized output training scheme, which allows for the simultaneous prediction of speaker tags, timestamps, and transcriptions, thereby addressing the limitations of traditional cascaded systems that suffer from error propagation. The introduction of a lightweight diarization head and a state-machine-based forced decoding mechanism further enhances the model's robustness and output structure. Overall, the methodology is well-structured and leverages existing technologies effectively while introducing novel components that improve performance in child-adult interaction contexts.
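To make the forced-decoding idea concrete, the following sketch shows one way a finite-state constraint over decoder logits could enforce the speaker-tag / start-timestamp / text / end-timestamp structure described above. The vocabulary layout, token ids, and tag names are illustrative assumptions, not the paper's actual implementation on the Whisper vocabulary.

```python
from enum import Enum, auto
import torch

# Hypothetical vocabulary layout: ids 1-2 are speaker tags, 3-102 are timestamp
# bins, and everything else is ordinary text. The real Whisper vocabulary differs.
SPEAKER_IDS = {1, 2}                 # e.g. <child>, <adult>
TIMESTAMP_IDS = set(range(3, 103))

class State(Enum):
    EXPECT_SPEAKER = auto()   # a segment must open with a speaker tag
    EXPECT_START = auto()     # then a start timestamp
    IN_TEXT = auto()          # then text tokens until an end timestamp

def allowed_ids(state, vocab_size):
    if state is State.EXPECT_SPEAKER:
        return SPEAKER_IDS
    if state is State.EXPECT_START:
        return TIMESTAMP_IDS
    text_ids = set(range(vocab_size)) - SPEAKER_IDS - TIMESTAMP_IDS
    return text_ids | TIMESTAMP_IDS   # text or an end timestamp

def constrained_step(state, logits):
    """Mask structurally invalid tokens, pick the best remaining one,
    and advance the state machine accordingly."""
    mask = torch.full_like(logits, float("-inf"))
    ids = torch.tensor(sorted(allowed_ids(state, logits.numel())))
    mask[ids] = 0.0
    token = int((logits + mask).argmax())
    if state is State.EXPECT_SPEAKER:
        return token, State.EXPECT_START
    if state is State.EXPECT_START:
        return token, State.IN_TEXT
    next_state = State.EXPECT_SPEAKER if token in TIMESTAMP_IDS else State.IN_TEXT
    return token, next_state
```

Because invalid tokens are masked to negative infinity before selection, the emitted sequence is structurally valid by construction, which is the guarantee the authors attribute to their state-machine procedure.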
The experiments are comprehensive, utilizing two distinct datasets that reflect real-world child-adult interactions. The evaluation metrics, including multi-talker word error rates (mtWER) and diarization error rates (DER), are appropriate for assessing the model's performance. The results indicate significant improvements over baseline methods, demonstrating the effectiveness of the proposed framework. However, the paper could benefit from additional ablation studies to further clarify the contributions of each component.
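For readers unfamiliar with the metric, one common way to score speaker-attributed transcripts is to compute a word-level edit distance per speaker role and pool the errors; the paper's exact mtWER formulation (for example, how speaker-attribution errors are penalized) may differ from this simplified sketch.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def speaker_attributed_wer(refs, hyps):
    """refs/hyps: dicts mapping a speaker role ('child'/'adult') to its transcript.
    Errors are pooled over roles and normalized by total reference words."""
    errors, total = 0, 0
    for role, ref_text in refs.items():
        ref = ref_text.split()
        hyp = hyps.get(role, "").split()
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return errors / max(total, 1)
```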
The authors have made their code and model weights publicly available, which is a positive step towards reproducibility. However, detailed hyperparameter settings and training configurations could be more thoroughly documented to facilitate easier replication of results by other researchers.
One limitation is the potential for overfitting due to the relatively small size of the ADOS dataset, which may affect the generalizability of the model. Additionally, the paper does not extensively discuss the challenges faced during training or the specific conditions under which the model may fail, such as in cases of overlapping speech or extreme background noise.
The proposed framework has significant implications for developmental and clinical research, particularly in the context of assessing child language development and social communication patterns. By automating the transcription and diarization processes, the model can facilitate large-scale studies that were previously hindered by the labor-intensive nature of manual annotation. This could lead to more efficient data collection and analysis in clinical settings, ultimately benefiting research in child development and related fields. The paper presents a unified framework for joint ASR and speaker role diarization, significantly improving the accuracy and efficiency of transcribing child-adult interactions. The methodology is innovative, addressing key challenges in the field and demonstrating substantial technical contributions through rigorous experimentation.
Bangla, one of the most widely spoken languages, remains underrepresented in state-of-the-art automatic speech recognition (ASR) research, particularly under noisy and speaker-diverse conditions. This paper presents BanglaRobustNet, a hybrid denoising-attention framework built on Wav2Vec-BERT, designed to address these challenges. The architecture integrates a diffusion-based denoising module to suppress environmental noise while preserving Bangla-specific phonetic cues, and a contextual cross-attention module that conditions recognition on speaker embeddings for robustness across gender, age, and dialects. Trained end-to-end with a composite objective combining CTC loss, phonetic consistency, and speaker alignment, BanglaRobustNet achieves substantial reductions in word error rate (WER) and character error rate (CER) compared to Wav2Vec-BERT and Whisper baselines. Evaluations on Mozilla Common Voice Bangla and augmented noisy speech confirm the effectiveness of our approach, establishing BanglaRobustNet as a robust ASR system tailored to low-resource, noise-prone linguistic settings.
Primary: Ahsanullah University of Science and Technology
All Institutions: Ahsanullah University of Science and Technology
BanglaRobustNet represents a substantial advancement in automatic speech recognition for the Bangla language, introducing innovative techniques to enhance robustness in challenging acoustic conditions. The combination of phonetic-aware denoising and speaker-conditioned attention mechanisms is particularly noteworthy, addressing critical gaps in existing ASR systems for low-resource languages.
The methodology presented in BanglaRobustNet is innovative, combining a diffusion-based denoising module with a contextual cross-attention mechanism tailored for Bangla ASR. The design principles emphasize phonetic preservation and speaker adaptivity, which are critical for improving ASR performance in low-resource languages. The end-to-end training approach with a composite objective function is well-justified, addressing the unique challenges posed by Bangla phonetics and dialectal variations. However, while the architecture is robust, the paper could benefit from a more detailed discussion of how the denoising and attention mechanisms are implemented.
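The composite objective is only described at a high level. A plausible sketch, assuming an MSE-style phonetic-consistency term between denoised and clean features and a cosine-based speaker-alignment term (both assumptions, as are the weights), might look like the following.

```python
import torch.nn.functional as F

def composite_loss(log_probs, targets, in_lens, tgt_lens,
                   phon_denoised, phon_clean,
                   spk_emb_pred, spk_emb_ref,
                   w_ctc=1.0, w_phon=0.5, w_spk=0.3):
    """Weighted sum of CTC, a phonetic-consistency term, and a speaker-alignment
    term. The last two are plausible stand-ins; the paper's exact formulations
    and weights are not reproduced here."""
    # CTC over the recognizer's frame-level log-probabilities, shape (T, B, V)
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, blank=0)
    # phonetic consistency: denoised features should match clean-speech features
    phon = F.mse_loss(phon_denoised, phon_clean)
    # speaker alignment: predicted speaker embedding should match the reference
    spk = 1.0 - F.cosine_similarity(spk_emb_pred, spk_emb_ref, dim=-1).mean()
    return w_ctc * ctc + w_phon * phon + w_spk * spk
```

A sketch of this form also clarifies why the weighting matters: the auxiliary terms regularize the denoiser and the speaker conditioning without being allowed to dominate the CTC recognition objective.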
The experimental evaluation is thorough, utilizing diverse datasets that reflect real-world conditions, including both clean and noisy environments. The reported reductions in WER and CER compared to baseline models are significant, demonstrating the effectiveness of the proposed architecture. The use of statistical significance testing adds rigor to the results. However, the paper lacks a comprehensive ablation study that could provide deeper insights into the contributions of individual components of the architecture.
The paper provides a reasonable level of detail regarding the training infrastructure, hyperparameters, and evaluation protocols, which supports reproducibility. However, the absence of a publicly available code repository or demo limits the ability for other researchers to replicate the results independently. Including a link to a GitHub repository or similar would enhance reproducibility.
One limitation is the reliance on a relatively small dataset for training, which may affect the generalizability of the model to other dialects or noise conditions not represented in the training data. Additionally, the paper does not address potential biases in the training data or how they might impact the model's performance across different demographics.
The development of BanglaRobustNet has significant implications for improving access to technology for Bangla speakers, a demographic that has been underserved in the field of ASR. By enhancing speech recognition capabilities in noisy environments and across diverse speakers, this work could facilitate better communication tools and services for millions of users. Furthermore, the approach could inspire similar advancements in ASR for other low-resource languages. BanglaRobustNet represents a substantial advancement in automatic speech recognition for the Bangla language, introducing innovative techniques to enhance robustness in challenging acoustic conditions. The combination of phonetic-aware denoising and speaker-conditioned attention mechanisms is particularly noteworthy, addressing critical gaps in existing ASR systems for low-resource languages.