Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.
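The Fréchet (Audio) Distance figures quoted above compare Gaussian statistics of two embedding sets. As a point of reference, the standard formula FD = ||mu1 - mu2||^2 + Tr(Sigma1 + Sigma2 - 2(Sigma1 Sigma2)^(1/2)) can be sketched in NumPy; the embedding model itself is abstracted away and the data below is synthetic:

```python
import numpy as np

def _sqrtm_psd(mat):
    # Matrix square root of a symmetric PSD matrix via eigendecomposition.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    # Tr((sigma1 sigma2)^(1/2)) computed via the symmetric form
    # (sigma1^(1/2) sigma2 sigma1^(1/2))^(1/2), which has the same trace.
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Statistics from two synthetic "embedding" sets (real vs. generated):
rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))
fake = rng.normal(loc=0.5, size=(1000, 8))
fd = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                      fake.mean(0), np.cov(fake, rowvar=False))
```

In practice the embeddings come from a fixed audio classifier (e.g. the VGGish features behind FAD), so the absolute numbers are only comparable when the same embedding model is used.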
Primary: Tencent AI Lab
All Institutions: Tencent AI Lab
The main contribution of this paper is the introduction of SemanticVocoder, a generative vocoder that synthesizes audio waveforms from semantic latents, overcoming the limitations of traditional VAE-based approaches and demonstrating superior performance in both audio generation and understanding tasks. This work represents a significant step forward in bridging the gap between audio generation and understanding, with implications for various applications in the audio processing domain.
The paper introduces SemanticVocoder, a novel approach that replaces traditional VAE acoustic latents with semantic latents for audio generation. The methodology is well-structured, leveraging a flow-matching approach to synthesize waveforms directly from semantic representations, thus addressing the limitations of conventional VAE-based systems. The use of a pretrained MAE encoder for extracting semantic latents is a significant innovation, enabling the model to focus on high-level semantic information rather than low-level acoustic details. The proposed architecture effectively balances the optimization difficulty across the text-to-latent and latent-to-waveform stages, which is a notable advancement in the field.
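The flow-matching stage mentioned above can be illustrated with the standard linear-path conditional flow matching target; this is a generic sketch, not the paper's exact formulation, and the function names are hypothetical:

```python
import numpy as np

def cfm_target(x0, x1, t):
    """Linear-path conditional flow matching: interpolant and velocity target.

    x0: noise sample, x1: data sample (e.g. a waveform frame), t in [0, 1].
    A network v_theta(x_t, t, cond) is trained to regress u_t = x1 - x0,
    with cond being the semantic latent in SemanticVocoder's setting.
    """
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight-line path
    u_t = x1 - x0                   # constant target velocity along the path
    return x_t, u_t

def cfm_loss(v_pred, u_t):
    # Mean-squared regression of the predicted velocity onto the target.
    return float(np.mean((v_pred - u_t) ** 2))
```

At inference, integrating the learned velocity field from t = 0 to t = 1 (e.g. with a few Euler steps) carries noise to a waveform conditioned on the semantic latent.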
The experiments are comprehensive, utilizing multiple datasets (AudioCaps, AudioSet, and WavCaps) to evaluate the performance of SemanticVocoder against existing models. The reported results demonstrate superior performance in terms of Fréchet Distance and Fréchet Audio Distance, indicating that the model generates audio closer to real distributions. Additionally, the paper includes evaluations on audio understanding tasks, showcasing the discriminative power of semantic latents. However, the lack of subjective evaluation metrics is a minor drawback.
The paper provides detailed implementation specifics, including model architecture, training parameters, and datasets used, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease of reproduction for external researchers.
The paper acknowledges limitations such as dependency on the pretrained semantic encoder's performance, constraints on audio length generation, and the need for subjective evaluations to complement objective metrics. These factors could impact the model's applicability in real-world scenarios.
SemanticVocoder has the potential to significantly advance the field of audio generation and understanding by providing a unified framework that leverages semantic information. This could lead to improved applications in areas such as content creation, audio synthesis, and interactive media, thereby enhancing user experiences in various domains. The main contribution of this paper is the introduction of SemanticVocoder, a generative vocoder that synthesizes audio waveforms from semantic latents, overcoming the limitations of traditional VAE-based approaches and demonstrating superior performance in both audio generation and understanding tasks. This work represents a significant step forward in bridging the gap between audio generation and understanding, with implications for various applications in the audio processing domain.
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance--a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
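The text-only guidance described above blends logits from the two decoding modes. The abstract does not give the exact blending rule, so the sketch below uses a simple convex combination purely for illustration:

```python
import numpy as np

def blend_logits(logits_text_speech, logits_text_only, w):
    """Convex blend of next-token logits from the two decoding modes.

    w = 0 keeps the text-speech model's distribution; w = 1 falls back
    entirely to the text-only LLM. The exact blending rule in TADA is
    not reproduced here; this is an illustrative stand-in.
    """
    lts = np.asarray(logits_text_speech, dtype=float)
    lt = np.asarray(logits_text_only, dtype=float)
    return (1.0 - w) * lts + w * lt
```

Because the scheme toggles modality within a single context, the same model produces both logit streams, and the weight w trades speech-conditioned fluency against text-only linguistic competence.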
Primary: Dartmouth College
All Institutions: Dartmouth College, Hume AI
The paper presents TADA, a generative framework that synchronizes text and acoustic features for improved speech modeling. This innovative approach addresses key challenges in traditional TTS systems, offering a significant contribution to the field with its potential for high-fidelity, efficient speech synthesis.
The proposed methodology introduces a novel tokenization scheme that achieves one-to-one synchronization between text and acoustic features, which is a significant advancement over traditional fixed-frame-rate approaches. The use of a dual alignment mechanism and a flow matching head within a large language model framework allows for efficient and high-fidelity speech synthesis. The architecture is well-structured, leveraging a combination of variational autoencoders and transformer-based models, which enhances the model's ability to handle both modalities concurrently. The approach to mitigate the modality gap through Speech Free Guidance (SFG) is particularly innovative, allowing for flexible integration of text and speech modalities.
The experiments conducted demonstrate the effectiveness of the proposed model against state-of-the-art TTS and SLM systems. The authors provide a comprehensive evaluation using multiple datasets and metrics, including character error rate, speaker similarity, and subjective evaluations of naturalness. The results indicate that the proposed method not only matches but often exceeds the performance of existing models, particularly in terms of reducing content hallucinations and improving inference efficiency. The extensive dataset used for training and evaluation further supports the robustness of the findings.
The paper includes a link to the GitHub repository containing the code and pre-trained models, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific configurations could be elaborated further to enhance clarity for future researchers attempting to replicate the results.
One limitation noted is the potential for speaker drifting during long-form generation, which suggests that while the model performs well in many scenarios, there are still challenges in maintaining speaker consistency over extended outputs. Additionally, the subjective evaluations indicate that while the model performs competitively, there is room for improvement in perceptual audio quality.
The implications of this research are significant for the field of speech synthesis and spoken language modeling. The ability to generate high-fidelity speech with reduced hallucinations and improved efficiency can enhance applications in voice cloning, virtual assistants, and interactive AI systems. The methodology could also pave the way for further advancements in multimodal AI systems, where seamless integration of text and speech is crucial.
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
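The SI-SDR metric reported throughout can be computed in a few lines of NumPy; this is the textbook definition, not the authors' evaluation code:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to isolate the target component;
    # the optimal scaling makes the metric invariant to estimate gain.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) /
                           (np.dot(noise, noise) + eps))
```

The scale invariance is the point: rescaling the separated signal leaves the score unchanged, so the 1.6-2.2 dB gains reported for PS2 reflect genuine distortion reduction rather than gain matching.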
Primary: Tampere University
All Institutions: Tampere University, Nokia Technologies
The paper presents a novel dual-branch architecture for moving speaker separation that effectively addresses challenges in multi-channel speech processing. The methodology is well-founded and the experimental results demonstrate substantial improvements over existing methods, highlighting its potential impact in practical applications.
The proposed dual-branch parallel spectral-spatial (PS2) architecture represents a significant methodological advancement in the field of multi-channel speech separation. By separating the processing of spectral and spatial features, the authors effectively address the inherent modeling conflicts present in existing sequential architectures. The use of bi-directional long short-term memory (BLSTM) and bi-directional gated recurrent unit (BGRU) networks, along with a cross-attention fusion mechanism, allows for more nuanced feature extraction and integration. This approach is well-grounded in the theoretical understanding of the different temporal scales at which spectral and spatial features evolve, making it a thoughtful and innovative contribution to the field.
The experimental setup is robust, utilizing multiple datasets including WHAMR! and a newly generated WSJ0-Demand-6ch-Move dataset specifically designed for moving speaker scenarios. The results demonstrate clear improvements over state-of-the-art methods, with significant gains in scale-invariant signal-to-distortion ratio (SI-SDR) across varying acoustic conditions. The ablation studies provide valuable insights into the contributions of each component of the PS2 architecture, reinforcing the importance of the dual-branch design. However, the paper could benefit from a more detailed analysis of the computational efficiency and potential trade-offs involved in the proposed architecture.
The paper provides a comprehensive description of the architecture, training configurations, and datasets used, which supports reproducibility. However, the absence of publicly available code or a project URL limits the ability for others to replicate the results directly. Future work could include releasing the model and training scripts to enhance reproducibility within the research community.
One limitation of the study is the reliance on synthetic datasets for moving speaker scenarios, which may not fully capture the complexities of real-world environments. Additionally, while the model shows strong performance in various conditions, the authors do not extensively discuss its limitations in extreme acoustic scenarios or potential failure modes. The evaluation metrics primarily focus on SI-SDR, which, while important, may not encompass all aspects of speech quality and intelligibility.
The advancements in multi-channel speech separation have significant implications for various applications, including voice recognition systems, hearing aids, and telecommunication technologies. The ability to effectively separate moving speakers in dynamic environments could enhance user experiences in real-world applications, making this research particularly relevant in the context of increasing reliance on audio processing technologies in daily life.
Self-supervised learning (SSL) has transformed speech processing, with benchmarks such as SUPERB establishing fair comparisons across diverse downstream tasks. Despite its security-critical importance, audio deepfake detection has remained outside these efforts. In this work, we introduce Spoof-SUPERB, a benchmark for audio deepfake detection that systematically evaluates 20 SSL models spanning generative, discriminative, and spectrogram-based architectures. We evaluate these models on multiple in-domain and out-of-domain datasets. Our results reveal that large-scale discriminative models such as XLS-R, UniSpeech-SAT, and WavLM Large consistently outperform other models, benefiting from multilingual pretraining, speaker-aware objectives, and model scale. We further analyze the robustness of these models under acoustic degradations, showing that generative approaches degrade sharply, while discriminative models remain resilient. This benchmark establishes a reproducible baseline and provides practical insights into which SSL representations are most reliable for securing speech systems against audio deepfakes.
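The benchmark's headline metric, Equal Error Rate, is the operating point where the false-acceptance and false-rejection rates cross. A minimal NumPy sketch that sweeps every observed score as a threshold:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER from detection scores (higher score = more likely bona fide).

    labels: 1 for bona fide, 0 for spoof. Returns (FAR + FRR) / 2 at the
    threshold where the two error rates are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    candidates = []
    for t in np.sort(np.unique(scores)):
        far = float(np.mean(scores[labels == 0] >= t))  # spoofs accepted
        frr = float(np.mean(scores[labels == 1] < t))   # bona fide rejected
        candidates.append((abs(far - frr), (far + frr) / 2.0))
    return min(candidates)[1]
```

Production toolkits interpolate between thresholds rather than picking the nearest crossing, but on large evaluation sets the difference is negligible.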
Primary: University of Michigan
All Institutions: University of Michigan
The paper presents Spoof-SUPERB, a benchmark for evaluating SSL models in audio deepfake detection, filling a critical gap in the literature. The technical contributions are significant, providing a systematic framework for assessing model performance and robustness, which is essential for advancing the field of speech processing in the context of security.
The paper introduces a novel benchmarking framework, Spoof-SUPERB, specifically designed for evaluating self-supervised learning (SSL) models in the context of audio deepfake detection. The methodology is well-structured, utilizing a unified protocol for training and evaluation across multiple datasets, which enhances comparability. The choice of models and the systematic evaluation of their performance under various conditions, including acoustic degradations, is a significant strength. However, the paper could benefit from a more detailed discussion of the specific training and evaluation protocols used, as well as the rationale behind the selection of datasets.
The experiments are comprehensive, involving 20 different SSL models evaluated on multiple datasets, which provides a robust analysis of model performance. The results clearly demonstrate the superiority of large-scale discriminative models over generative ones, particularly in terms of resilience to noise and other acoustic degradations. The use of Equal Error Rate (EER) as a performance metric is appropriate for the task, although additional metrics could provide a more nuanced view of model performance.
The paper emphasizes reproducibility by establishing a fixed training setup and evaluation protocol, which is crucial for benchmarking in machine learning. However, the absence of a publicly accessible code repository or detailed implementation guidelines limits the ability of other researchers to reproduce the results. Providing such resources would significantly enhance the paper's impact.
One limitation is the potential overlap between the pretraining data of some models and the evaluation datasets, which could bias the results. Additionally, while the paper addresses robustness under various acoustic conditions, it does not explore the implications of different synthesis methods for audio deepfakes, which could be a critical area for future research.
The introduction of a standardized benchmark for audio deepfake detection is a timely contribution, given the increasing prevalence of deepfake technologies and their implications for security and trust in audio communications. This work could pave the way for further advancements in antispoofing techniques and the development of more secure speech processing systems.
A primary challenge in developing synthetic spatial hearing systems, particularly underwater, is accurately modeling sound scattering. Biological organisms achieve 3D spatial hearing by exploiting sound scattering off their bodies to generate location-dependent interaural level and time differences (ITD/ILD). While Head-Related Transfer Function (HRTF) models based on rigid scattering suffice for terrestrial humans, they fail in underwater environments due to the near-impedance match between water and soft tissue. Motivated by the acoustic anatomy of underwater animals, we introduce a novel, analytically derived, closed-form forward model for scattering from a semi-transparent sphere containing two rigid spherical scatterers. This model accurately maps source direction, frequency, and material properties to the pressure field, capturing the complex physics of layered, penetrable structures. Critically, our model is implemented in a fully differentiable setting, enabling its integration with a machine learning algorithm to optimize a cost function for active localization. We demonstrate enhanced convergence for localization under noise using a physics-informed frequency weighting scheme, and present accurate moving-source tracking via an Extended Kalman Filter (EKF) with analytically computed Jacobians. Our work suggests that differentiable models of scattering from layered rigid and transparent geometries offer a promising new foundation for microphone arrays that leverage scattering-based spatial cues over conventional beamforming, applicable to both terrestrial and underwater applications. Our model will be made open source.
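The EKF tracking mentioned above follows the textbook measurement update, with the Jacobian supplied analytically by the closed-form scattering model. A generic sketch, with the paper's state and measurement definitions abstracted into h and H:

```python
import numpy as np

def ekf_update(x, P, z, h, H, R):
    """One EKF measurement update.

    x, P: prior state mean and covariance; z: measurement; h(x): measurement
    function; H: its Jacobian evaluated at x; R: measurement noise covariance.
    (Textbook form; the paper derives H analytically from the scattering model
    instead of using finite differences.)
    """
    y = z - h(x)                          # innovation
    S = H @ P @ H.T + R                   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Demo: a linear position measurement pulls the state toward z.
x0 = np.zeros(2)
P0 = np.eye(2)
H = np.eye(2)
R = 0.1 * np.eye(2)
z = np.array([1.0, -1.0])
x1, P1 = ekf_update(x0, P0, z, lambda s: H @ s, H, R)
```

The analytic Jacobian is what makes the filter cheap: each update needs H at the current state estimate, and differentiable closed-form scattering provides it exactly rather than numerically.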
Primary: University of Maryland
All Institutions: University of Maryland, SDU, Reality Lab
The paper introduces a novel differentiable multi-sphere scattering model for underwater spatial audio cues, bridging the gap between biological principles and machine learning applications. The comprehensive methodology and robust experimental validation underscore its potential impact on acoustic sensing technologies.
The paper presents a novel analytical framework for modeling sound scattering in underwater environments, utilizing a differentiable multi-sphere scattering model. The approach is grounded in biological principles of spatial hearing and employs multipole expansions to derive a closed-form solution. The implementation in a differentiable programming framework (JAX) allows for efficient gradient-based optimization, which is a significant advancement over traditional methods that do not provide gradients. This differentiability enables the integration of the model with machine learning algorithms for active localization, showcasing a well-thought-out methodology that bridges physics-based modeling and machine learning.
The experiments conducted validate the proposed model through simulations that demonstrate its ability to accurately capture interaural level differences (ILD) and interaural time differences (ITD) under various conditions. The results show that the model effectively generates realistic binaural cues and performs robustly in source localization tasks, even under noise. The use of an Extended Kalman Filter (EKF) for tracking moving sources further emphasizes the practical applicability of the model. The experiments are comprehensive, covering various source directions and noise levels, which strengthens the findings.
While the paper mentions that the model will be made open source, specific details regarding the implementation and access to the code are not provided. This lack of direct access to the code and data limits the reproducibility of the results. However, the detailed methodology and equations presented allow for potential replication by researchers with sufficient expertise in the field.
The primary limitation of the study is the reliance on a simplified geometric model that may not capture all complexities of real-world underwater environments. Additionally, the experiments are conducted in a controlled simulation setting, which may not fully represent the challenges faced in practical applications, such as reverberation and multi-source scenarios. The model's performance in more complex acoustic environments remains to be tested.
This research has significant implications for the development of advanced acoustic sensing systems, particularly in underwater environments where traditional methods struggle. The ability to accurately model sound scattering and utilize spatial cues for localization can enhance various applications, including marine biology research, underwater navigation, and surveillance. The open-source nature of the model could foster further research and development in this area, promoting collaboration and innovation.
Dysarthric speech exhibits abnormal prosody and significant speaker variability, presenting persistent challenges for automatic speech recognition (ASR). While text-to-speech (TTS)-based data augmentation has shown potential, existing methods often fail to accurately model the pathological rhythm and acoustic style of dysarthric speech. To address this, we propose DARS, a dysarthria-aware rhythm-style synthesis framework based on the Matcha-TTS architecture. DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction and pathological acoustic style simulation. Experiments on the TORGO dataset demonstrate that DARS achieves a Mean Cepstral Distortion (MCD) of 4.29, closely approximating real dysarthric speech. Adapting a Whisper-based ASR system with synthetic dysarthric speech from DARS achieves a 54.22% relative reduction in word error rate (WER) compared to state-of-the-art methods, demonstrating the framework's effectiveness in enhancing recognition performance.
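The MCD figure quoted above is a frame-averaged cepstral distance. A common formulation (excluding the energy coefficient and assuming time-aligned frames; conventions vary across papers, so treat the constants as one of several in use) can be sketched as:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_syn):
    """Frame-averaged MCD in dB between two mel-cepstral sequences.

    Shapes: (frames, coeffs). The 0th coefficient (energy) is excluded,
    as is common practice; the sequences are assumed already time-aligned
    (e.g. via dynamic time warping).
    """
    diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

Lower is better; the reported MCD of 4.29 indicates the synthesized cepstra sit close to those of real dysarthric recordings.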
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFlytek Co., Ltd., Huawei Technology
The main contribution of this paper is the development of the DARS framework, which effectively synthesizes dysarthric speech to enhance automatic speech recognition performance, addressing a critical gap in assistive technology for individuals with speech impairments. The combination of innovative methodologies and rigorous experimental validation positions this work as a significant advancement in the field of speech synthesis and recognition.
The DARS framework introduces innovative mechanisms for synthesizing dysarthric speech, specifically a multi-stage rhythm predictor and a dysarthria-aware conditional flow matching mechanism. The use of contrastive preference optimization to guide the rhythm predictor is particularly novel, as it directly addresses the variability in dysarthric speech patterns. The integration of pause modeling and acoustic style vectors enhances the synthesis quality, making the approach well-suited for the complexities of dysarthric speech.
The paper presents a thorough experimental evaluation using the TORGO dataset, demonstrating the effectiveness of the DARS framework in enhancing ASR performance. The reported results, including a 54.22% relative reduction in WER, indicate significant improvements over existing methods. The experiments are well-structured, comparing multiple training strategies and adaptation techniques, which adds robustness to the findings.
While the paper provides a detailed description of the methodology and experimental setup, the absence of URLs for code or demo pages limits reproducibility. Clearer documentation or supplementary materials would enhance the ability for others to replicate the results.
The study relies on a limited dataset (TORGO), which may affect the generalizability of the results. Additionally, while the framework shows promise, the performance on more diverse dysarthric speech samples and real-world scenarios remains to be validated.
The DARS framework has the potential to significantly improve communication aids for individuals with dysarthria, enhancing their quality of life. By improving ASR systems' ability to recognize dysarthric speech, this research could facilitate better interaction and accessibility for affected individuals in various settings.
Dysarthric speech reconstruction (DSR) typically employs a cascaded system that combines automatic speech recognition (ASR) and sentence-level text-to-speech (TTS) to convert dysarthric speech into normally-prosodied speech. However, dysarthric individuals often speak more slowly, leading to excessively long response times in such systems, rendering them impractical in long-speech scenarios. Cascaded DSR systems based on streaming ASR and incremental TTS can help reduce latency. However, patients with differing dysarthria severity exhibit substantial pronunciation variability for the same text, resulting in poor robustness of ASR and limiting the intelligibility of reconstructed speech. In addition, incremental TTS suffers from poor prosodic feature prediction due to a limited receptive field. In this study, we propose an end-to-end simultaneous DSR system with two key innovations: 1) A frame-level adaptor module is introduced to bridge ASR and TTS. By employing explicit-implicit semantic information fusion and joint module training, it enhances the error tolerance of TTS to ASR outputs. 2) A multiple wait-k autoregressive TTS module is designed to mitigate prosodic degradation via multi-view knowledge distillation. Our system has an average response time of 1.03 seconds on Tesla A100, with an average real-time factor (RTF) of 0.71. On the UASpeech dataset, it attains a mean opinion score (MOS) of 4.67 and demonstrates a 54.25% relative reduction in word error rate (WER) compared to the state-of-the-art. Our demo is available at: https://wflrz123.github.io/
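The wait-k strategy behind the TTS module reads k source tokens before emitting the first output, then alternates reads and writes. A minimal simulation of the schedule, where the generate callback stands in for the TTS model and the 1:1 source/output length is an assumption made for illustration:

```python
def wait_k_schedule(source, k, generate):
    """Simulate a wait-k policy over a token stream.

    Reads the first k source tokens before the first write, then emits one
    output per additional read; once the source is exhausted, the remaining
    outputs are flushed. `generate(context, n_emitted)` returns the next
    output token given the source read so far (a stand-in for the model).
    """
    outputs = []
    for i in range(1, len(source) + 1):
        if i >= k:                      # enough context has been read
            outputs.append(generate(source[:i], len(outputs)))
    while len(outputs) < len(source):   # flush after the source ends
        outputs.append(generate(source, len(outputs)))
    return outputs
```

Larger k widens the receptive field (better prosody prediction) at the cost of latency, which is the trade-off the paper's multiple wait-k distillation is designed to soften.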
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFlytek Co., Ltd.
The paper presents a novel end-to-end simultaneous dysarthric speech reconstruction system that effectively addresses the challenges of intelligibility and latency through innovative methodologies. The technical contributions are significant, with promising experimental results that indicate a meaningful advancement in the field of speech processing for individuals with speech impairments.
The proposed end-to-end simultaneous dysarthric speech reconstruction (E2E-SDSR) system introduces innovative components such as a frame-level adaptor module and a multiple wait-k autoregressive TTS module. The frame-level adaptor effectively bridges the gap between ASR and TTS, enhancing the robustness of the system against ASR errors through explicit-implicit semantic information fusion. The multiple wait-k strategy in the TTS module allows for flexibility in processing, balancing latency and prosody quality. The methodology is well-structured, with a clear focus on addressing the unique challenges posed by dysarthric speech, particularly in terms of intelligibility and naturalness.
The experiments are comprehensive, utilizing both a commercial dysarthric speech dataset and the UASpeech dataset. The reported results, including a mean opinion score (MOS) of 4.67 and a 54.25% reduction in word error rate (WER), demonstrate significant improvements over existing methods. The ablation studies provide valuable insights into the contributions of each component of the proposed system, reinforcing the effectiveness of the adaptor and wait-k strategies.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as hyperparameter settings, training duration, and the exact architecture configurations used. The absence of a publicly available code repository limits the reproducibility of the results.
The study primarily focuses on dysarthric speech and may not generalize well to other speech disorders or languages. Additionally, the reliance on a limited dataset for training and testing could affect the robustness of the model in real-world applications. The paper does not address potential biases in the dataset or the implications of using commercial data.
The proposed system has the potential to significantly improve communication for individuals with dysarthria, enhancing their quality of life and social interactions. By providing a more efficient and intelligible speech reconstruction method, it could be applied in various assistive technologies and communication devices.
This report details our submission to the CHiME-9 MCoRec Challenge on recognizing and clustering multiple concurrent natural conversations within indoor social settings. Unlike conventional meetings centered on a single shared topic, this scenario contains multiple parallel dialogues--up to eight speakers across up to four simultaneous conversations--with a speech overlap rate exceeding 90%. To tackle this, we propose a multimodal cascaded system that leverages per-speaker visual streams extracted from synchronized 360-degree video together with single-channel audio. Our system improves three components of the pipeline by leveraging enhanced audio-visual pretrained models: Active Speaker Detection (ASD), Audio-Visual Target Speech Extraction (AVTSE), and Audio-Visual Speech Recognition (AVSR). The AVSR module further incorporates Whisper and LLM techniques to boost transcription accuracy. Our best single cascaded system achieves a Speaker Word Error Rate (WER) of 32.44% on the development set. By further applying ROVER to fuse outputs from diverse front-end and back-end variants, we reduce Speaker WER to 31.40%. Notably, our LLM-based zero-shot conversational clustering achieves a speaker clustering F1 score of 1.0, yielding a final Joint ASR-Clustering Error Rate (JACER) of 15.70%.
Primary: University of Science and Technology of China
All Institutions: Anhui University, Lomonosov Moscow State University, iFLYTEK Research, Shaanxi Normal University, University of Science and Technology of China, iFLYTEK Co
This paper makes a notable contribution to the field of audio-visual speech recognition and clustering by proposing an integrated framework that effectively addresses the challenges posed by overlapping conversations in complex acoustic environments. The technical contributions, particularly the innovative use of LLMs for conversation clustering, position this work as a significant advancement in the domain.
The paper presents a sophisticated multimodal cascaded system that integrates audio-visual data to tackle the complex problem of recognizing and clustering multiple concurrent conversations. The methodology is robust, employing a two-stage transfer learning strategy for Active Speaker Detection (ASD) and a comprehensive approach for Audio-Visual Target Speech Extraction (AVTSE) and Audio-Visual Speech Recognition (AVSR). The incorporation of large language models (LLMs) for conversational clustering is particularly innovative, leveraging semantic understanding to enhance accuracy. The use of diverse datasets for training and the detailed architecture of each system component demonstrate a thorough and well-thought-out methodology.
The experiments are extensive, with a clear focus on evaluating the performance of each component in the pipeline. The results indicate significant improvements over baseline models, particularly in Speaker Word Error Rate (WER) and Joint ASR-Clustering Error Rate (JACER). The paper provides detailed comparisons across different system configurations, showcasing the effectiveness of the proposed methods. However, the absence of a comprehensive ablation study to isolate the contributions of each component limits the depth of the evaluation.
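The ROVER fusion used to reach the 31.40% Speaker WER aligns hypotheses into a word transition network and then votes; the toy sketch below skips the alignment step and assumes the hypotheses are already position-aligned, so it only illustrates the voting half:

```python
from collections import Counter

def rover_vote(hypotheses):
    """Toy ROVER-style fusion over pre-aligned hypotheses.

    Real ROVER first builds a word transition network via iterative
    alignment; here we simply take the majority word at each position.
    """
    fused = []
    for words in zip(*hypotheses):
        fused.append(Counter(words).most_common(1)[0][0])
    return fused
```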
While the paper outlines the methodologies and datasets used, it lacks specific implementation details that would aid in reproducing the results. There are no links to code repositories or supplementary materials, which is a significant drawback for reproducibility in machine learning research.
The paper does not address potential limitations of the proposed systems, such as the computational cost associated with using large language models and the challenges of real-time application in practical scenarios. Additionally, the reliance on extensive datasets may not be feasible for all research groups, limiting the accessibility of the proposed methods.
The work has significant implications for applications in real-world scenarios involving multi-party conversations, such as meetings, conferences, and social interactions. The ability to accurately recognize and cluster overlapping speech can enhance communication technologies, assistive devices, and automated transcription services, contributing to advancements in human-computer interaction.
We introduce VietSuperSpeech, a large-scale Vietnamese automatic speech recognition (ASR) dataset of 52,023 audio-text pairs totaling 267.39 hours, with a distinctive focus on casual conversational speech. Unlike existing Vietnamese ASR corpora that predominantly feature read speech, news narration, or audiobook content, VietSuperSpeech is sourced from four publicly accessible YouTube channels spanning everyday conversation, personal vlogging, overseas Vietnamese community dialogue, and informal commentary - the very speech styles encountered in real-world chatbot, customer support, call center, and hotline deployments. All audio is standardized to 16 kHz mono PCM WAV and segmented into 3-30 second utterances. Transcriptions are generated via pseudo-labeling using the Zipformer-30M-RNNT-6000h model (Nguyen, 2025) deployed through Sherpa-ONNX, pre-trained on 6,000 hours of Vietnamese speech. After quality filtering, the dataset is split into 46,822 training samples (240.67 hours) and 5,201 development/test samples (26.72 hours) with a fixed random seed. The text averages 266 characters per utterance, totaling 13.8 million fully diacritically marked Vietnamese characters. We demonstrate that VietSuperSpeech fills a critical gap in the Vietnamese ASR ecosystem: while corpora such as VLSP2020, VIET_BUD500, VietSpeech, FLEURS, VietMed, Sub-GigaSpeech2-Vi, viVoice, and Sub-PhoAudioBook provide broad coverage of formal and read speech, none specifically targets the casual, spontaneous register indispensable for conversational AI applications. VietSuperSpeech is publicly released at https://huggingface.co/datasets/thanhnew2001/VietSuperSpeech.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of VietSuperSpeech, a large-scale dataset specifically designed for casual conversational speech in Vietnamese, which fills a critical gap in the existing ASR corpus landscape. This dataset's unique focus on informal speech patterns and its potential applications in various conversational AI domains make it a significant resource for advancing ASR technology in low-resource languages.
The methodology of VietSuperSpeech is robust, focusing on the collection of conversational speech from diverse YouTube channels, which is a significant departure from existing datasets that primarily feature formal speech. The use of pseudo-labeling through the Zipformer-30M-RNNT-6000h model is well-justified, and the quality control measures implemented during transcription generation strengthen the dataset's reliability. However, the paper could benefit from a more detailed description of the pseudo-labeling quality assessment process and the specific metrics used to evaluate the performance of the ASR model on this dataset.
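The fixed-seed train/development split described in the abstract can be reproduced with a deterministic shuffle; a minimal sketch (the seed value and split fraction below are illustrative, not the dataset's actual parameters):

```python
import random

def seeded_split(items, dev_fraction=0.1, seed=42):
    """Deterministic train/dev split: same seed, same split every run."""
    rng = random.Random(seed)       # local RNG, does not touch global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]
```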
The paper does not provide extensive experimental results demonstrating the effectiveness of the VietSuperSpeech dataset in improving ASR performance in conversational contexts. While the authors discuss the dataset's intended applications and the acoustic properties of the speech it contains, empirical validation through experiments that compare ASR performance on this dataset versus existing corpora would significantly enhance the paper's impact.
The authors have made the dataset publicly available, which is a positive step towards reproducibility. The details regarding the audio preprocessing and pseudo-labeling pipeline are adequately described, allowing other researchers to replicate the dataset creation process. However, the lack of shared experimental results or code for training ASR models on this dataset limits the overall reproducibility of the findings.
The paper acknowledges several limitations, including the potential for pseudo-label noise and the demographic balance of the speaker population. The dataset's reliance on YouTube content may also restrict its representativeness of all conversational registers, particularly in specialized domains. Additionally, the authors note that the dataset may not fully capture the nuances of highly noisy environments typical in call centers.
VietSuperSpeech has significant implications for the development of ASR systems in Vietnamese, particularly for applications in customer support, chatbots, and IVR systems. By addressing the gap in conversational speech datasets, it provides a valuable resource for researchers and practitioners aiming to improve ASR performance in real-world scenarios. The dataset's public availability encourages further research and development in this area, potentially leading to advancements in Vietnamese language technology.
REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. First, we find that the layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, a phenomenon we call Store-Contribute Dissociation (SCD). To turn this insight into actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou)
The paper presents AG-REPA, a framework that enhances audio generation quality by focusing on causal contributions of layers rather than mere representational richness. This innovative approach not only improves training efficiency but also contributes to the interpretability of generative models, marking a meaningful advancement in the field of machine learning.
The paper introduces a novel methodology called Attribution-Guided REPresentation Alignment (AG-REPA), which emphasizes causal layer selection for representation alignment in audio flow matching. This approach is grounded in the theoretical concept of Store-Contribute Dissociation (SCD), which reveals that layers rich in semantic information do not necessarily contribute most to the generative process. The methodology includes a forward-only gate ablation (FoG-A) to quantify each layer's causal contribution, allowing for adaptive layer selection and weighting. This is a significant advancement over traditional heuristic methods, providing a more principled basis for layer selection in generative models.
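The gist of forward-only gate ablation can be shown on a toy residual stack: gate off one layer's branch at a time and measure the induced change in the final output. This is only an illustration of the idea; the actual FoG-A operates on the flow model's predicted velocity field, and the simple residual architecture below is an assumption of the sketch:

```python
import numpy as np

def layer_contributions(layers, x):
    """Toy FoG-A: for each layer i, run the stack with layer i's residual
    branch gated to zero and report the L2 change in the final output.

    Assumes a plain residual stack h <- h + f(h); a larger change means
    the gated layer contributes more causally to the output.
    """
    def forward(gated=None):
        h = x
        for i, f in enumerate(layers):
            if i != gated:
                h = h + f(h)
        return h

    base = forward()
    return [float(np.linalg.norm(base - forward(gated=i)))
            for i in range(len(layers))]
```

In this framing, a layer can carry rich features yet score near zero here, which is exactly the store/contribute dissociation the paper exploits for layer selection.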
The experiments are robust, utilizing well-established datasets such as LibriSpeech and AudioSet for unified speech and general audio training. The results demonstrate that AG-REPA consistently outperforms baseline methods, achieving significant reductions in Fréchet Audio Distance (FAD) and improvements in perceptual quality metrics like Word Error Rate (WER) and Mean Opinion Score (MOS). The comparative analysis against static REPA baselines and other alignment strategies provides strong empirical support for the proposed method.
The paper outlines a clear methodology and experimental setup, but lacks specific implementation details or code availability, which could hinder reproducibility. The authors mention a "probe-then-intervene" training protocol that separates diagnostic probing from optimization, which is a good practice for ensuring clean experimental conditions.
One limitation is the lack of external validation on diverse datasets beyond LibriSpeech and AudioSet, which may limit the generalizability of the findings. Additionally, while the paper discusses potential risks associated with high-fidelity audio generation, it does not provide a detailed framework for mitigating these risks in practical applications.
The work has significant implications for the field of audio generation, particularly in enhancing the intelligibility and quality of synthesized speech and audio. The interpretability toolkit developed in this study could also pave the way for more transparent and controllable generative models in AI, addressing some of the ethical concerns surrounding deepfake technologies.
Multi-track music generation has garnered significant research interest due to its precise mixing and remixing capabilities. However, existing models often overlook essential attributes such as rhythmic stability and synchronization, leading to a focus on differences between tracks rather than their inherent properties. In this paper, we introduce SyncTrack, a synchronous multi-track waveform music generation model designed to capture the unique characteristics of multi-track music. SyncTrack features a novel architecture that includes track-shared modules to establish a common rhythm across all tracks and track-specific modules to accommodate diverse timbres and pitch ranges. Each track-shared module employs two cross-track attention mechanisms to synchronize rhythmic information, while each track-specific module utilizes learnable instrument priors to better represent timbre and other unique features. Additionally, we enhance the evaluation of multi-track music quality by introducing rhythmic consistency through three novel metrics: Inner-track Rhythmic Stability (IRS), Cross-track Beat Synchronization (CBS), and Cross-track Beat Dispersion (CBD). Experiments demonstrate that SyncTrack significantly improves the multi-track music quality by enhancing rhythmic consistency.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology
The paper presents SyncTrack, a novel model for synchronous multi-track music generation that significantly enhances rhythmic stability and synchronization. The technical contributions, including innovative architecture and evaluation metrics, position this work as a meaningful advancement in the field of machine learning for audio applications.
The paper introduces SyncTrack, a novel architecture for multi-track music generation that effectively addresses rhythmic stability and synchronization through the integration of track-shared and track-specific modules. The use of cross-track attention mechanisms is innovative, allowing for both global and time-specific synchronization of rhythms across tracks. The proposed metrics for evaluating rhythmic consistency (IRS, CBS, CBD) are well-conceived and fill a significant gap in the assessment of multi-track music generation quality. The methodology is clearly articulated, with a logical flow from problem identification to solution proposal.
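One plausible formalization of cross-track beat synchronization, offered only as a sketch (the paper's CBS definition may differ in detail), scores the fraction of beats in one track that find a counterpart in another within a fixed tolerance:

```python
def cross_track_beat_sync(beats_a, beats_b, tol=0.05):
    """Illustrative CBS-style score: fraction of beats in track A (seconds)
    that have a beat in track B within `tol` seconds. Tolerance value is
    an assumption for the sketch."""
    if not beats_a:
        return 0.0
    matched = sum(1 for t in beats_a
                  if any(abs(t - u) <= tol for u in beats_b))
    return matched / len(beats_a)
```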
The experiments are comprehensive, utilizing both objective metrics (FAD, IRS, CBS, CBD) and subjective evaluations to validate the performance of SyncTrack against state-of-the-art baselines. The results demonstrate significant improvements in both rhythmic stability and synchronization, with clear statistical backing. The use of the Slakh2100 dataset is appropriate for the task, and the ablation studies provide insights into the contributions of different components of the model.
The paper includes detailed implementation information, including training configurations, datasets, and evaluation metrics, which enhances reproducibility. However, the absence of a demo or project URL limits accessibility for other researchers wishing to replicate the work.
While the proposed metrics are robust, the paper does not address potential limitations in the generalizability of SyncTrack across different musical genres or styles. Additionally, the reliance on specific datasets may introduce biases that affect the model's performance in broader applications.
The advancements in multi-track music generation have significant implications for the music industry, particularly in areas such as music production, remixing, and creative applications. By improving rhythmic stability and synchronization, SyncTrack could enhance the quality of generated music, making it more suitable for professional use.
Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.
Primary: Weizmann Institute of Science
All Institutions: Weizmann Institute of Science
The main contribution of this paper is the introduction of VoxKnesset, a large-scale longitudinal Hebrew speech dataset that enables the study of aging effects on speech, along with a comprehensive evaluation of modern speech embeddings in this context. This work represents a significant advancement in the field of speech processing, particularly for underrepresented languages and demographic studies.
The methodology employed in VoxKnesset is robust and well-structured, focusing on the creation of a large-scale longitudinal speech dataset specifically for Hebrew parliamentary speech. The authors detail a multi-stage alignment pipeline that addresses common issues in audio processing, such as timestamp inconsistencies and transcript normalization artifacts. The use of verified demographic metadata enhances the dataset's reliability. The longitudinal aspect of the dataset is particularly noteworthy, as it allows for the examination of vocal changes over time, a significant advancement over traditional cross-sectional datasets. The benchmarking of modern speech embeddings on age prediction and speaker verification is methodologically sound, providing a clear framework for evaluating the impact of aging on speech characteristics.
The experiments conducted are thorough and well-articulated, demonstrating the dataset's utility in real-world applications. The authors benchmark several state-of-the-art speech embeddings, providing a comprehensive analysis of their performance in both age prediction and speaker verification tasks. The results clearly show the degradation of speaker verification performance over time, highlighting the importance of longitudinal data in understanding vocal aging. The cross-dataset evaluations further validate the dataset's applicability and the robustness of the findings across different languages and contexts. The use of various metrics, including Mean Absolute Error (MAE) and Equal Error Rate (EER), adds depth to the evaluation.
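The Equal Error Rate used in the verification experiments is the operating point where the false-accept and false-reject rates coincide; a simple threshold-sweep approximation:

```python
def equal_error_rate(scores, labels):
    """Approximate EER: sweep each distinct score as a threshold and return
    the averaged error rate where |FAR - FRR| is smallest.

    labels: 1 = same-speaker (target) trial, 0 = different-speaker trial.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_gap, eer = float("inf"), 1.0
    for thresh in sorted(set(scores)):
        far = sum(1 for s, l in zip(scores, labels)
                  if l == 0 and s >= thresh) / n_neg  # false accepts
        frr = sum(1 for s, l in zip(scores, labels)
                  if l == 1 and s < thresh) / n_pos   # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```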
While the paper mentions that the dataset and processing pipeline will be publicly released, specific implementation details are somewhat lacking. The authors provide a general overview of their methods, but more granular details about the experimental setup, hyperparameters, and the exact processing steps would enhance reproducibility. Clear documentation and access to code would be beneficial for other researchers looking to replicate or build upon this work.
The paper acknowledges several limitations, including the dataset's focus on a single speech register (parliamentary debate) and a demographic skew towards older adults. Additionally, the authors note that recording conditions may have evolved over the 16 years, which could introduce confounding variables. The challenge of disentangling channel drift from biological aging is also recognized, indicating that further research is needed to fully understand these dynamics.
The VoxKnesset dataset has significant implications for various applications, including biometric security, automated transcription, and health diagnostics. By addressing the aging of vocal characteristics, this work could lead to more robust and reliable speech processing systems that can adapt to individual changes over time. The dataset's release will likely stimulate further research in Hebrew speech processing and aging-related studies, contributing to the broader field of machine learning and speech technology.
Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We introduce Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables. Our system has two key components: (1) a dynamic interface that surfaces only active sound classes and (2) a real-time, on-device multi-output extraction network that generates separate streams for each selected class, achieving robust performance for up to 5 overlapping target sounds, and letting users mix their environment by customizing per-class volumes, much like an audio engineer mixes tracks. We optimize the model architecture for multiple compute-limited platforms and demonstrate real-time performance on 6 ms streaming audio chunks. Across real-world environments in previously unseen indoor and outdoor scenarios, our system enables expressive per-class sound control and achieves substantial improvements in target-class enhancement and interference suppression. Our results show that the world need not be heard as a single, undifferentiated stream: with Aurchestra, the soundscape becomes truly programmable.
Primary: Paul G. Allen School of Computer Science and Engineering, University of Washington
All Institutions: Paul G. Allen School of Computer Science and Engineering, University of Washington, Hearvana AI
Aurchestra introduces a groundbreaking approach to soundscape control on hearables, enabling users to manipulate multiple sound sources independently in real-time. The combination of innovative methodology and practical applications positions this work as a significant contribution to the field of audio machine learning, although further enhancements in reproducibility and comparative analysis are necessary for broader acceptance.
The methodology presented in Aurchestra is innovative as it combines a dynamic interface with a real-time multi-output extraction network tailored for resource-constrained hearables. The authors detail a robust architecture that allows for the detection and manipulation of multiple overlapping sound classes, which is a significant advancement over traditional binary noise cancellation systems. The use of on-device processing is particularly noteworthy, as it addresses the limitations of latency and resource usage in mobile devices. However, the paper could benefit from a more detailed description of the model architecture and the specific algorithms used for sound class detection and extraction.
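The per-class mixing stage that follows extraction is conceptually simple: scale each extracted class stream by a user-chosen gain and sum them. A minimal sketch (the class names and dict-based interface are illustrative assumptions, not the system's API):

```python
import numpy as np

def mix_soundscape(streams, gains):
    """Mix extracted per-class streams with user-set linear gains.

    streams: {class_name: np.ndarray of samples}, all the same length.
    gains:   {class_name: linear gain}; classes absent from `gains`
             pass through at unity gain.
    """
    out = None
    for name, audio in streams.items():
        scaled = gains.get(name, 1.0) * audio
        out = scaled if out is None else out + scaled
    return out
```

Setting a class's gain to 0.0 mutes it entirely, which is how blunt global suppression becomes one point on a per-class continuum.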
The experimental evaluation is thorough, demonstrating the system's performance in real-world environments with diverse acoustic scenarios. The authors report substantial improvements in target-class enhancement and interference suppression, which are critical metrics for the effectiveness of soundscape control. However, the paper lacks a comprehensive comparison with existing methods, which would help contextualize the results and validate the claimed improvements. Additionally, further details on the datasets used for training and testing would enhance the credibility of the experimental findings.
The paper does not provide sufficient details regarding the implementation of the system, such as the specific datasets, training procedures, or hyperparameters used in the model. This lack of transparency could hinder reproducibility, which is a crucial aspect of machine learning research. Including a supplementary material section or a dedicated repository with code and data would significantly improve this aspect.
While the system shows promising results, there are limitations that need to be addressed. The performance in highly dynamic or noisy environments is not fully explored, and the scalability of the approach to more than five overlapping sounds is unclear. Additionally, the user interface design and user experience aspects are briefly mentioned but not thoroughly evaluated, which could impact the system's practical usability.
Aurchestra has the potential to revolutionize personal audio experiences, making it applicable in various fields such as augmented reality, hearing aids, and smart environments. By allowing users to customize their auditory experiences in real-time, it could enhance accessibility for individuals with hearing impairments and improve the overall quality of life in noisy urban settings. The implications for privacy and user control over their auditory environment are also significant, as this technology could empower users to manage their soundscapes actively.
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
Primary: Queen Mary University of London
All Institutions: Queen Mary University of London, Peking University, Technical University of Munich, Beijing University of Posts and Telecommunications, Soochow University, University of Manchester, Hong Kong University of Science and Technology
This paper presents a comprehensive framework for evaluating music generation models through a novel benchmark and datasets, significantly advancing the state of the art in music reward modeling. The methodology is rigorous, and the results demonstrate a clear improvement in aligning model outputs with human preferences, making it a valuable contribution to the field of machine learning and music technology.
The paper introduces a novel framework for evaluating music generation models using Compositional Multimodal Instruction (CMI). It constructs two datasets—CMI-Pref-Pseudo and CMI-Pref—alongside a unified benchmark, CMI-RewardBench, which assesses models on multiple dimensions of musicality and alignment. The methodology is robust, utilizing both pseudo-labeling and expert annotations, and employs a parameter-efficient architecture for the reward models, allowing for effective processing of heterogeneous inputs. The two-stage training strategy enhances the model's performance by leveraging both large-scale pseudo-labeled data and high-quality human annotations.
The experiments are comprehensive, demonstrating the effectiveness of the proposed CMI-RM against existing baselines across various tasks. The results show strong correlations with human judgments, indicating that the proposed models can effectively evaluate music generation quality. The paper provides detailed metrics and comparisons, showcasing the advantages of the CMI-RewardBench in capturing the nuances of human preferences in music generation.
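The inference-time scaling via top-k filtering described in the abstract amounts to best-of-k selection under the reward model; a minimal sketch with a stand-in reward function:

```python
def best_of_k(candidates, reward_fn, k=4):
    """Best-of-k selection: score the first k candidate generations with a
    reward model and return the highest-scoring one. `reward_fn` stands in
    for a trained reward model such as a CMI-RM."""
    pool = candidates[:k]
    return max(pool, key=reward_fn)
```

Raising k spends more generation compute per output in exchange for higher expected reward, which is the scaling behavior the paper measures.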
The authors have made their datasets, benchmark, and model weights publicly available, which enhances reproducibility. The detailed methodology, including the training protocols and evaluation metrics, is well-documented, allowing other researchers to replicate the experiments. However, the reliance on specific models for pseudo-labeling may introduce variability that is not fully accounted for.
One limitation is the potential bias in the pseudo-labeling process, which may affect the quality of the training data. Additionally, while the framework addresses the complexity of multimodal inputs, the evaluation may still be subjective, as musicality and alignment can vary significantly based on individual listener preferences. The paper also does not extensively discuss the scalability of the approach to larger datasets or different musical genres.
This work has significant implications for the field of music generation and evaluation, providing a structured approach to assess models that can handle complex multimodal inputs. The availability of the datasets and benchmark can spur further research in aligned music generation and improve the quality of AI-generated music in commercial applications. The methodology could also be adapted for use in other creative domains where multimodal inputs are prevalent.
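Preference-based reward models of this kind are commonly trained with a Bradley-Terry-style pairwise ranking loss; the exact objective used for CMI-RM is not spelled out here, so the following is a generic sketch with placeholder reward values:

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry ranking loss: push the reward of the preferred
    (human- or pseudo-labeled) sample above the rejected one."""
    # -log sigmoid(r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# The loss shrinks as the margin between chosen and rejected grows.
close = pairwise_preference_loss(0.1, 0.0)
wide = pairwise_preference_loss(2.0, 0.0)
```

The same loss applies whether the pairs come from large-scale pseudo-labels or from expert annotations, which is what makes a two-stage training schedule straightforward.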
The Transformer-based Whisper model has achieved state-of-the-art performance in Automatic Speech Recognition (ASR). However, its Multi-Head Attention (MHA) mechanism results in significant GPU memory consumption due to the linearly growing Key-Value (KV) cache, which is problematic for many applications, especially with long-form audio. To address this, we introduce Whisper-MLA, a novel architecture that incorporates Multi-Head Latent Attention (MLA) into the Whisper model. Specifically, we adapt MLA for Whisper's absolute positional embeddings and systematically investigate its application across encoder self-attention, decoder self-attention, and cross-attention modules. Empirical results indicate that applying MLA exclusively to decoder self-attention yields the desired balance between performance and memory efficiency. Our proposed approach allows conversion of a pretrained Whisper model to Whisper-MLA with minimal fine-tuning. Extensive experiments on the LibriSpeech benchmark validate the effectiveness of this conversion, demonstrating that Whisper-MLA reduces the KV cache size by up to 87.5% while maintaining competitive accuracy.
Primary: Tianjin University
All Institutions: Tianjin University
The paper presents Whisper-MLA, a novel architecture that reduces GPU memory consumption in ASR models while preserving performance. This work is significant as it addresses a critical bottleneck in deploying state-of-the-art ASR systems, particularly for long-form audio applications, thereby enhancing accessibility and usability in various practical contexts.
The proposed methodology introduces a novel architecture, Whisper-MLA, which effectively integrates Multi-Head Latent Attention (MLA) into the Whisper model. The authors adapt MLA specifically for absolute positional embeddings, which is a significant innovation given the existing limitations of applying MLA to encoder-decoder architectures. The systematic investigation of MLA's application across different attention modules is commendable, and the decision to focus on decoder self-attention for optimization reflects a well-thought-out approach to balancing memory efficiency and performance.
The experiments conducted on the LibriSpeech benchmark are extensive and demonstrate the effectiveness of the Whisper-MLA model in reducing GPU memory consumption significantly while maintaining competitive accuracy. The results clearly show that the proposed model achieves up to 87.5% reduction in KV cache size, which is a critical metric for real-world applications, especially in resource-constrained environments. The comparative analysis with the original Whisper model provides a solid basis for the claims made.
The paper provides sufficient details regarding the experimental setup, including the model architecture, training parameters, and the dataset used. However, although the authors state that their source code is publicly available, no specific URL is provided in the text, which limits the reproducibility of the results.
One limitation is that the paper primarily focuses on the decoder self-attention mechanism, potentially overlooking the benefits that could be derived from optimizing other components of the model. Additionally, while the results are promising, the experiments are limited to the LibriSpeech dataset, which may not fully represent the model's performance across diverse ASR tasks and environments.
The Whisper-MLA architecture has significant implications for deploying large-scale ASR models in real-world applications, particularly in scenarios where GPU memory is a limiting factor. By reducing memory consumption while maintaining performance, this work could facilitate the use of advanced ASR technologies in mobile devices, embedded systems, and other resource-constrained environments.
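The KV-cache saving that MLA targets can be illustrated with a back-of-the-envelope sketch. Instead of caching full per-head keys and values, MLA caches one shared low-rank latent per token and re-projects K/V from it at attention time. The head count and latent dimension below are illustrative choices (not Whisper's actual configuration), picked so the ratio matches the reported 87.5% reduction:

```python
def kv_cache_floats_per_token(n_heads: int, head_dim: int) -> int:
    """Standard MHA caches full keys and values for every head."""
    return 2 * n_heads * head_dim

def mla_cache_floats_per_token(latent_dim: int) -> int:
    """MLA caches only a shared low-rank latent; keys and values are
    reconstructed from it via up-projections during attention."""
    return latent_dim

# Illustrative dimensions, not Whisper's actual configuration.
mha = kv_cache_floats_per_token(n_heads=16, head_dim=64)   # per-token floats
mla = mla_cache_floats_per_token(latent_dim=256)
reduction = 1 - mla / mha
```

Because the cache grows linearly with decoded length, a fixed per-token reduction translates directly into the same fractional memory saving on long-form audio.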
Symbolic music generation is a challenging task in multimedia generation, involving long sequences with hierarchical temporal structures, long-range dependencies, and fine-grained local details. Although recent diffusion-based models produce high-quality generations, they suffer from high training and inference costs on long symbolic sequences due to iterative denoising and sequence-length-dependent computation. To address this problem, we propose a diffusion strategy named SMDIM that combines efficient global structure construction with lightweight local refinement. SMDIM uses structured state space models to capture long-range musical context at near-linear cost, and selectively refines local musical details via a hybrid refinement scheme. Experiments on a wide range of symbolic music datasets, spanning Western classical, popular, and traditional folk music, show that SMDIM outperforms state-of-the-art approaches in both generation quality and computational efficiency, and generalizes robustly to underexplored musical styles. These results show that SMDIM offers a principled solution for long-sequence symbolic music generation and the attributes that accompany such sequences. We provide a project webpage with audio examples and supplementary materials at https://3328702107.github.io/smdim-music/.
Primary: Hubei University of Technology
All Institutions: Hubei University of Technology, Hubei Key Laboratory of Digital Finance Innovation, Hubei University of Economics, Wuhan University of Technology
The main contribution of this paper is the introduction of SMDIM, a novel diffusion-based architecture that effectively addresses the challenges of long-sequence symbolic music generation by integrating structured state space models and hybrid refinement techniques. This work represents a meaningful advancement in the field, offering both theoretical insights and practical applications that could impact the future of music generation technologies.
The proposed SMDIM framework innovatively integrates structured state space models with diffusion modeling to address the challenges of long-sequence symbolic music generation. The methodology is well-articulated, with a clear explanation of the hybrid architecture that balances global structure modeling and local detail refinement. The introduction of the MFA block, which combines Mamba layers, feed-forward networks, and self-attention, is a significant contribution that enhances both efficiency and expressiveness. The theoretical underpinnings are solid, and the approach is tailored to the unique requirements of symbolic music, making it a meaningful advancement in the field.
The experimental evaluation is robust, utilizing a diverse set of datasets (MAESTRO, POP909, and FolkDB) that cover various musical styles. The paper presents comprehensive results, demonstrating that SMDIM outperforms state-of-the-art models in both generation quality and computational efficiency. The use of objective metrics such as average overlap area (OA) provides a quantitative basis for the claims, and the subjective evaluations through listening tests add depth to the assessment of musical quality. The ablation studies further strengthen the findings by elucidating the contributions of different components within the model.
The paper includes sufficient details about the training process, hyperparameters, and model architecture, which are essential for reproducibility. However, the absence of a public code repository limits the ease of reproduction. While the methodology is described in detail, having an accessible implementation would enhance the ability of other researchers to validate and build upon this work.
The paper acknowledges certain limitations, such as the model's tendency to produce musically implausible pitch ranges and overly dense vertical note stacking. Additionally, the structural coherence of generated music may degrade in longer compositions, indicating challenges in maintaining global musical form. These limitations suggest areas for future research, including the incorporation of constraints to improve musical plausibility and coherence.
The implications of this research are significant for the fields of music generation and multimedia content creation. By improving the efficiency and quality of symbolic music generation, SMDIM could facilitate advancements in automated music composition, interactive music applications, and educational tools for music theory. The model's ability to generalize across diverse musical styles also opens avenues for cross-cultural music generation, potentially enriching the landscape of automated music creation.
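The near-linear cost of the structured state space backbone comes from a simple recurrent scan over the sequence, in contrast to the quadratic cost of full self-attention. A minimal sketch with an illustrative diagonal state matrix (not SMDIM's actual parameterization):

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Linear-time state space scan: x_t = A*x_{t-1} + B*u_t, y_t = C.x_t.
    Cost is O(T * state_dim), versus O(T^2) for full self-attention."""
    T = len(u)
    x = np.zeros(A.shape[0])
    y = np.empty(T)
    for t in range(T):
        x = A * x + B * u[t]   # diagonal A: elementwise recurrence
        y[t] = C @ x
    return y

# Toy 1-D impulse input; state and readout parameters are illustrative.
u = np.array([1.0, 0.0, 0.0, 0.0])
A = np.array([0.5, 0.9])   # diagonal decay rates
B = np.array([1.0, 1.0])
C = np.array([1.0, 1.0])
y = ssm_scan(u, A, B, C)
```

The decaying impulse response shows how such a layer carries context forward over long horizons at constant per-step cost, which is what makes hybrid designs (SSM for global structure, attention only for local refinement) attractive for long sequences.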
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)-based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general-purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE-based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DashengTokenizer, a unified audio tokenizer that enhances both understanding and generation tasks by inverting the traditional paradigm of acoustic tokenization. This innovative approach, alongside its competitive performance across various benchmarks, positions it as a significant advancement in the field of audio machine learning.
The methodology presented in the paper is innovative, as it proposes the DashengTokenizer, which inverts the conventional approach of audio tokenization by leveraging frozen semantic features to inject acoustic information. The simplicity of using a linear projection for acoustic injection is a notable strength, making the method efficient and accessible. However, the paper could benefit from a more detailed explanation of the training process and the selection of hyperparameters.
The experimental evaluation is robust, covering a wide range of tasks across understanding and generation domains, with comparisons to existing state-of-the-art methods. The use of diverse datasets and benchmarks strengthens the findings, and the results indicate significant improvements over traditional methods, particularly in understanding tasks. However, more detailed statistical analysis of the results could enhance the credibility of the claims.
The paper provides a clear overview of the architecture and training setup, which aids reproducibility. The availability of checkpoints on Hugging Face is a positive aspect, although the paper lacks specific implementation details that could help other researchers replicate the experiments more easily.
One limitation is the potential overfitting to the specific datasets used for training and evaluation, which may not generalize to all audio tasks. Additionally, the reliance on a frozen semantic encoder may limit adaptability to new domains without retraining.
The DashengTokenizer has the potential to significantly impact audio understanding and generation tasks, making it a valuable tool for applications in speech recognition, music analysis, and environmental sound classification. Its efficiency and performance could lead to broader adoption in real-world applications, particularly in areas requiring high-fidelity audio processing.
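The review highlights that the acoustic injection is a simple linear projection on top of frozen semantic features. The dimensions and the additive combination below are assumptions for illustration, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the real encoder dims are not specified here.
T, d_sem, d_ac = 10, 8, 6
semantic = rng.standard_normal((T, d_sem))   # frozen semantic features
acoustic = rng.standard_normal((T, d_ac))    # acoustic features to inject

# The only trainable piece in this sketch: a linear map into the
# semantic space (small init so tokens start close to the semantics).
W = rng.standard_normal((d_ac, d_sem)) * 0.01

# Acoustic injection: the semantic stream stays frozen, acoustic detail
# is added through the learned projection.
tokens = semantic + acoustic @ W
```

Keeping the semantic branch frozen is what preserves discriminability for understanding tasks, while the injected residual restores the low-level detail needed for reconstruction.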
Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers: learnable tokens appended to each chunk in online mode. These tokens act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency. Furthermore, we introduce a future prediction loss that explicitly guides the registers to capture predictive cues, thereby enriching their ability to retain future information. Experiments on LibriSpeech and out-of-domain benchmarks demonstrate that online registers consistently reduce the performance gap between offline and online modes, achieving a 3.4% relative improvement on LibriSpeech with 160 ms chunks, especially in low-latency settings.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of online registers and a future prediction loss to improve dual-mode self-supervised speech models, addressing the challenge of missing future context in streaming scenarios. This research presents a meaningful advancement in the field of speech recognition, combining innovative methodology with rigorous experimental validation to enhance model performance in real-time applications.
The proposed methodology introduces online registers as learnable tokens that serve as virtual placeholders for future context in dual-mode self-supervised speech models. This approach is innovative as it addresses the critical issue of attention mismatch in streaming scenarios without increasing latency. The incorporation of a future prediction loss further enhances the model's ability to retain useful predictive information. The design is well-structured, leveraging existing frameworks while introducing novel components that effectively bridge the gap between offline and online processing.
The experiments conducted on the LibriSpeech dataset and out-of-domain benchmarks provide a solid evaluation of the proposed method. The reported 3.4% relative improvement in word error rate (WER) demonstrates the effectiveness of online registers, particularly in low-latency settings. The comparison with existing methods like wav2vec 2.0 and UFO2 highlights the competitive performance of the proposed approach. However, the paper could benefit from more extensive experimentation across diverse datasets to validate the generalizability of the findings.
The paper provides detailed implementation information, including model architecture, training procedures, and hyperparameter settings, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to replicate the results independently.
One notable limitation is the potential overfitting observed when increasing the number of online registers. The paper indicates that using too many registers may degrade performance, suggesting a need for careful tuning. Additionally, the future prediction loss's effectiveness appears to vary depending on the dataset and chunk size, indicating that its benefits may not be universally applicable.
The proposed method has significant implications for real-time speech recognition systems, especially in applications requiring low-latency processing. By effectively mitigating the lack of future context, this research could enhance the performance of speech models in various domains, including virtual assistants, transcription services, and accessibility tools for the hearing impaired. The lightweight nature of the online registers also suggests potential for deployment in resource-constrained environments.
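Mechanically, appending register tokens to each chunk is a one-line operation; a minimal sketch (in the actual model the registers would be trainable parameters and the extended chunks would feed a streaming encoder, both omitted here):

```python
import numpy as np

def append_registers(chunks, registers):
    """Append the same learnable register tokens to every chunk so the
    encoder has placeholders standing in for unseen future frames."""
    return [np.concatenate([c, registers], axis=0) for c in chunks]

# Toy setup: 3 chunks of 4 frames, feature dim 8, 2 register tokens.
rng = np.random.default_rng(0)
chunks = [rng.standard_normal((4, 8)) for _ in range(3)]
registers = rng.standard_normal((2, 8))  # would be nn.Parameter in practice

extended = append_registers(chunks, registers)
```

Because the registers are appended within the current chunk rather than waiting for real future frames, they add a small constant compute cost but no lookahead latency, which is the property the paper exploits.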
Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.
Primary: Tencent AI Lab
All Institutions: Tencent AI Lab
The main contribution of this paper is the introduction of SemanticVocoder, a generative vocoder that synthesizes audio waveforms from semantic latents, overcoming the limitations of traditional VAE-based approaches and demonstrating superior performance in both audio generation and understanding tasks. This work represents a significant step forward in bridging the gap between audio generation and understanding, with implications for various applications in the audio processing domain.
The paper introduces SemanticVocoder, a novel approach that replaces traditional VAE acoustic latents with semantic latents for audio generation. The methodology is well-structured, leveraging a flow-matching approach to synthesize waveforms directly from semantic representations, thus addressing the limitations of conventional VAE-based systems. The use of a pretrained MAE encoder for extracting semantic latents is a significant innovation, enabling the model to focus on high-level semantic information rather than low-level acoustic details. The proposed architecture effectively balances the optimization difficulty across the text-to-latent and latent-to-waveform stages, which is a notable advancement in the field.
The experiments are comprehensive, utilizing multiple datasets (AudioCaps, AudioSet, and WavCaps) to evaluate the performance of SemanticVocoder against existing models. The reported results demonstrate superior performance in terms of Fréchet Distance and Fréchet Audio Distance, indicating that the model generates audio closer to real distributions. Additionally, the paper includes evaluations on audio understanding tasks, showcasing the discriminative power of semantic latents. However, the lack of subjective evaluation metrics is a minor drawback.
The paper provides detailed implementation specifics, including model architecture, training parameters, and datasets used, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease of reproduction for external researchers.
The paper acknowledges limitations such as dependency on the pretrained semantic encoder's performance, constraints on audio length generation, and the need for subjective evaluations to complement objective metrics. These factors could impact the model's applicability in real-world scenarios.
SemanticVocoder has the potential to significantly advance the field of audio generation and understanding by providing a unified framework that leverages semantic information. This could lead to improved applications in areas such as content creation, audio synthesis, and interactive media, thereby enhancing user experiences in various domains.
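The flow-matching objective noted in the review trains a network to predict the velocity along a straight path from noise to data. A minimal sketch of the target construction (conditioning on the semantic latents and the network itself are omitted; the vectors are illustrative):

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Conditional flow matching: interpolate noise x0 toward data x1 and
    regress the model's velocity prediction onto the constant target."""
    x_t = (1.0 - t) * x0 + t * x1   # point on the straight path at time t
    v_target = x1 - x0              # velocity the vocoder must predict
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)   # noise sample
x1 = rng.standard_normal(16)   # target waveform frame (illustrative)
x_t, v = flow_matching_pair(x0, x1, t=0.25)
```

At inference, integrating the predicted velocity from t = 0 to t = 1 carries a noise sample to a waveform; note that following v exactly from x_t reaches x1, i.e. x_t + (1 - t) * v = x1.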
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance, a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
Primary: Dartmouth College
All Institutions: Dartmouth College, Hume AI
The paper presents TADA, a generative framework that synchronizes text and acoustic features for improved speech modeling. This innovative approach addresses key challenges in traditional TTS systems, offering a significant contribution to the field with its potential for high-fidelity, efficient speech synthesis.
The proposed methodology introduces a novel tokenization scheme that achieves one-to-one synchronization between text and acoustic features, which is a significant advancement over traditional fixed-frame-rate approaches. The use of a dual alignment mechanism and a flow matching head within a large language model framework allows for efficient and high-fidelity speech synthesis. The architecture is well-structured, leveraging a combination of variational autoencoders and transformer-based models, which enhances the model's ability to handle both modalities concurrently. The approach to mitigate the modality gap through Speech Free Guidance (SFG) is particularly innovative, allowing for flexible integration of text and speech modalities.
The experiments conducted demonstrate the effectiveness of the proposed model against state-of-the-art TTS and SLM systems. The authors provide a comprehensive evaluation using multiple datasets and metrics, including character error rate, speaker similarity, and subjective evaluations of naturalness. The results indicate that the proposed method not only matches but often exceeds the performance of existing models, particularly in terms of reducing content hallucinations and improving inference efficiency. The extensive dataset used for training and evaluation further supports the robustness of the findings.
The paper includes a link to the GitHub repository containing the code and pre-trained models, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific configurations could be elaborated further to enhance clarity for future researchers attempting to replicate the results.
One limitation noted is the potential for speaker drifting during long-form generation, which suggests that while the model performs well in many scenarios, there are still challenges in maintaining speaker consistency over extended outputs. Additionally, the subjective evaluations indicate that while the model performs competitively, there is room for improvement in perceptual audio quality.
The implications of this research are significant for the field of speech synthesis and spoken language modeling. The ability to generate high-fidelity speech with reduced hallucinations and improved efficiency can enhance applications in voice cloning, virtual assistants, and interactive AI systems. The methodology could also pave the way for further advancements in multimodal AI systems, where seamless integration of text and speech is crucial.
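The described text-only guidance blends next-token logits from the text-speech and text-only modes. The exact blending rule is not given in this summary, so the sketch below uses a standard guidance-style linear interpolation as an assumption:

```python
import numpy as np

def text_only_guidance(logits_ts, logits_text, w):
    """Blend next-token logits from the text-speech mode toward logits
    from a text-only pass. w = 0 keeps the text-speech mode; w = 1 fully
    defers to text-only LLM behavior. The paper's exact rule may differ."""
    return logits_ts + w * (logits_text - logits_ts)

# Toy 3-token vocabulary (values are illustrative).
ts = np.array([1.0, 2.0, 0.5])    # text-speech mode logits
txt = np.array([0.0, 3.0, 0.5])   # text-only mode logits
blended = text_only_guidance(ts, txt, w=0.5)
```

Because the model can toggle the speech modality within a single context, both logit sets come from the same network, making this blend cheap to compute at decode time.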
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, which details our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning using perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the single most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proved to be the primary driver of accuracy gains. Ultimately, this work outlines a highly optimized dual pipeline achieving a ~0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
Primary: KUET
All Institutions: KUET, BUET
The main contribution of this paper is the introduction of the Lipi-Ghor-882 dataset and the development of optimized ASR and speaker diarization pipelines for Bengali audio. This work addresses critical gaps in the field, providing a foundation for future research and applications in low-resource speech processing. The comprehensive analysis of methodologies and results highlights the significance of targeted fine-tuning and heuristic post-processing in overcoming the challenges of long-form audio processing.
The methodology presented in the paper is robust, focusing on two critical areas: Automatic Speech Recognition (ASR) and speaker diarization for Bengali audio. The authors systematically evaluated various architectures and fine-tuning strategies, demonstrating a clear understanding of the challenges associated with long-form audio processing. The innovative use of perfectly aligned annotations and synthetic acoustic degradation for ASR is particularly noteworthy. The shift from model retraining to heuristic post-processing for diarization reflects a pragmatic approach to overcoming the limitations of existing models in this domain.
The experiments conducted are thorough, with a clear focus on optimizing inference time and accuracy. The introduction of the Lipi-Ghor-882 dataset is a significant contribution, as it addresses the scarcity of resources for Bengali ASR and diarization. The results indicate a well-structured evaluation process, with the authors providing insights into the performance of various models and the effectiveness of their proposed methods. The achievement of a Real-Time Factor (RTF) of 0.019 is impressive and establishes a benchmark for future work.
While the paper outlines the methodologies and experiments in detail, it lacks specific implementation details and code availability, which are crucial for reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the findings or build upon this work.
The paper acknowledges several limitations, including compute resource constraints, data quality issues, and the challenges of model optimization for Bengali features. The reliance on heuristic post-processing for diarization, while effective, indicates that further research is needed to develop more robust models for this task.
This research has significant implications for the field of speech processing, particularly for low-resource languages like Bengali. The introduction of a large-scale dataset and the development of optimized pipelines can facilitate advancements in conversational AI, making it more accessible for Bengali speakers. The findings may also inspire similar approaches in other low-resource language contexts.
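For reference, the Real-Time Factor cited above is simply processing time divided by audio duration, with values below 1.0 meaning faster than real time. A quick sketch (the timing values are illustrative, not measurements from the paper):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock processing time / audio duration.
    RTF < 1.0 means the pipeline runs faster than real time."""
    return processing_seconds / audio_seconds

# An RTF of ~0.019 corresponds to roughly 1 second of compute
# per ~52-53 seconds of audio.
rtf = real_time_factor(processing_seconds=1.0, audio_seconds=52.6)
```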
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a large multimodal model adapted via supervised fine-tuning (SFT) that casts AVD as a prompted yes/no classification: "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and MAVOS-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the MAVOS-DD dataset.
Primary: Fudan University
All Institutions: Fudan University, Tencent Youtu Lab
The main contribution of this paper is the introduction of AV-LMMDetect, a large multimodal model for audio-visual deepfake detection that utilizes a novel two-stage training approach, achieving state-of-the-art results and demonstrating the potential of large models in enhancing detection capabilities. The methodology and results present a meaningful advancement in the field, addressing critical challenges in deepfake detection and setting a foundation for future research.
The paper introduces AV-LMMDetect, a novel supervised fine-tuned large multimodal model for audio-visual deepfake detection, employing a two-stage training strategy that combines lightweight LoRA alignment with full fine-tuning of audio-visual encoders. This methodology is innovative as it reformulates the detection task into a binary question-answering format, which is a fresh approach in the context of deepfake detection. The use of a large multimodal model also enhances the model's ability to capture cross-modal inconsistencies, a significant improvement over traditional methods that often rely on smaller, task-specific models.
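The two-stage recipe can be made concrete with a minimal numpy sketch of a single LoRA-augmented projection: in stage 1 only the low-rank factors A and B receive gradients while the pretrained weight W stays frozen; in stage 2 the delta is merged and all weights are fine-tuned. Dimensions and the zero-initialization follow common LoRA practice, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # toy hidden size and LoRA rank (illustrative)

W = rng.standard_normal((d, d))    # pretrained projection, frozen in stage 1
A = rng.standard_normal((r, d)) * 0.01
B = np.zeros((d, r))               # zero-init: the LoRA delta starts at zero

def lora_forward(x):
    """Frozen base projection plus the trainable low-rank correction."""
    return W @ x + (B @ (A @ x))

# Stage 1 (alignment): gradients flow only into A and B.
# Stage 2 (full fine-tuning): merge the delta and unfreeze everything.
W_merged = W + B @ A

# At initialization the merged weights equal the base weights exactly:
assert np.allclose(W_merged, W)
```

The zero-initialized B makes the adapted model numerically identical to the base model at the start of training, which is what makes the lightweight alignment stage stable.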
The experimental results demonstrate that AV-LMMDetect achieves state-of-the-art performance on the MAVOS-DD dataset and competitive results on FakeAVCeleb. The paper provides thorough evaluations across multiple scenarios, including in-domain and open-set conditions, showcasing the model's robustness and generalization capabilities. The use of standard binary classification metrics (accuracy, AUC, and mAP) adds rigor to the evaluation process, although the paper could benefit from more detailed comparisons with additional baseline methods.
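Of the metrics mentioned, AUC is the least self-explanatory; a minimal pure-Python implementation via the rank (Mann-Whitney) formulation is sketched below, with illustrative "fake-probability" scores rather than model outputs.

```python
def auc(labels, scores):
    """ROC AUC via the Mann-Whitney formulation: the probability that a
    randomly chosen positive outscores a randomly chosen negative
    (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One negative outranks one positive, so 3 of 4 pairs are ordered correctly:
print(auc([1, 1, 0, 0], [0.9, 0.6, 0.7, 0.2]))  # → 0.75
```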
The paper lacks explicit details regarding the implementation and hyperparameter settings, which are crucial for reproducibility. While the methodology is described, the absence of a dedicated section on experimental setup and code availability limits the ability of other researchers to replicate the results. Including a link to a code repository or supplementary materials would significantly enhance reproducibility.
One limitation of the study is the reliance on two specific datasets, which may not fully represent the diversity of deepfake techniques in real-world applications. Additionally, while the model shows strong performance, the paper does not address potential biases in the datasets or the model's performance across different demographic groups, which is critical for ethical considerations in deployment.
The implications of this research are significant, as robust audio-visual deepfake detection is crucial for maintaining media integrity and public trust in an era of increasing misinformation. The proposed model could be applied in various domains, including social media, journalism, and law enforcement, to identify and mitigate the risks posed by deepfake technology.
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
Primary: Tampere University
All Institutions: Tampere University, Nokia Technologies
The paper presents a novel dual-branch architecture for moving speaker separation that effectively addresses challenges in multi-channel speech processing. The methodology is well-founded and the experimental results demonstrate substantial improvements over existing methods, highlighting its potential impact in practical applications.
The proposed dual-branch parallel spectral-spatial (PS2) architecture represents a significant methodological advancement in the field of multi-channel speech separation. By separating the processing of spectral and spatial features, the authors effectively address the inherent modeling conflicts present in existing sequential architectures. The use of bi-directional long short-term memory (BLSTM) and bi-directional gated recurrent unit (BGRU) networks, along with a cross-attention fusion mechanism, allows for more nuanced feature extraction and integration. This approach is well-grounded in the theoretical understanding of the different temporal scales at which spectral and spatial features evolve, making it a thoughtful and innovative contribution to the field.
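The fusion step can be sketched as standard cross-attention in which spectral frames query spatial frames; the shapes, projections, and single-head form below are illustrative simplifications, not the paper's exact configuration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(spec, spat, Wq, Wk, Wv):
    """Spectral frames (queries) attend over spatial frames (keys/values);
    the residual keeps the spectral path intact while the attention map
    adaptively weights the spatial contribution."""
    q, k, v = spec @ Wq, spat @ Wk, spat @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return spec + attn @ v

rng = np.random.default_rng(1)
T, d = 4, 8                        # frames and feature dim, purely illustrative
spec = rng.standard_normal((T, d))
spat = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
fused = cross_attention_fuse(spec, spat, Wq, Wk, Wv)
assert fused.shape == (T, d)
```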
The experimental setup is robust, utilizing multiple datasets including WHAMR! and a newly generated WSJ0-Demand-6ch-Move dataset specifically designed for moving speaker scenarios. The results demonstrate clear improvements over state-of-the-art methods, with significant gains in scale-invariant signal-to-distortion ratio (SI-SDR) across varying acoustic conditions. The ablation studies provide valuable insights into the contributions of each component of the PS2 architecture, reinforcing the importance of the dual-branch design. However, the paper could benefit from a more detailed analysis of the computational efficiency and potential trade-offs involved in the proposed architecture.
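SI-SDR, the headline metric here, is straightforward to compute; the numpy sketch below shows the standard projection-based definition on synthetic signals (the clean/noisy signals are illustrative, not the evaluation data).

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB: project the (mean-removed) estimate onto
    the target so that rescaling the estimate leaves the score unchanged."""
    target = target - target.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = alpha * target
    noise = estimate - s_target
    return 10 * np.log10(np.dot(s_target, s_target) / (np.dot(noise, noise) + eps))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # 1 s of synthetic "speech" at 16 kHz
noisy = clean + 0.1 * rng.standard_normal(16000)
noisier = clean + 0.5 * rng.standard_normal(16000)
print(si_sdr(noisy, clean) > si_sdr(noisier, clean))  # → True
```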
The paper provides a comprehensive description of the architecture, training configurations, and datasets used, which supports reproducibility. However, the absence of publicly available code or a project URL limits the ability for others to replicate the results directly. Future work could include releasing the model and training scripts to enhance reproducibility within the research community.
One limitation of the study is the reliance on synthetic datasets for moving speaker scenarios, which may not fully capture the complexities of real-world environments. Additionally, while the model shows strong performance in various conditions, the authors do not extensively discuss its limitations in extreme acoustic scenarios or potential failure modes. The evaluation metrics primarily focus on SI-SDR, which, while important, may not encompass all aspects of speech quality and intelligibility.
The advancements in multi-channel speech separation have significant implications for various applications, including voice recognition systems, hearing aids, and telecommunication technologies. The ability to effectively separate moving speakers in dynamic environments could enhance user experiences in real-world applications, making this research particularly relevant in the context of increasing reliance on audio processing technologies in daily life.
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
Primary: Jilin University
All Institutions: Hunan University, Jilin University, Shandong University, University of Electronic Science and Technology of China
UniWhisper presents a novel approach to continual multi-task training for universal audio representation, achieving competitive performance across diverse audio tasks while maintaining strong speech capabilities. This work significantly contributes to the field by addressing the limitations of existing models and offering a streamlined methodology that enhances both efficiency and effectiveness in audio representation learning.
The methodology presented in UniWhisper is innovative, focusing on a unified instruction and answer format for continual multi-task training. This approach effectively addresses the limitations of existing audio encoders that excel in specific domains but struggle with others. By leveraging a single encoder and a compact pretrained language model as the decoder, the authors streamline the training process and reduce redundancy in audio token representation. The decision to utilize shallow MLP probes and kNN for evaluation is appropriate, as it allows for a clear assessment of the model's performance across diverse tasks.
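A kNN probe of the kind used for evaluation can be sketched in a few lines: frozen encoder outputs are classified by nearest neighbors, with no trainable head at all. The toy embeddings below stand in for encoder outputs and are purely illustrative.

```python
import numpy as np

def knn_probe(train_emb, train_labels, test_emb, k=1):
    """Classify each test embedding by majority vote among its k nearest
    training embeddings (Euclidean distance); the encoder itself stays frozen."""
    dists = np.linalg.norm(test_emb[:, None, :] - train_emb[None, :, :], axis=-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return np.array([np.bincount(train_labels[row]).argmax() for row in nearest])

# Two well-separated toy clusters standing in for frozen encoder outputs:
train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
test = np.array([[0.05, 0.1], [4.9, 5.1]])
print(knn_probe(train, labels, test, k=1))  # → [0 1]
```

Because the probe has no learnable parameters, its accuracy is a direct read-out of how linearly separable, or at least locally clustered, the encoder's representation is.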
The experimental setup is robust, utilizing a substantial training dataset of 38k hours of public audio, which enhances the generalizability of the results. The evaluation across 20 tasks provides a comprehensive view of the model's capabilities. The results indicate that UniWhisper outperforms existing models like Whisper, HuBERT, and others, particularly in non-speech tasks, which is a significant achievement. The use of normalized weighted averages for performance metrics is a strong point, ensuring comparability across different tasks.
The paper states that code and pretrained weights will be released upon acceptance, which is crucial for reproducibility. In the meantime, the methodology details, such as specific hyperparameters and training configurations, are described well enough for other researchers to replicate the experiments. The use of a compact pretrained language model as the decoder is also a noteworthy detail that aids in understanding the architecture.
One limitation is the reliance on a single encoder, which may not capture all nuances across diverse audio tasks as effectively as dual-encoder systems. Additionally, while the results are promising, the paper does not extensively discuss the potential trade-offs in performance when adapting UniWhisper to other audio domains not covered in the evaluation.
The implications of this work are significant, as it proposes a more efficient method for training universal audio representations that can be applied in various applications, including speech recognition, environmental sound classification, and music analysis. The potential for reduced training costs and improved performance across multiple tasks could lead to advancements in audio processing technologies, making them more accessible and effective in real-world applications.
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. Taiwanese Hokkien exemplifies this issue: although a wealth of spoken content is accessible in television dramas and online videos, transcriptions are scarce and the majority of available subtitles are provided only in Mandarin. To address this deficiency, we introduce TG-ASR, a translation-guided ASR framework for Taiwanese Hokkien drama speech recognition that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the TG-ASR framework, which effectively utilizes translation-guided learning through a novel PGCA mechanism to enhance automatic speech recognition for low-resource languages. This research addresses critical gaps in ASR technology and provides a valuable resource for future studies in multilingual and low-resource language processing.
The methodology presented in TG-ASR is innovative, particularly the introduction of the Parallel Gated Cross-Attention (PGCA) mechanism, which adaptively integrates multilingual translation embeddings into the ASR decoder. This approach is well-justified, addressing the specific challenges of low-resource languages by leveraging auxiliary languages to improve transcription accuracy. The two-stage training process is clearly articulated, ensuring that the model benefits from both initial fine-tuning and subsequent integration of multilingual embeddings. However, the reliance on pre-trained models for auxiliary language embeddings and the potential noise introduced by machine translations are notable considerations.
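The exact PGCA formulation is not reproduced here, but its gating idea can be sketched: each auxiliary language feeds a cross-attention branch whose contribution is scaled by a learned, zero-initialized gate, so optimization starts from the plain ASR decoder. The single-branch, single-head numpy version below is an illustrative simplification.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_cross_attention(dec, aux, Wq, Wk, Wv, gate):
    """Decoder states attend over one auxiliary language's translation
    embeddings; a tanh-squashed scalar gate (zero at initialization)
    scales the injected signal, so training starts from the unmodified
    ASR decoder and opens the gate only as far as the language helps."""
    attn = softmax((dec @ Wq) @ (aux @ Wk).T / np.sqrt(Wq.shape[1]))
    return dec + np.tanh(gate) * (attn @ (aux @ Wv))

rng = np.random.default_rng(2)
T, d = 5, 16                       # toy decoder length and width
dec = rng.standard_normal((T, d))
aux = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

# With the gate at zero the auxiliary branch is inert, as at initialization:
assert np.allclose(gated_cross_attention(dec, aux, Wq, Wk, Wv, gate=0.0), dec)
```

In the "parallel" setting described by the paper, several such branches (one per auxiliary language) would run side by side, each with its own gate, which is what lets the model learn how much to trust each language.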
The experiments are comprehensive, utilizing the newly created YT-THDC corpus, which is a significant contribution to the field of low-resource ASR. The results demonstrate a substantial reduction in character error rate (CER), validating the effectiveness of the proposed framework. The ablation studies provide insights into the contributions of various components of the PGCA mechanism, reinforcing the robustness of the findings. However, the paper could benefit from additional comparative analyses with state-of-the-art models to contextualize the performance gains more clearly.
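For reference, CER is edit distance normalized by reference length, and the reported gain is a relative reduction; the pure-Python sketch below uses illustrative numbers of the right magnitude (0.30 → 0.2557 is roughly a 14.8% relative reduction), not the paper's actual scores.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (rc != hc)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline: float, improved: float) -> float:
    return (baseline - improved) / baseline

print(cer("台灣話真好聽", "台語話真好聽"))          # one substitution in six characters
# Illustrative CERs of the magnitude a ~14.8% relative reduction implies:
print(round(relative_reduction(0.30, 0.2557), 3))  # → 0.148
```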
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly accessible code repository or demo limits the ability for other researchers to replicate the results fully. The authors should consider releasing their code and model weights to facilitate further research and validation.
The study acknowledges several limitations, including the size and domain specificity of the YT-THDC corpus, which may restrict generalizability. Additionally, the reliance on auxiliary translations introduces potential noise that could affect performance. The findings are also specific to Taiwanese Hokkien, and the effectiveness of the approach for other low-resource languages remains to be validated.
The work has significant implications for the preservation of endangered languages and the accessibility of media content. By improving ASR for Taiwanese Hokkien, the research contributes to cultural preservation efforts and enhances bilingual accessibility. The methodology could be adapted for other low-resource languages, potentially benefiting a wider range of linguistic communities.