Recent audio generation models typically rely on Variational Autoencoders (VAEs) and perform generation within the VAE latent space. Although VAEs excel at compression and reconstruction, their latents inherently encode low-level acoustic details rather than semantically discriminative information, leading to entangled event semantics and complicating the training of generative models. To address these issues, we discard VAE acoustic latents and introduce semantic encoder latents, thereby proposing SemanticVocoder, a generative vocoder that directly synthesizes waveforms from semantic latents. Equipped with SemanticVocoder, our text-to-audio generation model achieves a Fréchet Distance of 12.823 and a Fréchet Audio Distance of 1.709 on the AudioCaps test set, as the introduced semantic latents exhibit superior discriminability compared to acoustic VAE latents. Beyond improved generation performance, it also serves as a promising attempt towards unifying audio understanding and generation within a shared semantic space. Generated samples are available at https://zeyuxie29.github.io/SemanticVocoder/.
Primary: Tencent AI Lab
All Institutions: Tencent AI Lab
The main contribution of this paper is the introduction of SemanticVocoder, a generative vocoder that synthesizes audio waveforms from semantic latents, overcoming the limitations of traditional VAE-based approaches and demonstrating superior performance in both audio generation and understanding tasks. This work represents a significant step forward in bridging the gap between audio generation and understanding, with implications for various applications in the audio processing domain.
The paper introduces SemanticVocoder, a novel approach that replaces traditional VAE acoustic latents with semantic latents for audio generation. The methodology is well-structured, leveraging a flow-matching approach to synthesize waveforms directly from semantic representations, thus addressing the limitations of conventional VAE-based systems. The use of a pretrained MAE encoder for extracting semantic latents is a significant innovation, enabling the model to focus on high-level semantic information rather than low-level acoustic details. The proposed architecture effectively balances the optimization difficulty across the text-to-latent and latent-to-waveform stages, which is a notable advancement in the field.
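The flow-matching recipe described above can be sketched as follows. This is a minimal illustration of the standard conditional flow-matching objective, not the paper's actual implementation; the function name and latent shapes are assumptions:

```python
import numpy as np

def flow_matching_target(z0, z1, t):
    """Linear interpolant between Gaussian noise z0 and a data latent z1.

    Conditional flow matching trains a network v_theta(z_t, t) to regress
    the constant velocity (z1 - z0) along this straight path; at inference
    an ODE solver integrates the learned field from noise toward data.
    """
    t = t.reshape(-1, *([1] * (z0.ndim - 1)))  # broadcast t over feature dims
    z_t = (1.0 - t) * z0 + t * z1              # point on the path at time t
    v_target = z1 - z0                         # regression target
    return z_t, v_target

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 128))  # noise samples
z1 = rng.standard_normal((4, 128))  # hypothetical semantic latent batch
t = rng.uniform(size=4)             # random timesteps in [0, 1]
z_t, v = flow_matching_target(z0, z1, t)
```

At t = 0 the interpolant recovers the noise sample exactly, which makes the straight-path construction easy to sanity-check.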
The experiments are comprehensive, utilizing multiple datasets (AudioCaps, AudioSet, and WavCaps) to evaluate the performance of SemanticVocoder against existing models. The reported results demonstrate superior performance in terms of Fréchet Distance and Fréchet Audio Distance, indicating that the model generates audio closer to real distributions. Additionally, the paper includes evaluations on audio understanding tasks, showcasing the discriminative power of semantic latents. However, the lack of subjective evaluation metrics is a minor drawback.
The paper provides detailed implementation specifics, including model architecture, training parameters, and datasets used, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease of reproduction for external researchers.
The paper acknowledges limitations such as dependency on the pretrained semantic encoder's performance, constraints on audio length generation, and the need for subjective evaluations to complement objective metrics. These factors could impact the model's applicability in real-world scenarios.
SemanticVocoder has the potential to significantly advance the field of audio generation and understanding by providing a unified framework that leverages semantic information. This could lead to improved applications in areas such as content creation, audio synthesis, and interactive media, thereby enhancing user experiences in various domains.
Modern Text-to-Speech (TTS) systems increasingly leverage Large Language Model (LLM) architectures to achieve scalable, high-fidelity, zero-shot generation. However, these systems typically rely on fixed-frame-rate acoustic tokenization, resulting in speech sequences that are significantly longer than, and asynchronous with, their corresponding text. Beyond computational inefficiency, this sequence length disparity often triggers hallucinations in TTS and amplifies the modality gap in spoken language modeling (SLM). In this paper, we propose a novel tokenization scheme that establishes one-to-one synchronization between continuous acoustic features and text tokens, enabling unified, single-stream modeling within an LLM. We demonstrate that these synchronous tokens maintain high-fidelity audio reconstruction and can be effectively modeled in a latent space by a large language model with a flow matching head. Moreover, the ability to seamlessly toggle speech modality within the context enables text-only guidance--a technique that blends logits from text-only and text-speech modes to flexibly bridge the gap toward text-only LLM intelligence. Experimental results indicate that our approach achieves performance competitive with state-of-the-art TTS and SLM systems while virtually eliminating content hallucinations and preserving linguistic integrity, all at a significantly reduced inference cost.
Primary: Dartmouth College
All Institutions: Dartmouth College, Hume AI
The paper presents TADA, a generative framework that synchronizes text and acoustic features for improved speech modeling. This innovative approach addresses key challenges in traditional TTS systems, offering a significant contribution to the field with its potential for high-fidelity, efficient speech synthesis.
The proposed methodology introduces a novel tokenization scheme that achieves one-to-one synchronization between text and acoustic features, which is a significant advancement over traditional fixed-frame-rate approaches. The use of a dual alignment mechanism and a flow matching head within a large language model framework allows for efficient and high-fidelity speech synthesis. The architecture is well-structured, leveraging a combination of variational autoencoders and transformer-based models, which enhances the model's ability to handle both modalities concurrently. The approach to mitigate the modality gap through Speech Free Guidance (SFG) is particularly innovative, allowing for flexible integration of text and speech modalities.
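The logit-blending idea behind Speech Free Guidance can be pictured with a toy sketch. The paper only states that logits from the text-only and text-speech modes are mixed; the linear interpolation rule and all values below are assumptions:

```python
import numpy as np

def blend_logits(logits_text_speech, logits_text_only, w):
    """Mix next-token logits from the two decoding modes.

    w = 0 keeps the text-speech mode unchanged; w = 1 falls back entirely
    to the text-only LLM. Intermediate values trade speech conditioning
    against text-only linguistic competence.
    """
    return (1.0 - w) * logits_text_speech + w * logits_text_only

ts = np.array([2.0, -1.0, 0.5])  # logits with speech context (toy values)
to = np.array([0.0, 1.0, -0.5])  # logits in text-only mode (toy values)
blended = blend_logits(ts, to, 0.3)
```

The two endpoints of the blend recover the pure modes, which is what lets the guidance weight be tuned freely at inference time.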
The experiments conducted demonstrate the effectiveness of the proposed model against state-of-the-art TTS and SLM systems. The authors provide a comprehensive evaluation using multiple datasets and metrics, including character error rate, speaker similarity, and subjective evaluations of naturalness. The results indicate that the proposed method not only matches but often exceeds the performance of existing models, particularly in terms of reducing content hallucinations and improving inference efficiency. The extensive dataset used for training and evaluation further supports the robustness of the findings.
The paper includes a link to the GitHub repository containing the code and pre-trained models, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific configurations could be elaborated further to enhance clarity for future researchers attempting to replicate the results.
One limitation noted is the potential for speaker drifting during long-form generation, which suggests that while the model performs well in many scenarios, there are still challenges in maintaining speaker consistency over extended outputs. Additionally, the subjective evaluations indicate that while the model performs competitively, there is room for improvement in perceptual audio quality.
The implications of this research are significant for the field of speech synthesis and spoken language modeling. The ability to generate high-fidelity speech with reduced hallucinations and improved efficiency can enhance applications in voice cloning, virtual assistants, and interactive AI systems. The methodology could also pave the way for further advancements in multimodal AI systems, where seamless integration of text and speech is crucial.
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
Primary: Tampere University
All Institutions: Tampere University, Nokia Technologies
The paper presents a novel dual-branch architecture for moving speaker separation that effectively addresses challenges in multi-channel speech processing. The methodology is well-founded and the experimental results demonstrate substantial improvements over existing methods, highlighting its potential impact in practical applications.
The proposed dual-branch parallel spectral-spatial (PS2) architecture represents a significant methodological advancement in the field of multi-channel speech separation. By separating the processing of spectral and spatial features, the authors effectively address the inherent modeling conflicts present in existing sequential architectures. The use of bi-directional long short-term memory (BLSTM) and bi-directional gated recurrent unit (BGRU) networks, along with a cross-attention fusion mechanism, allows for more nuanced feature extraction and integration. This approach is well-grounded in the theoretical understanding of the different temporal scales at which spectral and spatial features evolve, making it a thoughtful and innovative contribution to the field.
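A single-head sketch of the cross-attention fusion step, with the spectral stream querying the spatial stream. The real PS2 fusion block is multi-head with learned projections, so the residual connection and shapes here are assumptions:

```python
import numpy as np

def cross_attention_fuse(spectral, spatial):
    """Fuse branch outputs: spectral frames attend over spatial frames.

    Both inputs are (T, D). The attention weights adaptively decide how
    much spatial evidence each spectral frame absorbs.
    """
    d = spectral.shape[-1]
    scores = spectral @ spatial.T / np.sqrt(d)    # (T, T) frame affinities
    scores -= scores.max(axis=-1, keepdims=True)  # softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return spectral + weights @ spatial           # residual fusion

rng = np.random.default_rng(0)
spec = rng.standard_normal((6, 8))  # spectral-branch features (toy shape)
spat = rng.standard_normal((6, 8))  # spatial-branch features (toy shape)
fused = cross_attention_fuse(spec, spat)
```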
The experimental setup is robust, utilizing multiple datasets including WHAMR! and a newly generated WSJ0-Demand-6ch-Move dataset specifically designed for moving speaker scenarios. The results demonstrate clear improvements over state-of-the-art methods, with significant gains in scale-invariant signal-to-distortion ratio (SI-SDR) across varying acoustic conditions. The ablation studies provide valuable insights into the contributions of each component of the PS2 architecture, reinforcing the importance of the dual-branch design. However, the paper could benefit from a more detailed analysis of the computational efficiency and potential trade-offs involved in the proposed architecture.
The paper provides a comprehensive description of the architecture, training configurations, and datasets used, which supports reproducibility. However, the absence of publicly available code or a project URL limits the ability for others to replicate the results directly. Future work could include releasing the model and training scripts to enhance reproducibility within the research community.
One limitation of the study is the reliance on synthetic datasets for moving speaker scenarios, which may not fully capture the complexities of real-world environments. Additionally, while the model shows strong performance in various conditions, the authors do not extensively discuss its limitations in extreme acoustic scenarios or potential failure modes. The evaluation metrics primarily focus on SI-SDR, which, while important, may not encompass all aspects of speech quality and intelligibility.
The advancements in multi-channel speech separation have significant implications for various applications, including voice recognition systems, hearing aids, and telecommunication technologies. The ability to effectively separate moving speakers in dynamic environments could enhance user experiences in real-world applications, making this research particularly relevant in the context of increasing reliance on audio processing technologies in daily life.
This paper introduces DashengTokenizer, a continuous audio tokenizer engineered for joint use in both understanding and generation tasks. Unlike conventional approaches, which train acoustic tokenizers and subsequently integrate frozen semantic knowledge, our method inverts this paradigm: we leverage frozen semantic features and inject acoustic information. In linear evaluation across 22 diverse tasks, our method outperforms previous audio codec and audio encoder baselines by a significant margin while maintaining competitive audio reconstruction quality. Notably, we demonstrate that this acoustic injection improves performance for tasks such as speech emotion recognition, music understanding, and acoustic scene classification. We further evaluate the tokenizer's generative performance on text-to-audio (TTA), text-to-music (TTM), and speech enhancement (SE). Our approach surpasses standard variational autoencoder (VAE)-based methods on TTA and TTM tasks, while its effectiveness on SE underscores its capabilities as a general-purpose audio encoder. Finally, our results challenge the prevailing assumption that VAE-based architectures are a prerequisite for audio synthesis. Checkpoints are available at https://huggingface.co/mispeech/dashengtokenizer.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DashengTokenizer, a unified audio tokenizer that enhances both understanding and generation tasks by inverting the traditional paradigm of acoustic tokenization. This innovative approach, alongside its competitive performance across various benchmarks, positions it as a significant advancement in the field of audio machine learning.
The methodology presented in the paper is innovative, as it proposes the DashengTokenizer, which inverts the conventional approach of audio tokenization by leveraging frozen semantic features to inject acoustic information. The simplicity of using a linear projection for acoustic injection is a notable strength, making the method efficient and accessible. However, the paper could benefit from a more detailed explanation of the training process and the selection of hyperparameters.
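The acoustic-injection step highlighted above can be pictured as a single learned linear map applied to low-level acoustic features and combined with the frozen semantic features. The additive combination and all dimensions below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_sem, d_ac = 50, 768, 128
semantic = rng.standard_normal((T, d_sem))     # frozen semantic encoder output
acoustic = rng.standard_normal((T, d_ac))      # low-level acoustic features
W = rng.standard_normal((d_ac, d_sem)) * 0.01  # the learned linear projection
tokens = semantic + acoustic @ W               # inject acoustics into tokens
```

Only `W` is trained in this picture; the semantic backbone stays frozen, which is what keeps the method simple and efficient.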
The experimental evaluation is robust, covering a wide range of tasks across understanding and generation domains, with comparisons to existing state-of-the-art methods. The use of diverse datasets and benchmarks strengthens the findings, and the results indicate significant improvements over traditional methods, particularly in understanding tasks. However, more detailed statistical analysis of the results could enhance the credibility of the claims.
The paper provides a clear overview of the architecture and training setup, which aids reproducibility. The availability of checkpoints on Hugging Face is a positive aspect, although the paper lacks specific implementation details that could help other researchers replicate the experiments more easily.
One limitation is the potential overfitting to the specific datasets used for training and evaluation, which may not generalize to all audio tasks. Additionally, the reliance on a frozen semantic encoder may limit adaptability to new domains without retraining.
The DashengTokenizer has the potential to significantly impact audio understanding and generation tasks, making it a valuable tool for applications in speech recognition, music analysis, and environmental sound classification. Its efficiency and performance could lead to broader adoption in real-world applications, particularly in areas requiring high-fidelity audio processing.
Dual-mode self-supervised speech models (S3Ms), which are jointly pre-trained in offline and online modes, suffer from attention mismatch in streaming scenarios due to missing future context. To address this challenge, we propose online registers, learnable tokens appended to each chunk in online mode. These tokens act as virtual placeholders for unseen future frames, enabling the model to compensate for missing context without introducing additional latency. Furthermore, we introduce a future prediction loss that explicitly guides the registers to capture predictive cues, thereby enriching their ability to retain future information. Experiments on LibriSpeech and out-of-domain benchmarks demonstrate that online registers consistently reduce the performance gap between offline and online modes, achieving a 3.4% relative improvement on LibriSpeech with 160 ms chunks, especially in low-latency settings.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of online registers and a future prediction loss to improve dual-mode self-supervised speech models, addressing the challenge of missing future context in streaming scenarios. This research presents a meaningful advancement in the field of speech recognition, combining innovative methodology with rigorous experimental validation to enhance model performance in real-time applications.
The proposed methodology introduces online registers as learnable tokens that serve as virtual placeholders for future context in dual-mode self-supervised speech models. This approach is innovative as it addresses the critical issue of attention mismatch in streaming scenarios without increasing latency. The incorporation of a future prediction loss further enhances the model's ability to retain useful predictive information. The design is well-structured, leveraging existing frameworks while introducing novel components that effectively bridge the gap between offline and online processing.
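Mechanically, the register trick amounts to appending a few learned embeddings to every streaming chunk before attention runs. All sizes below are illustrative:

```python
import numpy as np

def append_registers(chunk, registers):
    """Append R learnable register tokens to one online chunk.

    chunk: (T, D) frames available so far; registers: (R, D) learned
    embeddings standing in for unseen future frames. Self-attention then
    runs over the (T + R, D) sequence, so no lookahead latency is added.
    """
    return np.concatenate([chunk, registers], axis=0)

chunk = np.zeros((20, 16))    # frames of one streaming chunk (toy size)
registers = np.ones((4, 16))  # 4 register embeddings (toy values)
extended = append_registers(chunk, registers)
```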
The experiments conducted on the LibriSpeech dataset and out-of-domain benchmarks provide a solid evaluation of the proposed method. The reported 3.4% relative improvement in word error rate (WER) demonstrates the effectiveness of online registers, particularly in low-latency settings. The comparison with existing methods like wav2vec 2.0 and UFO2 highlights the competitive performance of the proposed approach. However, the paper could benefit from more extensive experimentation across diverse datasets to validate the generalizability of the findings.
The paper provides detailed implementation information, including model architecture, training procedures, and hyperparameter settings, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to replicate the results independently.
One notable limitation is the potential overfitting observed when increasing the number of online registers. The paper indicates that using too many registers may degrade performance, suggesting a need for careful tuning. Additionally, the future prediction loss's effectiveness appears to vary depending on the dataset and chunk size, indicating that its benefits may not be universally applicable.
The proposed method has significant implications for real-time speech recognition systems, especially in applications requiring low-latency processing. By effectively mitigating the lack of future context, this research could enhance the performance of speech models in various domains, including virtual assistants, transcription services, and accessibility tools for the hearing impaired. The lightweight nature of the online registers also suggests potential for deployment in resource-constrained environments.
Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the singular most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proved to be the primary driver for increasing accuracy. Ultimately, this work outlines a highly optimized dual pipeline achieving a ~0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
Primary: KUET
All Institutions: KUET, BUET
The main contribution of this paper is the introduction of the Lipi-Ghor-882 dataset and the development of optimized ASR and speaker diarization pipelines for Bengali audio. This work addresses critical gaps in the field, providing a foundation for future research and applications in low-resource speech processing. The comprehensive analysis of methodologies and results highlights the significance of targeted fine-tuning and heuristic post-processing in overcoming the challenges of long-form audio processing.
The methodology presented in the paper is robust, focusing on two critical areas: Automatic Speech Recognition (ASR) and speaker diarization for Bengali audio. The authors systematically evaluated various architectures and fine-tuning strategies, demonstrating a clear understanding of the challenges associated with long-form audio processing. The innovative use of perfectly aligned annotations and synthetic acoustic degradation for ASR is particularly noteworthy. The shift from model retraining to heuristic post-processing for diarization reflects a pragmatic approach to overcoming the limitations of existing models in this domain.
The experiments conducted are thorough, with a clear focus on optimizing inference time and accuracy. The introduction of the Lipi-Ghor-882 dataset is a significant contribution, as it addresses the scarcity of resources for Bengali ASR and diarization. The results indicate a well-structured evaluation process, with the authors providing insights into the performance of various models and the effectiveness of their proposed methods. The achievement of a Real-Time Factor (RTF) of 0.019 is impressive and establishes a benchmark for future work.
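The reported $\sim$0.019 RTF corresponds to processing audio at roughly 50x real time. The metric itself is a simple ratio; a minimal sketch follows (the 68.4-second processing time below is an illustrative example, not a figure from the paper):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# Illustrative: processing a 1-hour recording in 68.4 s gives RTF = 0.019,
# i.e. a throughput of about 52x real time.
rtf = real_time_factor(68.4, 3600.0)
speedup = 1.0 / rtf
```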
While the paper outlines the methodologies and experiments in detail, it lacks specific implementation details and code availability, which are crucial for reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the findings or build upon this work.
The paper acknowledges several limitations, including compute resource constraints, data quality issues, and the challenges of model optimization for Bengali features. The reliance on heuristic post-processing for diarization, while effective, indicates that further research is needed to develop more robust models for this task.
This research has significant implications for the field of speech processing, particularly for low-resource languages like Bengali. The introduction of a large-scale dataset and the development of optimized pipelines can facilitate advancements in conversational AI, making it more accessible for Bengali speakers. The findings may also inspire similar approaches in other low-resource language contexts.
Audio-visual deepfake detection (AVD) is increasingly important as modern generators can fabricate convincing speech and video. Most current multimodal detectors are small, task-specific models: they work well on curated tests but scale poorly and generalize weakly across domains. We introduce AV-LMMDetect, a supervised fine-tuned (SFT) large multimodal model that casts AVD as a prompted yes/no classification: "Is this video real or fake?". Built on Qwen 2.5 Omni, it jointly analyzes audio and visual streams for deepfake detection and is trained in two stages: lightweight LoRA alignment followed by full fine-tuning of the audio-visual encoders. On FakeAVCeleb and Mavos-DD, AV-LMMDetect matches or surpasses prior methods and sets a new state of the art on the Mavos-DD dataset.
Primary: Fudan University
All Institutions: Fudan University, Tencent Youtu Lab
The main contribution of this paper is the introduction of AV-LMMDetect, a large multimodal model for audio-visual deepfake detection that utilizes a novel two-stage training approach, achieving state-of-the-art results and demonstrating the potential of large models in enhancing detection capabilities. The methodology and results present a meaningful advancement in the field, addressing critical challenges in deepfake detection and setting a foundation for future research.
The paper introduces AV-LMMDetect, a novel supervised fine-tuned large multimodal model for audio-visual deepfake detection, employing a two-stage training strategy that combines lightweight LoRA alignment with full fine-tuning of audio-visual encoders. This methodology is innovative as it reformulates the detection task into a binary question-answering format, which is a fresh approach in the context of deepfake detection. The use of a large multimodal model also enhances the model's ability to capture cross-modal inconsistencies, a significant improvement over traditional methods that often rely on smaller, task-specific models.
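Under the binary question-answering reformulation, each labeled clip becomes an ordinary chat-style SFT example rather than a logit over a classification head. A minimal sketch of that conversion, with hypothetical field names (the paper does not specify its exact data schema):

```python
def to_sft_example(video_path: str, audio_path: str, is_fake: bool) -> dict:
    """Cast binary deepfake detection as a prompted yes/no chat example.

    Field names ("messages", "role", "content") follow a generic chat-SFT
    schema and are hypothetical; the paper's exact format may differ.
    """
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video", "path": video_path},
                    {"type": "audio", "path": audio_path},
                    {"type": "text", "text": "Is this video real or fake?"},
                ],
            },
            {"role": "assistant", "content": "fake" if is_fake else "real"},
        ]
    }
```

Framing the task this way means training reduces to standard next-token supervision on the assistant's answer, which is what lets a general-purpose multimodal LLM be reused without a task-specific head.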
The experimental results demonstrate that AV-LMMDetect achieves state-of-the-art performance on the MAVOS-DD dataset and competitive results on FakeAVCeleb. The paper provides thorough evaluations across multiple scenarios, including in-domain and open-set conditions, showcasing the model's robustness and generalization capabilities. The use of standard binary classification metrics (accuracy, AUC, and mAP) adds rigor to the evaluation process, although the paper could benefit from more detailed comparisons with additional baseline methods.
The paper lacks explicit details regarding the implementation and hyperparameter settings, which are crucial for reproducibility. While the methodology is described, the absence of a dedicated section on experimental setup and code availability limits the ability of other researchers to replicate the results. Including a link to a code repository or supplementary materials would significantly enhance reproducibility.
One limitation of the study is the reliance on two specific datasets, which may not fully represent the diversity of deepfake techniques in real-world applications. Additionally, while the model shows strong performance, the paper does not address potential biases in the datasets or the model's performance across different demographic groups, which is critical for ethical considerations in deployment.
The implications of this research are significant, as robust audio-visual deepfake detection is crucial for maintaining media integrity and public trust in an era of increasing misinformation. The proposed model could be applied in various domains, including social media, journalism, and law enforcement, to identify and mitigate the risks posed by deepfake technology.
Multi-channel speech separation in dynamic environments is challenging as time-varying spatial and spectral features evolve at different temporal scales. Existing methods typically employ sequential architectures, forcing a single network stream to simultaneously model both feature types, creating an inherent modeling conflict. In this paper, we propose a dual-branch parallel spectral-spatial (PS2) architecture that separately processes spectral and spatial features through parallel streams. The spectral branch uses a bi-directional long short-term memory (BLSTM)-based frequency module, a Mamba-based temporal module, and a self-attention module to model spectral features. The spatial branch employs bi-directional gated recurrent unit (BGRU) networks to process spatial features that encode the evolving geometric relationships between sources and microphones. Features from both branches are integrated through a cross-attention fusion mechanism that adaptively weights their contributions. Experimental results demonstrate that the PS2 architecture outperforms existing state-of-the-art (SOTA) methods by 1.6-2.2 dB in scale-invariant signal-to-distortion ratio (SI-SDR) for moving speaker scenarios, with robust separation quality under different reverberation times (RT60), noise levels, and source movement speeds. Even with fast source movements, the proposed model maintains SI-SDR improvements of over 13 dB. These improvements are consistently observed across multiple datasets, including WHAMR! and our generated WSJ0-Demand-6ch-Move dataset.
Primary: Tampere University
All Institutions: Tampere University, Nokia Technologies
The paper presents a novel dual-branch architecture for moving speaker separation that effectively addresses challenges in multi-channel speech processing. The methodology is well-founded and the experimental results demonstrate substantial improvements over existing methods, highlighting its potential impact in practical applications.
The proposed dual-branch parallel spectral-spatial (PS2) architecture represents a significant methodological advancement in the field of multi-channel speech separation. By separating the processing of spectral and spatial features, the authors effectively address the inherent modeling conflicts present in existing sequential architectures. The use of bi-directional long short-term memory (BLSTM) and bi-directional gated recurrent unit (BGRU) networks, along with a cross-attention fusion mechanism, allows for more nuanced feature extraction and integration. This approach is well-grounded in the theoretical understanding of the different temporal scales at which spectral and spatial features evolve, making it a thoughtful and innovative contribution to the field.
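The cross-attention fusion step can be illustrated with a stripped-down sketch: spectral frames act as queries over spatial frames, and the attended spatial context is added back residually. This omits the learned projections and adaptive weighting of the actual PS2 model and is only a conceptual illustration:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(spectral: np.ndarray, spatial: np.ndarray) -> np.ndarray:
    """Spectral frames (T_spec, d) attend to spatial frames (T_spat, d).

    Identity Q/K/V projections and a plain residual sum keep the sketch
    minimal; the real model learns these projections and the fusion weights.
    """
    d = spectral.shape[-1]
    scores = spectral @ spatial.T / np.sqrt(d)      # (T_spec, T_spat)
    weights = softmax(scores, axis=-1)              # rows sum to 1
    fused = weights @ spatial                       # attended spatial context
    return spectral + fused                         # residual combination
```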
The experimental setup is robust, utilizing multiple datasets including WHAMR! and a newly generated WSJ0-Demand-6ch-Move dataset specifically designed for moving speaker scenarios. The results demonstrate clear improvements over state-of-the-art methods, with significant gains in scale-invariant signal-to-distortion ratio (SI-SDR) across varying acoustic conditions. The ablation studies provide valuable insights into the contributions of each component of the PS2 architecture, reinforcing the importance of the dual-branch design. However, the paper could benefit from a more detailed analysis of the computational efficiency and potential trade-offs involved in the proposed architecture.
The paper provides a comprehensive description of the architecture, training configurations, and datasets used, which supports reproducibility. However, the absence of publicly available code or a project URL limits the ability for others to replicate the results directly. Future work could include releasing the model and training scripts to enhance reproducibility within the research community.
One limitation of the study is the reliance on synthetic datasets for moving speaker scenarios, which may not fully capture the complexities of real-world environments. Additionally, while the model shows strong performance in various conditions, the authors do not extensively discuss its limitations in extreme acoustic scenarios or potential failure modes. The evaluation metrics primarily focus on SI-SDR, which, while important, may not encompass all aspects of speech quality and intelligibility.
The advancements in multi-channel speech separation have significant implications for various applications, including voice recognition systems, hearing aids, and telecommunication technologies. The ability to effectively separate moving speakers in dynamic environments could enhance user experiences in real-world applications, making this research particularly relevant in the context of increasing reliance on audio processing technologies in daily life.
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
Primary: Jilin University
All Institutions: Hunan University, Jilin University, Shandong University, University of Electronic Science and Technology of China
UniWhisper presents a novel approach to continual multi-task training for universal audio representation, achieving competitive performance across diverse audio tasks while maintaining strong speech capabilities. This work significantly contributes to the field by addressing the limitations of existing models and offering a streamlined methodology that enhances both efficiency and effectiveness in audio representation learning.
The methodology presented in UniWhisper is innovative, focusing on a unified instruction and answer format for continual multi-task training. This approach effectively addresses the limitations of existing audio encoders that excel in specific domains but struggle with others. By leveraging a single encoder and a compact pretrained language model as the decoder, the authors streamline the training process and reduce redundancy in audio token representation. The decision to utilize shallow MLP probes and kNN for evaluation is appropriate, as it allows for a clear assessment of the model's performance across diverse tasks.
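A kNN probe evaluates a frozen encoder with no trained parameters at all: each test embedding is classified by majority vote among its nearest training embeddings. A minimal sketch, assuming cosine similarity (the paper's exact distance metric and choice of k are not specified here):

```python
import numpy as np

def knn_probe_accuracy(train_x, train_y, test_x, test_y, k=5):
    """Classify frozen embeddings by majority vote over cosine neighbours.

    Because nothing is trained, accuracy directly reflects how well-separated
    the classes already are in the encoder's representation space.
    """
    tn = train_x / np.linalg.norm(train_x, axis=1, keepdims=True)
    qn = test_x / np.linalg.norm(test_x, axis=1, keepdims=True)
    sims = qn @ tn.T                              # cosine similarities
    idx = np.argsort(-sims, axis=1)[:, :k]        # k nearest per query
    preds = []
    for row in idx:
        labels, counts = np.unique(train_y[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])   # majority vote
    return float(np.mean(np.array(preds) == test_y))
```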
The experimental setup is robust, utilizing a substantial training dataset of 38k hours of public audio, which enhances the generalizability of the results. The evaluation across 20 tasks provides a comprehensive view of the model's capabilities. The results indicate that UniWhisper outperforms existing models like Whisper, HuBERT, and others, particularly in non-speech tasks, which is a significant achievement. The use of normalized weighted averages for performance metrics is a strong point, ensuring comparability across different tasks.
The paper mentions that code and pretrained weights will be released upon acceptance, which is crucial for reproducibility. The methodology is adequately described, including specific hyperparameters and training configurations, allowing other researchers to replicate the experiments. The use of a compact pretrained language model as a decoder is also a noteworthy detail that aids in understanding the architecture.
One limitation is the reliance on a single encoder, which may not capture all nuances across diverse audio tasks as effectively as dual-encoder systems. Additionally, while the results are promising, the paper does not extensively discuss the potential trade-offs in performance when adapting UniWhisper to other audio domains not covered in the evaluation.
The implications of this work are significant, as it proposes a more efficient method for training universal audio representations that can be applied in various applications, including speech recognition, environmental sound classification, and music analysis. The potential for reduced training costs and improved performance across multiple tasks could lead to advancements in audio processing technologies, making them more accessible and effective in real-world applications.
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages. While a wealth of spoken content is accessible in television dramas and online videos, Taiwanese Hokkien exemplifies this issue, with transcriptions often being scarce and the majority of available subtitles provided only in Mandarin. To address this deficiency, we introduce TG-ASR for Taiwanese Hokkien drama speech recognition, a translation-guided ASR framework that utilizes multilingual translation embeddings to enhance recognition performance in low-resource environments. The framework is centered around the parallel gated cross-attention (PGCA) mechanism, which adaptively integrates embeddings from various auxiliary languages into the ASR decoder. This mechanism facilitates robust cross-linguistic semantic guidance while ensuring stable optimization and minimizing interference between languages. To support ongoing research initiatives, we present YT-THDC, a 30-hour corpus of Taiwanese Hokkien drama speech with aligned Mandarin subtitles and manually verified Taiwanese Hokkien transcriptions. Comprehensive experiments and analyses identify the auxiliary languages that most effectively enhance ASR performance, achieving a 14.77% relative reduction in character error rate and demonstrating the efficacy of translation-guided learning for underrepresented languages in practical applications.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the TG-ASR framework, which effectively utilizes translation-guided learning through a novel PGCA mechanism to enhance automatic speech recognition for low-resource languages. This research addresses critical gaps in ASR technology and provides a valuable resource for future studies in multilingual and low-resource language processing.
The methodology presented in TG-ASR is innovative, particularly the introduction of the Parallel Gated Cross-Attention (PGCA) mechanism, which adaptively integrates multilingual translation embeddings into the ASR decoder. This approach is well-justified, addressing the specific challenges of low-resource languages by leveraging auxiliary languages to improve transcription accuracy. The two-stage training process is clearly articulated, ensuring that the model benefits from both initial fine-tuning and subsequent integration of multilingual embeddings. However, the reliance on pre-trained models for auxiliary language embeddings and the potential noise introduced by machine translations are notable considerations.
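The "gated" part of a mechanism like PGCA can be sketched as a learned sigmoid gate that decides how much auxiliary-translation context to mix into each decoder state. This is a hypothetical single-step illustration, not the paper's actual parallel multi-branch implementation:

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(decoder_h: np.ndarray, aux_ctx: np.ndarray,
                 W_g: np.ndarray, b_g: np.ndarray) -> np.ndarray:
    """Mix auxiliary-translation context into decoder states via a sigmoid gate.

    decoder_h, aux_ctx: (T, d); W_g: (2d, d); b_g: (d,). The gate is computed
    from both inputs, so the model can learn to suppress a noisy or
    uninformative auxiliary language rather than blending it in blindly.
    """
    gate = sigmoid(np.concatenate([decoder_h, aux_ctx], axis=-1) @ W_g + b_g)
    return decoder_h + gate * aux_ctx
```

With multiple auxiliary languages, one such gated branch per language, combined in parallel, is what keeps the streams from interfering during optimization.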
The experiments are comprehensive, utilizing the newly created YT-THDC corpus, which is a significant contribution to the field of low-resource ASR. The results demonstrate a substantial reduction in character error rate (CER), validating the effectiveness of the proposed framework. The ablation studies provide insights into the contributions of various components of the PGCA mechanism, reinforcing the robustness of the findings. However, the paper could benefit from additional comparative analyses with state-of-the-art models to contextualize the performance gains more clearly.
The paper provides a detailed description of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly accessible code repository or demo limits the ability for other researchers to replicate the results fully. The authors should consider releasing their code and model weights to facilitate further research and validation.
The study acknowledges several limitations, including the size and domain specificity of the YT-THDC corpus, which may restrict generalizability. Additionally, the reliance on auxiliary translations introduces potential noise that could affect performance. The findings are also specific to Taiwanese Hokkien, and the effectiveness of the approach for other low-resource languages remains to be validated.
The work has significant implications for the preservation of endangered languages and the accessibility of media content. By improving ASR for Taiwanese Hokkien, the research contributes to cultural preservation efforts and enhances bilingual accessibility. The methodology could be adapted for other low-resource languages, potentially benefiting a wider range of linguistic communities.
Despite strong performance in audio perception tasks, large audio-language models (AudioLLMs) remain opaque to interpretation. A major factor behind this lack of interpretability is that individual neurons in these models frequently activate in response to several unrelated concepts. We introduce the first mechanistic interpretability framework for AudioLLMs, leveraging sparse autoencoders (SAEs) to disentangle polysemantic activations into monosemantic features. Our pipeline identifies representative audio clips, assigns meaningful names via automated captioning, and validates concepts through human evaluation and steering. Experiments show that AudioLLMs encode structured and interpretable features, enhancing transparency and control. This work provides a foundation for trustworthy deployment in high-stakes domains and enables future extensions to larger models, multilingual audio, and more fine-grained paralinguistic features. Project URL: https://townim-faisal.github.io/AutoInterpret-AudioLLM/
Primary: Dolby Laboratories
All Institutions: Dolby Laboratories
The paper presents AR&D, a pioneering framework for interpreting AudioLLMs by disentangling polysemantic activations into interpretable features, significantly advancing the field of audio machine learning. The methodology is innovative and well-executed, demonstrating substantial technical impact and relevance in enhancing model interpretability.
The paper introduces the AR&D framework, which effectively utilizes sparse autoencoders to disentangle polysemantic activations in AudioLLMs. The methodology is well-structured, comprising three main components: feature disentanglement, representative audio retrieval, and interpretable concept naming. The use of human evaluation to validate the concepts adds robustness to the approach. However, the reliance on automated captioning for naming may introduce biases or inaccuracies, which should be addressed in future work.
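A sparse autoencoder of the kind used here maps an activation vector to an overcomplete, mostly-zero feature code and reconstructs the input from it; the L1 penalty is what pushes individual features toward monosemanticity. A minimal numpy sketch (layer sizes and the untied decoder are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """One forward pass of a ReLU sparse autoencoder over activations x.

    x: (batch, d); W_enc: (d, m) with m >> d (overcomplete); W_dec: (m, d).
    Returns the sparse code, the reconstruction, and the training loss
    (MSE reconstruction + L1 sparsity penalty on the code).
    """
    z = np.maximum(0.0, x @ W_enc + b_enc)      # sparse, non-negative features
    x_hat = z @ W_dec + b_dec                   # linear reconstruction
    recon = float(np.mean((x - x_hat) ** 2))
    sparsity = float(l1_coeff * np.abs(z).mean())
    return z, x_hat, recon + sparsity
```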
The experiments are comprehensive, comparing AR&D against multiple baseline methods. The metrics used for evaluation, including precision, recall, and F1 scores, are appropriate for assessing interpretability. The results demonstrate a clear advantage of AR&D over the baselines, with significant improvements in semantic alignment and steering sensitivity. However, the paper could benefit from more extensive ablation studies to further validate the contributions of each component in the pipeline.
The implementation details provided are thorough, including specifics on datasets, training parameters, and model architectures. However, the paper lacks a complete code release or clear instructions for reproducing the experiments, which could hinder reproducibility efforts by other researchers.
One limitation is the potential bias in automated captioning, which may not always accurately reflect human perceptions of audio concepts. Additionally, the framework's performance may vary with different AudioLLMs, and the scalability to larger models or diverse datasets remains to be fully explored.
The proposed framework has significant implications for enhancing the interpretability of AudioLLMs, which is crucial for their deployment in high-stakes applications such as healthcare and assistive technologies. By improving transparency, AR&D can foster trust in AI systems that rely on audio data, paving the way for more responsible and ethical AI practices.