Training a unified model integrating video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant application flexibility, yet faces two unexplored foundational challenges: (1) the scarcity of high-quality audio captions with tight A-V-T alignment, leading to severe semantic conflict between multimodal conditions, and (2) cross-task and intra-task competition, manifesting as an adverse V2A-T2A performance trade-off and modality bias in the VT2A task. First, to address data scarcity, we introduce SoundAtlas, a large-scale dataset (470k pairs) whose captions significantly surpass those of existing datasets and even human experts in quality. Powered by a novel agentic pipeline, it integrates Vision-to-Language Compression to mitigate the visual bias of MLLMs, a Junior-Senior Agent Handoff for a fivefold cost reduction, and rigorous Post-hoc Filtering to ensure fidelity. Consequently, SoundAtlas delivers semantically rich and temporally detailed captions with tight A-V-T alignment. Second, we propose Omni2Sound, a unified VT2A diffusion model supporting flexible input modalities. To resolve the inherent cross-task and intra-task competition, we design a three-stage multi-task progressive training schedule that converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task, maintaining both audio-visual alignment and faithful off-screen audio generation. Finally, we construct VGGSound-Omni, a comprehensive benchmark for unified evaluation, including challenging off-screen tracks. With a standard DiT backbone, Omni2Sound achieves unified SOTA performance across all three tasks within a single model, demonstrating strong generalization across benchmarks with heterogeneous input conditions. The project page is at https://swapforward.github.io/Omni2Sound.
Primary: Monash University
All Institutions: Monash University, Shengshu AI, Tsinghua University
The paper presents a significant advancement in unified video-text-to-audio generation through the introduction of the SoundAtlas dataset and the Omni2Sound model, addressing critical challenges in the field. The comprehensive methodology and robust experimental evaluation highlight its potential impact on future research and applications in audio generation.
The paper presents a comprehensive methodology that addresses two foundational challenges in unified audio generation: data scarcity and task competition. The introduction of the SoundAtlas dataset, which significantly enhances the quality and alignment of audio captions, is a notable contribution. The Omni2Sound model employs a diffusion-based architecture and a three-stage progressive training schedule, effectively mitigating cross-task and intra-task competition. The methodology is well-structured, leveraging advanced multimodal techniques and a decoupled training approach that enhances audio-visual alignment and generation fidelity.
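To make the competition-resolution idea above concrete, the sketch below shows one way a staged multi-task schedule could be wired: each stage defines a task-sampling mixture over T2A/V2A/VT2A and condition-dropout rates for the text and video branches of a diffusion model. The stage names, mixture weights, and dropout rates are illustrative assumptions; the paper's actual schedule is not specified in this summary.

```python
import random

# Hypothetical per-stage task mixtures and condition-dropout rates; the actual
# values and stage definitions in Omni2Sound are not given in this summary.
STAGES = {
    "stage1_single_task": {"tasks": {"t2a": 0.5, "v2a": 0.5}, "drop_text": 0.1, "drop_video": 0.1},
    "stage2_joint":       {"tasks": {"t2a": 0.3, "v2a": 0.3, "vt2a": 0.4}, "drop_text": 0.1, "drop_video": 0.1},
    "stage3_vt2a_focus":  {"tasks": {"vt2a": 0.7, "t2a": 0.15, "v2a": 0.15}, "drop_text": 0.2, "drop_video": 0.2},
}

def sample_conditions(stage_name: str):
    """Pick a task and decide which conditions the diffusion model sees this step."""
    cfg = STAGES[stage_name]
    tasks, weights = zip(*cfg["tasks"].items())
    task = random.choices(tasks, weights=weights, k=1)[0]
    use_text = task in ("t2a", "vt2a") and random.random() > cfg["drop_text"]
    use_video = task in ("v2a", "vt2a") and random.random() > cfg["drop_video"]
    return task, use_text, use_video

if __name__ == "__main__":
    for _ in range(5):
        print(sample_conditions("stage2_joint"))
```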
The experimental evaluation is robust, utilizing a large-scale dataset and a comprehensive benchmark (VGGSound-Omni) to assess the performance of the Omni2Sound model across multiple tasks (V2A, T2A, VT2A). The results demonstrate state-of-the-art performance, with significant improvements over existing models. The paper includes extensive ablation studies that validate the proposed methods and their effectiveness in resolving competition between tasks. However, the paper could benefit from more detailed discussions on the statistical significance of the results.
The paper provides a clear description of the methodologies and datasets used, but it lacks specific implementation details that would facilitate reproducibility. While the authors mention the use of a standard DiT backbone, further information on hyperparameters, training procedures, and code availability would enhance reproducibility.
One limitation is the reliance on the quality of the SoundAtlas dataset, which, while claimed to outperform existing datasets, may still have inherent biases or limitations that could affect the model's performance. Additionally, the paper does not address potential scalability issues when applying the model to larger datasets or real-world applications.
The proposed unified model has significant implications for various applications in audio generation, including film production, video games, and virtual reality, where high-quality audio generation from multimodal inputs is crucial. The advancements in audio-visual alignment and the introduction of a comprehensive benchmark could drive further research in the field, fostering innovation in multimodal AI systems.
Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of MM-Sonate, a unified framework for multimodal audio-video generation that achieves state-of-the-art performance and introduces innovative techniques for zero-shot voice cloning. This paper significantly advances the field by addressing critical challenges in audio-video synchronization and speaker identity preservation, paving the way for more sophisticated generative models.
The proposed methodology, MM-Sonate, introduces a unified multimodal flow-matching framework that integrates audio-video generation with zero-shot voice cloning capabilities. The use of a unified instruction-phoneme input format is innovative, allowing for precise synchronization and control over the generated outputs. The timbre injection mechanism effectively decouples speaker identity from linguistic content, which is a significant advancement over existing models that struggle with this aspect. The introduction of a noise-based negative conditioning strategy enhances acoustic fidelity, addressing limitations in traditional classifier-free guidance approaches. Overall, the methodology is well-structured and addresses key challenges in the field.
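The noise-based negative conditioning idea can be illustrated as a variant of classifier-free guidance in which the unconditional branch is replaced by an embedding of a natural-noise prompt. The sketch below assumes a generic flow-matching predictor and a precomputed `neg_cond` embedding; it is a reading of the strategy described above, not MM-Sonate's actual implementation.

```python
import torch

def guided_velocity(model, x_t, t, cond, neg_cond, scale: float = 3.0):
    """
    Classifier-free-style guidance with an explicit negative condition.
    `neg_cond` is assumed to encode a natural-noise prompt rather than the
    usual null embedding; this is illustrative, not the paper's code.
    """
    v_pos = model(x_t, t, cond)       # prediction with the real multimodal condition
    v_neg = model(x_t, t, neg_cond)   # prediction with the noise-based negative condition
    return v_neg + scale * (v_pos - v_neg)

if __name__ == "__main__":
    dummy = lambda x, t, c: x * 0.0 + c.mean()   # stand-in for a flow-matching backbone
    x = torch.randn(2, 8); t = torch.tensor(0.5)
    cond, neg = torch.randn(2, 4), torch.randn(2, 4)
    print(guided_velocity(dummy, x, t, cond, neg).shape)
```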
The empirical evaluations are robust, demonstrating that MM-Sonate achieves state-of-the-art performance across various benchmarks, particularly in lip synchronization and speech intelligibility. The paper provides comprehensive comparisons against existing models, showcasing significant improvements in both objective metrics and human preference evaluations. The extensive dataset preparation, including a high-fidelity synthetic dataset and a large-scale multimodal pre-training corpus, supports the model's generalization capabilities across diverse generation tasks.
The paper includes detailed implementation details, including architecture specifications, training strategies, and evaluation metrics. However, the lack of a publicly available code repository or demo limits reproducibility. The methodology is described in sufficient detail for other researchers to replicate the experiments, but access to the actual model and datasets would enhance reproducibility further.
The paper acknowledges several limitations, including challenges in generating long-form content and potential loss of high-frequency details due to compression. Additionally, the model may struggle with extreme head poses or occlusions, leading to synchronization issues. Ethical concerns regarding the misuse of voice cloning technology are also highlighted, emphasizing the need for responsible deployment.
MM-Sonate has significant potential applications in various domains, including entertainment, education, and virtual reality, where personalized audio-visual content is valuable. However, the ethical implications of voice cloning and the potential for misuse in creating deepfakes and misinformation must be carefully managed. The authors propose mitigation strategies, such as embedding watermarks in generated content, which is a proactive approach to addressing these concerns.
A binaural rendering framework for personal sound zones (PSZs) is proposed to enable multiple head-tracked listeners to receive fully independent stereo audio programs. Current PSZ systems typically rely on monophonic rendering and therefore cannot control the left and right ears separately, which limits the quality and accuracy of spatial imaging. The proposed method employs a Binaural Spatially Adaptive Neural Network (BSANN) to generate ear-optimized loudspeaker filters that reconstruct the desired acoustic field at each ear of multiple listeners. The framework integrates anechoically measured loudspeaker frequency responses, analytically modeled transducer directivity, and rigid-sphere head-related transfer functions (HRTFs) to enhance acoustic accuracy and spatial rendering fidelity. An explicit active crosstalk cancellation (XTC) stage further improves three-dimensional spatial perception. Experiments show significant gains in measured objective performance metrics, including inter-zone isolation (IZI), inter-program isolation (IPI), and crosstalk cancellation (XTC), with log-frequency-weighted values of 10.23/10.03 dB (IZI), 11.11/9.16 dB (IPI), and 10.55/11.13 dB (XTC), respectively, over 100-20,000 Hz. The combined use of ear-wise control, accurate acoustic modeling, and integrated active XTC produces a unified rendering method that delivers greater isolation performance, increased robustness to room asymmetry, and more faithful spatial reproduction in real acoustic environments.
Primary: Princeton University
All Institutions: Princeton University
The paper presents a novel framework for binaural audio rendering that significantly improves isolation and crosstalk cancellation for multiple listeners. The integration of physically informed acoustic modeling and the innovative two-stage training process represent substantial advancements in the field of spatial audio rendering.
The proposed Binaural Spatially Adaptive Neural Network (BSANN) framework innovatively extends the existing monophonic Spatially Adaptive Neural Network (SANN) by enabling ear-wise control for multiple head-tracked listeners. The methodology is robust, integrating anechoically measured loudspeaker frequency responses, analytic piston directivity, and rigid-sphere HRTFs, which collectively enhance the acoustic realism and fidelity of the audio rendering. The two-stage training process, which includes a pretraining phase for personal sound zones followed by an active crosstalk cancellation (XTC) stage, is well-structured and addresses the limitations of previous models effectively.
The experimental setup is comprehensive, utilizing a 24-loudspeaker array and head-and-torso simulators to evaluate the performance of the BSANN framework under realistic acoustic conditions. The metrics for evaluation—inter-zone isolation (IZI), inter-program isolation (IPI), and crosstalk cancellation (XTC)—are appropriate and provide a clear indication of the framework's effectiveness. The results demonstrate significant improvements over the monophonic SANN, highlighting the benefits of the BSANN architecture in achieving balanced performance across listeners.
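For readers unfamiliar with the reported metrics, IZI, IPI, and XTC are all energy ratios between an intended and an unintended rendering target, averaged with logarithmic frequency weighting over 100-20,000 Hz. A minimal sketch of such a log-frequency-weighted ratio is given below; the specific weighting, smoothing, and band edges used in the paper are assumptions here.

```python
import numpy as np

def log_freq_weighted_ratio(p_target, p_leak, freqs, f_lo=100.0, f_hi=20000.0):
    """
    Generic isolation metric in dB: energy delivered to the intended ear/zone
    vs. energy leaking elsewhere, averaged with logarithmic frequency weighting.
    IZI, IPI and XTC are ratios of this kind; the exact weighting here is an
    assumption, not the authors' definition.
    p_target, p_leak: complex pressure responses per frequency bin.
    """
    band = (freqs >= f_lo) & (freqs <= f_hi)
    f = freqs[band]
    ratio_db = 10.0 * np.log10(np.abs(p_target[band]) ** 2 / np.abs(p_leak[band]) ** 2)
    w = 1.0 / f                       # ~uniform weight per octave (d ln f = df / f)
    return float(np.sum(w * ratio_db) / np.sum(w))

if __name__ == "__main__":
    freqs = np.linspace(20, 20000, 1000)
    p_t = np.ones_like(freqs) * (1 + 0j)      # toy bright-zone response
    p_l = np.ones_like(freqs) * (0.3 + 0j)    # toy leakage response
    print(round(log_freq_weighted_ratio(p_t, p_l, freqs), 2), "dB")
```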
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details that would facilitate reproducibility, such as hyperparameter tuning processes and code availability. The absence of a project URL or demo page further limits the ability for others to replicate the findings.
One limitation of the study is its focus on a two-listener scenario, which may not fully capture the complexities of larger multi-listener environments. Additionally, while the authors mention future work involving full six-degree-of-freedom head motion, this aspect remains unaddressed in the current framework. The reliance on specific loudspeaker configurations may also limit the generalizability of the findings to other setups.
The BSANN framework has significant implications for various applications, including home entertainment systems, automotive audio, and public environments, where personalized audio experiences are increasingly sought after. By enabling independent audio programs for multiple listeners within a shared space, this research contributes to the advancement of spatial audio technologies and enhances user experience in immersive environments.
Audio deepfake detection has become increasingly challenging due to rapid advances in speech synthesis and voice conversion technologies, particularly under channel distortions, replay attacks, and real-world recording conditions. This paper proposes a resolution-aware audio deepfake detection framework that explicitly models and aligns multi-resolution spectral representations through cross-scale attention and consistency learning. Unlike conventional single-resolution or implicit feature-fusion approaches, the proposed method enforces agreement across complementary time-frequency scales. The proposed framework is evaluated on three representative benchmarks: ASVspoof 2019 (LA and PA), the Fake-or-Real (FoR) dataset, and the In-the-Wild Audio Deepfake dataset under a speaker-disjoint protocol. The method achieves near-perfect performance on ASVspoof LA (EER 0.16%), strong robustness on ASVspoof PA (EER 5.09%), FoR rerecorded audio (EER 4.54%), and in-the-wild deepfakes (AUC 0.98, EER 4.81%), significantly outperforming single-resolution and non-attention baselines under challenging conditions. The proposed model remains lightweight and efficient, requiring only 159k parameters and less than 1 GFLOP per inference, making it suitable for practical deployment. Comprehensive ablation studies confirm the critical contributions of cross-scale attention and consistency learning, while gradient-based interpretability analysis reveals that the model learns resolution-consistent and semantically meaningful spectral cues across diverse spoofing conditions. These results demonstrate that explicit cross-resolution modeling provides a principled, robust, and scalable foundation for next-generation audio deepfake detection systems.
Primary: Bangladesh University of Engineering and Technology (BUET)
All Institutions: Bangladesh University of Engineering and Technology (BUET)
This paper presents a resolution-aware audio deepfake detection framework that significantly enhances detection capabilities by explicitly modeling interactions among multiple spectral resolutions. The comprehensive evaluation and innovative methodology position this work as a valuable contribution to the field of audio processing and machine learning.
The proposed methodology introduces a resolution-aware framework that leverages cross-scale attention and consistency learning to enhance audio deepfake detection. This approach is innovative as it explicitly models interactions among multiple spectral resolutions, addressing limitations of existing methods that typically rely on single-resolution features. The architecture is well-structured, with a shared encoder and attention mechanism that adaptively integrates information across resolutions. The use of a consistency learning objective to enforce resolution-invariant characteristics is particularly noteworthy, as it aims to improve robustness against channel distortions and replay attacks.
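As a rough illustration of the cross-resolution idea, the sketch below encodes two mel-spectrogram resolutions of the same waveform with a shared encoder and adds a consistency penalty between the resulting embeddings. The resolutions, the tiny encoder, and the averaging used in place of the paper's cross-scale attention are all assumptions for illustration, not the authors' 159k-parameter architecture.

```python
import torch
import torch.nn as nn
import torchaudio

class MultiResConsistency(nn.Module):
    """Shared encoder over two mel resolutions plus a resolution-agreement penalty (illustrative)."""
    def __init__(self, sr: int = 16000, emb: int = 64):
        super().__init__()
        self.mel_fine = torchaudio.transforms.MelSpectrogram(sr, n_fft=512, hop_length=128, n_mels=64)
        self.mel_coarse = torchaudio.transforms.MelSpectrogram(sr, n_fft=2048, hop_length=512, n_mels=64)
        self.encoder = nn.Sequential(nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, emb))
        self.head = nn.Linear(emb, 1)  # bona fide vs. spoof logit

    def forward(self, wav):
        z_f = self.encoder(self.mel_fine(wav).unsqueeze(1))
        z_c = self.encoder(self.mel_coarse(wav).unsqueeze(1))
        logit = self.head(0.5 * (z_f + z_c))          # stand-in for cross-scale attention fusion
        consistency = torch.mean((z_f - z_c) ** 2)    # consistency-learning term
        return logit, consistency

if __name__ == "__main__":
    model = MultiResConsistency()
    logit, cons = model(torch.randn(2, 16000))
    print(logit.shape, float(cons))
```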
The experimental evaluation is comprehensive, utilizing three diverse datasets: ASVspoof 2019, Fake-or-Real (FoR), and In-the-Wild Audio Deepfake datasets. The results demonstrate strong performance, achieving near-perfect accuracy on controlled benchmarks and maintaining robustness under challenging conditions. The ablation studies effectively highlight the contributions of different components of the model, confirming the importance of cross-scale attention and consistency learning in enhancing detection capabilities.
The paper mentions that the code will be made publicly available upon acceptance, which is a positive aspect for reproducibility. However, the lack of a demo URL or direct access to the code at this stage limits immediate reproducibility. The methodology is described in sufficient detail, allowing for potential replication of the experiments.
Some limitations include the exclusive reliance on Mel-spectrogram representations, which may not capture all relevant features for audio deepfake detection. Additionally, the consistency learning objective is applied only to bona fide speech, which may not universally apply across all recording conditions. The paper also acknowledges the evolving nature of generative speech models, suggesting that the proposed method may need continuous adaptation to remain effective.
The proposed framework has significant implications for security and forensic applications, as it addresses the growing threat of audio deepfakes. By providing a robust detection mechanism that generalizes well across various conditions, this work can contribute to the development of reliable systems for identifying manipulated audio, thereby enhancing trust in audio communications and media.
Although Coordinate-MLP-based implicit neural representations have excelled in representing radiance fields, 3D shapes, and images, their application to audio signals remains underexplored. To fill this gap, we investigate existing implicit neural representations, from which we extract three types of positional encoding and 16 commonly used activation functions. Through combinatorial design, we establish the first benchmark for Coordinate-MLPs in audio signal representations. Our benchmark reveals that Coordinate-MLPs require complex hyperparameter tuning and frequency-dependent initialization, limiting their robustness. To address these issues, we propose Fourier-ASR, a novel framework based on the Fourier series theorem and the Kolmogorov-Arnold representation theorem. Fourier-ASR introduces Fourier Kolmogorov-Arnold Networks (Fourier-KAN), which leverage periodicity and strong nonlinearity to represent audio signals, eliminating the need for additional positional encoding. Furthermore, a Frequency-adaptive Learning Strategy (FaLS) is proposed to enhance the convergence of Fourier-KAN by capturing high-frequency components and preventing overfitting of low-frequency signals. Extensive experiments conducted on natural speech and music datasets reveal that: (1) well-designed positional encoding and activation functions in Coordinate-MLPs can effectively improve audio representation quality; and (2) Fourier-ASR can robustly represent complex audio signals without extensive hyperparameter tuning. Looking ahead, the continuity and infinite resolution of implicit audio representations make our research highly promising for tasks such as audio compression, synthesis, and generation. To ensure reproducibility, the source code is publicly available at https://github.com/lif314/Fourier-ASR.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of Fourier-ASR, a novel framework for audio signal representation that effectively utilizes implicit neural representations and provides a benchmark for evaluating Coordinate-MLPs in audio tasks. This work significantly advances the field by addressing the challenges of high-frequency audio representation and enhancing the robustness of neural models in audio processing.
The paper introduces a novel framework, Fourier-ASR, which leverages Fourier series and the Kolmogorov-Arnold representation theorem to improve audio signal representation using implicit neural representations. The methodology is well-structured, combining theoretical foundations with practical implementations, and proposes a Frequency-adaptive Learning Strategy (FaLS) to enhance convergence. The benchmark established for Coordinate-MLPs in audio representation is a significant contribution, providing a comprehensive analysis of positional encodings and activation functions.
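The Fourier-KAN idea can be sketched as a layer in which every input-output edge carries a learnable truncated Fourier series rather than a scalar weight, so raw coordinates need no separate positional encoding. The layer below is a minimal, assumed formulation; harmonic count, initialization, and the frequency-adaptive learning strategy are not reproduced from the paper.

```python
import torch
import torch.nn as nn

class FourierKANLayer(nn.Module):
    """Each edge applies a learnable truncated Fourier series (illustrative Fourier-KAN sketch)."""
    def __init__(self, in_dim: int, out_dim: int, num_harmonics: int = 16):
        super().__init__()
        self.k = torch.arange(1, num_harmonics + 1).float()          # harmonic orders
        self.coef = nn.Parameter(torch.randn(2, out_dim, in_dim, num_harmonics) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x):                                            # x: (batch, in_dim)
        angles = x.unsqueeze(-1) * self.k                            # (batch, in_dim, K)
        basis = torch.stack([torch.cos(angles), torch.sin(angles)])  # (2, batch, in_dim, K)
        y = torch.einsum("sbik,soik->bo", basis, self.coef)          # sum over sin/cos, inputs, harmonics
        return y + self.bias

if __name__ == "__main__":
    t = torch.linspace(0, 1, 8).unsqueeze(-1)       # time coordinates in [0, 1]
    net = nn.Sequential(FourierKANLayer(1, 64), FourierKANLayer(64, 1))
    print(net(t).shape)                             # (8, 1) predicted waveform samples
```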
The experiments are extensive, utilizing natural speech and music datasets to validate the proposed methods. The results demonstrate that Fourier-ASR outperforms traditional Coordinate-MLPs in robustness and representation quality, especially in capturing high-frequency components. The use of various metrics such as SNR and LSD provides a solid basis for comparison.
The authors commit to releasing the source code publicly, which is crucial for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup and hyperparameter tuning processes to facilitate replication by other researchers.
The paper acknowledges the sensitivity of the proposed methods to hyperparameter configurations, particularly in the context of positional encodings and activation functions. Additionally, while the proposed Fourier-KAN shows promise, it may still face challenges in generalization across diverse audio signals.
The implications of this research are significant for audio processing applications, including compression, synthesis, and generation. The ability to represent audio signals continuously and with high fidelity opens avenues for advancements in audio technologies and machine learning applications in sound processing.
Target speaker extraction (TSE) aims to recover the speech signal of a desired speaker from a mixed audio recording, given a short enrollment utterance. Most existing TSE approaches are based on discriminative modeling paradigms. Although effective at suppressing interfering speakers, these methods often struggle to produce speech with high perceptual quality and naturalness. To address this limitation, we first propose LauraTSE, a generative TSE model built upon an auto-regressive decoder-only language model. However, purely generative approaches may suffer from hallucinations, content drift, and limited controllability, which may undermine their reliability in complex acoustic scenarios. To overcome these challenges, we further introduce a discriminative-generative TSE framework. In this framework, a discriminative front-end is employed to robustly extract the target speaker's speech, yielding stable and controllable intermediate representations. A generative back-end then operates in the neural audio codec representation space to reconstruct fine-grained speech details and enhance perceptual quality. This two-stage design effectively combines the robustness and controllability of discriminative models with the superior naturalness and quality enhancement capabilities of generative models. Moreover, we systematically investigate collaborative training strategies for the proposed framework, including freezing or fine-tuning the front-end, incorporating an auxiliary SI-SDR loss, and exploring both auto-regressive and non-auto-regressive inference mechanisms. Experimental results demonstrate that the proposed framework achieves a more favorable trade-off among speech quality, intelligibility, and speaker consistency.
Primary: Wuhan University
All Institutions: Wuhan University, Duke Kunshan University, North Carolina State University
The paper presents a discriminative-generative framework for target speaker extraction that effectively combines the strengths of both modeling paradigms, demonstrating improved speech quality and intelligibility. The comprehensive evaluation and systematic exploration of training strategies highlight its potential impact on the field of audio processing.
The paper introduces a novel discriminative-generative framework for target speaker extraction (TSE) that combines the strengths of both modeling paradigms. The proposed architecture consists of a discriminative front-end for robust extraction and a generative back-end for high-quality reconstruction. This two-stage approach is well-justified, addressing the limitations of existing methods that struggle with perceptual quality and naturalness. The integration of collaborative training strategies and the systematic exploration of both auto-regressive and non-auto-regressive inference mechanisms further enhance the methodology's robustness. However, the paper could benefit from clearer explanations of the model's training dynamics and the rationale behind specific design choices.
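The two-stage design can be summarized as the wiring below: a discriminative front-end produces a coarse target-speaker waveform, which is then re-encoded by a neural audio codec and refined by a generative back-end conditioned on the enrollment. The interfaces of `front_end`, `codec`, and `back_end` are placeholders assumed for illustration; they do not mirror the authors' components.

```python
import torch

def two_stage_tse(mixture, enrollment, front_end, codec, back_end):
    """Wiring of a discriminative-generative TSE pipeline; all components are placeholders."""
    # Stage 1: robust, controllable extraction in the waveform domain.
    coarse_wav = front_end(mixture, enrollment)

    # Stage 2: refine fine-grained detail in the codec representation space.
    coarse_tokens = codec.encode(coarse_wav)
    enroll_tokens = codec.encode(enrollment)
    refined_tokens = back_end(coarse_tokens, enroll_tokens)   # AR or NAR inference
    return codec.decode(refined_tokens)

if __name__ == "__main__":
    # Toy stand-ins so the wiring runs end to end.
    class ToyCodec:
        def encode(self, w): return w[..., ::4]
        def decode(self, t): return t.repeat_interleave(4, dim=-1)
    out = two_stage_tse(torch.randn(1, 16000), torch.randn(1, 16000),
                        front_end=lambda m, e: m * 0.5,
                        codec=ToyCodec(),
                        back_end=lambda c, e: c)
    print(out.shape)
```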
The experimental validation is comprehensive, utilizing a well-defined dataset (LibriMix) and employing various evaluation metrics that are appropriate for assessing speech quality, intelligibility, and speaker consistency. The results demonstrate a favorable trade-off among these metrics, showcasing the effectiveness of the proposed framework. However, the paper could improve by providing more detailed comparisons with other state-of-the-art methods and discussing the implications of the results in a broader context.
While the paper provides a thorough description of the model architecture and training procedures, it lacks specific implementation details such as hyperparameter settings, training duration, and the computational resources used. Including this information would enhance reproducibility and allow other researchers to replicate the findings more easily.
One limitation of the proposed approach is its reliance on a two-stage framework, which may introduce additional complexity and computational overhead. Additionally, the performance of the generative model, LauraTSE, is shown to be sensitive to the design and discretization of input representations, which could impact its reliability in diverse acoustic scenarios. The paper also raises questions about the scalability of the generative model with respect to training data size and its comparative performance against discriminative models in terms of semantic consistency.
The proposed framework has significant potential applications in real-world scenarios such as voice recognition, assistive technologies, and audio processing systems where target speaker extraction is critical. By improving speech quality and intelligibility in mixed environments, this research could enhance communication technologies and contribute to advancements in human-computer interaction.
Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly "human-like" communication necessitates a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. Therefore, we launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under "listening-while-speaking" conditions. This paper summarizes the dataset, track configurations, and the final results.
Primary: Soul AI lab
All Institutions: Soul AI lab
The main contribution of this paper is the establishment of the HumDial Challenge, which benchmarks emotional intelligence and full-duplex interaction in spoken dialogue systems. This initiative represents a significant step forward in evaluating and improving the capabilities of dialogue systems, particularly in the context of LLM advancements, and sets a foundation for future research in human-computer interaction.
The paper presents a well-structured methodology for assessing human-like spoken dialogue systems through the HumDial Challenge, focusing on two critical aspects: Emotional Intelligence and Full-Duplex Interaction. The use of a hybrid dataset combining LLM-generated scripts and human performance is innovative and addresses the limitations of existing benchmarks. The evaluation framework, which includes both automated and human assessments, adds rigor to the methodology. However, the reliance on professional actors for data collection may introduce biases that need to be acknowledged.
The results section highlights the challenge's competitive nature, with over 100 registered teams and 15 valid submissions. The performance metrics indicate that while teams excelled in emotional tracking and reasoning, generating empathetic responses remains a challenge. The detailed breakdown of scores across tasks provides insight into the strengths and weaknesses of various systems, emphasizing the need for further advancements in real-time interaction capabilities.
The paper outlines the evaluation methodology and dataset construction processes in detail, which aids reproducibility. However, the absence of specific implementation details or code availability limits the ability for others to replicate the results fully. The authors should consider providing access to the datasets and evaluation scripts to enhance reproducibility.
One identified limitation is the potential bias introduced by using professional actors for the dataset, which may not fully capture the variability of real-world interactions. Additionally, the challenge's focus on specific emotional dynamics and interaction scenarios may not encompass the full spectrum of human dialogue complexities.
The HumDial Challenge has significant implications for the development of more sophisticated spoken dialogue systems that can engage users in a more human-like manner. The focus on emotional intelligence and real-time interaction is particularly relevant for applications in mental health, customer service, and social robotics, where empathetic communication is crucial.
Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. Addressing these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific fine-tuning and an optimized inference procedure, which internalizes speech-text alignment while ensuring high consistency between the speech before and after editing. By fine-tuning on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Experiments on the RealEdit benchmark indicate that CosyEdit not only outperforms several billion-parameter language model baselines but also matches the performance of state-of-the-art cascade approaches. These results demonstrate that, with task-specific fine-tuning and inference optimization, robust and efficient speech editing capabilities can be unlocked from a zero-shot TTS model, yielding a novel and cost-effective end-to-end solution for high-quality speech editing.
Primary: Nankai University
All Institutions: Nankai University, Lingxi Technology
The main contribution of this paper is the introduction of CosyEdit, an end-to-end speech editing model that effectively utilizes a zero-shot TTS model to achieve high-quality speech editing with minimal fine-tuning. This work represents a significant advancement in the field of speech processing, addressing key challenges in speech editing while demonstrating robust performance against existing methods.
The methodology presented in CosyEdit is well-structured, focusing on adapting a zero-shot TTS model for speech editing through task-specific fine-tuning and optimized inference strategies. The authors propose a novel post-training strategy that constructs a supervised dataset from existing speech corpora, showcasing a clear understanding of the challenges in speech editing. The approach effectively internalizes speech-text alignment, which is a significant advancement over traditional cascade systems that rely on external alignment tools. The combination of autoregressive and non-autoregressive models to enhance speech editing capabilities is innovative and demonstrates a thoughtful integration of different modeling techniques.
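One plausible way to build supervised editing pairs from an existing corpus, in the spirit of the post-training strategy described above, is to pick an aligned word span, substitute new text, and mark the corresponding audio region for regeneration. The helper below assumes word-level timestamps are available and is purely illustrative of the data-construction idea, not the GigaEdit recipe.

```python
import random

def make_edit_example(words, new_text, sample_rate=16000):
    """
    Construct one supervised editing pair from an aligned utterance.
    `words` is a list of (word, start_sec, end_sec) triples; word-level
    alignment is assumed here for illustration only.
    Returns the edited transcript plus the audio span to be regenerated.
    """
    i = random.randrange(len(words))
    j = random.randrange(i, len(words))                       # span [i, j] to replace
    edited = [w for w, _, _ in words[:i]] + new_text.split() + [w for w, _, _ in words[j + 1:]]
    span = (int(words[i][1] * sample_rate), int(words[j][2] * sample_rate))
    return " ".join(edited), span                             # audio outside `span` stays as context

if __name__ == "__main__":
    utt = [("the", 0.0, 0.2), ("cat", 0.2, 0.5), ("sat", 0.5, 0.9)]
    print(make_edit_example(utt, "dog happily"))
```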
The experimental evaluation is comprehensive, utilizing the RealEdit benchmark to assess the performance of CosyEdit against various baseline models. The results indicate that CosyEdit outperforms several billion-parameter models and matches the performance of state-of-the-art cascade approaches, which is a strong testament to its effectiveness. The use of both objective and subjective metrics, including WER, SpkSIM, and MOS, provides a well-rounded assessment of the model's capabilities. The ablation studies further enhance the understanding of the contributions of different components of the model.
The paper provides sufficient details regarding the training process, including the dataset used and the training parameters, which aids in reproducibility. However, the absence of a demo or project URL limits the ability for others to directly access the implementation. Clear descriptions of the experimental setup and evaluation metrics enhance the reproducibility of the findings.
One limitation of the study is the reliance on a relatively small dataset (250 hours) for fine-tuning, which may affect the generalizability of the model to more diverse speech editing tasks. Additionally, while the results are promising, further exploration of the model's performance in real-world scenarios and with more complex editing tasks would be beneficial. The potential for misuse in generating deepfakes is also acknowledged, suggesting a need for responsible deployment.
The implications of this research are significant, particularly in fields such as multimedia production, intelligent contact centers, and speech data augmentation. By enabling efficient and high-quality speech editing, CosyEdit could enhance the capabilities of various applications that rely on speech processing. The authors also mention plans for open-sourcing their code and datasets, which could foster further research and development in this area.
This study proposes FlexiVoice, a text-to-speech (TTS) synthesis system capable of flexible style control with zero-shot voice cloning. The speaking style is controlled by a natural-language instruction, and the voice timbre is provided by a speech reference in a zero-shot manner. FlexiVoice is built with an LLM core, which takes text as input, and also takes an optional natural language instruction and an optional speech reference to control style and timbre, respectively. FlexiVoice is equipped with a novel Progressive Post-Training (PPT) scheme that progressively unlocks accurate and flexible controllability. In particular, it first employs Direct Preference Optimization (DPO) to enable FlexiVoice to accurately follow both natural language instruction and speech reference simultaneously. It then uses a multi-objective Group Relative Policy Optimization (GRPO) to disentangle style instruction, reference timbre, and textual content. Finally, it adapts instruction GRPO for more advanced instruction following. Experimental results show that FlexiVoice surpasses competing baselines and demonstrates strong capability in decoupling control factors. Human evaluations further confirm its naturalness, controllability, and robustness. Audio samples are available at https://flexi-voice.github.io.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of FlexiVoice, a novel TTS system that enables flexible style control through natural language instructions and zero-shot voice cloning, significantly advancing the state of the art in controllable speech synthesis. The comprehensive approach to methodology, rigorous experimental validation, and potential for real-world applications underscore its significance in the field of machine learning and audio processing.
The methodology presented in this paper is robust and innovative, particularly with the introduction of the Progressive Post-Training (PPT) framework, which systematically addresses the challenges of style-timbre-content conflict in TTS systems. The use of Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) for multi-objective training is a significant advancement in the field, allowing for effective disentanglement of style and timbre while maintaining content integrity. The incorporation of a large-scale, annotated instruction-speech dataset further enhances the model's training and performance, showcasing a comprehensive approach to improving TTS systems.
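For context, the first PPT stage uses Direct Preference Optimization, whose standard objective on paired preferred/rejected samples is shown below. The beta value and how preference pairs are scored for instruction and timbre adherence are assumptions; only the loss form itself is standard.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """
    Standard DPO objective on sequence log-probabilities. How FlexiVoice builds
    and scores its preference pairs is not reproduced here; this is the generic loss.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

if __name__ == "__main__":
    lp = lambda: torch.randn(4)          # toy sequence log-probs for a batch of 4 pairs
    print(float(dpo_loss(lp(), lp(), lp(), lp())))
```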
The experimental evaluation is thorough, with a well-defined setup that includes both qualitative and quantitative assessments. The authors compare FlexiVoice against a range of competitive baselines, demonstrating its superior performance in terms of instruction adherence and robustness. Human evaluations confirm the model's naturalness and controllability, providing strong empirical support for the claims made. The diverse datasets used for training and evaluation, including emotion-centric tasks and complex instruction-following benchmarks, lend credibility to the findings.
The paper emphasizes reproducibility by detailing the model architecture, training objectives, and evaluation protocols. The authors commit to releasing the instruction-speech dataset, model checkpoints, and code, which is essential for facilitating replication and further research. However, the complexity of the training process and the reliance on specific datasets may pose challenges for complete reproducibility without access to the exact resources used.
While the paper presents significant advancements, it does not address potential limitations in terms of the model's performance across diverse languages and dialects beyond English and Chinese. Additionally, the reliance on large datasets for training may limit accessibility for researchers with fewer resources. The potential for misuse of advanced TTS technology, such as generating deceptive audio, is also a concern that is briefly mentioned but not deeply explored.
The implications of this research are substantial, as it enhances the capabilities of TTS systems, making them more adaptable for various applications, including virtual assistants, audiobooks, and gaming. The ability to control speaking style through natural language instructions opens up new avenues for user interaction and personalization. However, ethical considerations regarding the misuse of such technology must be carefully managed to prevent potential harm.
Noise-robust automatic speech recognition (ASR) has been commonly addressed by applying speech enhancement (SE) at the waveform level before recognition. However, speech-level enhancement does not always translate into consistent recognition improvements due to residual distortions and mismatches with the latent space of the ASR encoder. In this letter, we introduce a complementary strategy termed latent-level enhancement, where distorted representations are refined during ASR inference. Specifically, we propose a plug-and-play Flow Matching Refinement module (FM-Refiner) that operates on the output latents of a pretrained CTC-based ASR encoder. Trained to map imperfect latents, either directly from noisy inputs or from enhanced-but-imperfect speech, toward their clean counterparts, the FM-Refiner is applied only at inference, without fine-tuning ASR parameters. Experiments show that FM-Refiner consistently reduces word error rate, both when directly applied to noisy inputs and when combined with conventional SE front-ends. These results demonstrate that latent-level refinement via flow matching provides a lightweight and effective complement to existing SE approaches for robust ASR.
Primary: Hanyang University
All Institutions: Hanyang University
This paper presents a significant advancement in noise-robust automatic speech recognition through the introduction of a latent-level enhancement technique. The innovative FM-Refiner effectively bridges the gap between noisy and clean representations, demonstrating substantial improvements in ASR performance across various noise conditions.
The paper introduces a novel approach to automatic speech recognition (ASR) by proposing the Flow Matching Refinement module (FM-Refiner), which operates on the latent representations of a pretrained CTC-based ASR encoder. The methodology is well-structured, leveraging flow matching to refine distorted latents into cleaner representations. The plug-and-play nature of the FM-Refiner allows it to be integrated into existing ASR systems without the need for retraining, which is a significant advantage. The use of flow matching as a deterministic transport mapping is innovative and addresses the common issue of residual distortions in traditional speech enhancement methods.
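One way to read the FM-Refiner is as a straight-line flow between paired imperfect and clean encoder latents: train a velocity network on linear interpolants, then Euler-integrate from the noisy latent at inference. The sketch below follows that reading with assumed shapes and step counts; it is not the paper's exact parameterization.

```python
import torch

def fm_training_step(refiner, z_imperfect, z_clean):
    """Flow-matching loss on a linear path from an imperfect ASR latent to its clean pair (illustrative)."""
    t = torch.rand(z_clean.size(0), 1, 1)                 # per-example time in [0, 1]
    x_t = (1 - t) * z_imperfect + t * z_clean             # linear interpolation path
    target_v = z_clean - z_imperfect                      # constant velocity along the path
    pred_v = refiner(x_t, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def refine(refiner, z_imperfect, steps: int = 8):
    """Euler integration from the encoder latent toward a 'clean' latent at inference."""
    x = z_imperfect
    for i in range(steps):
        t = torch.full((x.size(0), 1, 1), i / steps)
        x = x + refiner(x, t) / steps
    return x

if __name__ == "__main__":
    toy = lambda x, t: torch.zeros_like(x)                # stand-in for the refiner network
    z = torch.randn(2, 50, 256)                           # (batch, frames, latent_dim)
    print(fm_training_step(toy, z, z + 0.1).item(), refine(toy, z).shape)
```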
The experimental setup is robust, utilizing multiple datasets and noise conditions to evaluate the effectiveness of the FM-Refiner. The results consistently demonstrate a reduction in word error rate (WER) across various scenarios, both with and without prior speech enhancement. This thorough evaluation provides strong empirical support for the proposed method's effectiveness, showcasing its applicability across different SE front-ends.
The paper provides sufficient details regarding the architecture of the FM-Refiner and the ASR model, including training parameters and dataset descriptions. However, the lack of a publicly available code repository limits the reproducibility of the results. Including a link to the implementation would greatly enhance the ability for others to replicate the findings.
One limitation of the study is the reliance on pretrained models, which may not generalize well to all ASR tasks or noise conditions. Additionally, while the results are promising, the paper does not explore the computational efficiency of the FM-Refiner during inference, which could be a concern for real-time applications.
The proposed method has significant implications for improving ASR systems in real-world noisy environments, which is crucial for applications in telecommunications, voice-activated systems, and accessibility technologies. By enhancing the robustness of ASR, this research could lead to better user experiences and broader adoption of voice recognition technologies.
Although Audio Large Language Models (ALLMs) have witnessed substantial advancements, their long-audio understanding capabilities remain unexplored. A plethora of benchmarks have been proposed for general audio tasks, but they predominantly focus on short-form clips, leaving no consensus on how to evaluate ALLMs over extended durations. This paper proposes ChronosAudio, the first multi-task benchmark tailored for long-audio understanding in ALLMs. It encompasses six major task categories and comprises 36,000 test instances totaling over 200 hours of audio, stratified into short, middle, and long-form categories to comprehensively evaluate length generalization. Extensive experiments on 16 state-of-the-art models using ChronosAudio yield three critical findings: (1) Precipitous Long-Context Collapse: ALLMs exhibit a severe inability to sustain performance, with the transition from short to long contexts triggering a staggering performance degradation of over 90% in specific tasks. (2) Structural Attention Dilution: performance degradation stems from a fundamental failure to maintain temporal locality; attention mechanisms suffer from significant diffusion in later sequences. (3) Restorative Ceiling of Mitigation: current strategies offer only 50% recovery. These findings reveal significant challenges in long-audio understanding, underscoring the urgent need for approaches that achieve robust, document-level audio reasoning.
Primary: unknown
All Institutions: unknown
The paper presents ChronosAudio, a pioneering benchmark for evaluating long-audio understanding in Audio Large Language Models, highlighting significant performance challenges and laying the groundwork for future advancements in the field.
The methodology presented in this paper is robust, introducing a multi-task benchmark (ChronosAudio) specifically designed for evaluating long-audio understanding in Audio Large Language Models (ALLMs). The authors have structured the benchmark into six distinct task categories, which cover a wide range of audio processing challenges. The use of flash-attention mechanisms and the Hugging Face ecosystem for implementation demonstrates a modern approach to handling the computational complexities associated with long audio sequences. However, the paper could benefit from a clearer explanation of how the tasks were selected and the rationale behind the specific metrics used for evaluation.
The experiments conducted on 16 state-of-the-art models are extensive and yield significant findings regarding the performance degradation of ALLMs when transitioning from short to long contexts. The paper effectively highlights critical issues such as the "Long-Context Collapse" and "Structural Attention Dilution," which are supported by empirical data. However, the results could be strengthened by including more comparative analysis with existing benchmarks and a deeper exploration of the models' performance across different audio lengths.
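A simple way to probe the reported "Structural Attention Dilution" is to track the entropy of each query position's attention distribution and check whether it grows for later positions in long audio sequences. The diagnostic below is a generic probe under that assumption, not the analysis actually used in the paper.

```python
import torch

def attention_entropy_by_position(attn):
    """
    Entropy of each query's attention distribution, averaged over heads.
    Rising entropy toward later positions indicates attention spreading thin
    over long audio context. Generic probe, not the paper's exact analysis.
    attn: (heads, query_len, key_len) attention weights, rows sum to 1.
    """
    ent = -(attn.clamp_min(1e-9).log() * attn).sum(-1)   # (heads, query_len)
    return ent.mean(0)                                   # entropy per query position

if __name__ == "__main__":
    h, q, k = 4, 128, 128
    attn = torch.softmax(torch.randn(h, q, k), dim=-1)
    curve = attention_entropy_by_position(attn)
    print(curve[:4], curve[-4:])
```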
While the paper provides some implementation details, including hardware specifications and library versions, it lacks a complete description of the experimental setup and data processing steps. This omission could hinder reproducibility. Providing a link to a code repository or detailed supplementary materials would enhance this aspect significantly.
The primary limitation of the study is the focus on a specific subset of tasks that may not encompass all potential applications of ALLMs in long-audio contexts. Additionally, the findings regarding performance degradation could be further contextualized with respect to other models not included in the study.
The introduction of ChronosAudio as a benchmark for long-audio understanding has the potential to significantly influence future research in the field of audio processing and ALLMs. By identifying critical challenges and performance gaps, this work paves the way for developing more effective models capable of handling long-form audio tasks, which are increasingly relevant in various applications, including transcription, localization, and comprehension tasks.
Speech Emotion Recognition (SER) systems often assume congruence between vocal emotion and lexical semantics. However, in real-world interactions, acoustic-semantic conflict is common yet overlooked, where the emotion conveyed by tone contradicts the literal meaning of spoken words. We show that state-of-the-art SER models, including ASR-based approaches, self-supervised learning (SSL) approaches, and Audio Language Models (ALMs), suffer performance degradation under such conflicts due to semantic bias or entangled acoustic-semantic representations. To address this, we propose the Fusion Acoustic-Semantic (FAS) framework, which explicitly disentangles acoustic and semantic pathways and bridges them through a lightweight, query-based attention module. To enable systematic evaluation, we introduce the Conflict in Acoustic-Semantic Emotion (CASE), the first dataset dominated by clear and interpretable acoustic-semantic conflicts in varied scenarios. Extensive experiments demonstrate that FAS consistently outperforms existing methods in both in-domain and zero-shot settings. Notably, on the CASE benchmark, conventional SER models fail dramatically, while FAS sets a new SOTA with 59.38% accuracy. Our code and datasets are available at https://github.com/24DavidHuang/FAS.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the FAS framework and the CASE benchmark, which together provide a novel approach to addressing acoustic-semantic conflict in speech emotion recognition. This work significantly advances the field by offering a systematic evaluation of SER models in challenging real-world scenarios.
The proposed Fusion Acoustic-Semantic (FAS) framework introduces a novel approach to disentangle acoustic and semantic pathways in Speech Emotion Recognition (SER). This is achieved through a lightweight, query-based attention module that effectively integrates the two pathways. The methodology is well-structured, with clear steps for token distillation and feature fusion. The use of a new dataset, the Conflict in Acoustic-Semantic Emotion (CASE), specifically designed to evaluate SER models under conditions of acoustic-semantic conflict, adds significant value to the methodology. The approach is innovative in addressing a previously overlooked challenge in SER, making it a meaningful contribution to the field.
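To make the bridging mechanism concrete, the sketch below shows one minimal way a query-based attention module could couple decoupled acoustic and semantic streams: learnable queries attend to each stream separately and the pooled results feed a classifier. The dimensions, number of queries, and classifier head are illustrative assumptions, not the FAS configuration reported in the paper.

    # Minimal sketch of a query-based attention bridge between acoustic and
    # semantic streams; sizes and the fusion head are illustrative assumptions.
    import torch
    import torch.nn as nn

    class QueryFusion(nn.Module):
        def __init__(self, d_model=256, n_queries=8, n_heads=4, n_classes=4):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(n_queries, d_model))
            self.attn_acoustic = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.attn_semantic = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.classifier = nn.Linear(2 * d_model, n_classes)

        def forward(self, acoustic, semantic):
            # acoustic: (B, Ta, d) speech features; semantic: (B, Tt, d) text embeddings
            q = self.queries.unsqueeze(0).expand(acoustic.size(0), -1, -1)
            a, _ = self.attn_acoustic(q, acoustic, acoustic)  # queries distill acoustic cues
            s, _ = self.attn_semantic(q, semantic, semantic)  # queries distill semantic cues
            fused = torch.cat([a.mean(dim=1), s.mean(dim=1)], dim=-1)
            return self.classifier(fused)                     # emotion logits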
The experiments conducted are extensive and rigorous, demonstrating the effectiveness of the FAS framework across various benchmarks, including in-domain and zero-shot settings. The results indicate that FAS outperforms existing state-of-the-art models, particularly in scenarios characterized by acoustic-semantic conflict. The introduction of the CASE benchmark is a significant advancement, providing a controlled environment for evaluating model robustness. The comprehensive evaluation metrics (accuracy and F1 score) further strengthen the experimental assessment.
The paper provides sufficient implementation details, including the architecture of the FAS framework, training protocols, and hyperparameter settings. The use of a fixed random seed enhances reproducibility. However, the absence of a clear primary institution and specific venue details may hinder the overall reproducibility and recognition of the research.
While the study introduces a valuable framework and dataset, it acknowledges certain limitations, such as the linguistic coverage of the CASE benchmark and its focus on binary acoustic-semantic conflicts. The dataset's size may also limit its utility as a standalone training resource. Additionally, the framework does not account for other contextual cues that could influence emotional expression.
The implications of this research are significant, particularly in applications related to affective computing, human-computer interaction, and emotional AI. By improving the robustness of SER systems in real-world scenarios, this work could enhance user experience in various domains, including customer service, mental health monitoring, and interactive entertainment. However, ethical considerations regarding the potential misuse of such technology must be addressed.
Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity and the subjective nature of clinical annotations. While semi-supervised learning (SSL) offers a viable path to leverage unlabeled data, existing audio methods often fail to address the core challenge that pathological traits are not uniformly expressed in a patient's speech. We propose a novel, audio-only SSL framework that explicitly models this hierarchy by jointly learning from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Our end-to-end approach dynamically aggregates these multi-granularity features and generates high-quality pseudo-labels to efficiently utilize unlabeled data. Extensive experiments show the framework is model-agnostic, robust across languages and conditions, and highly data-efficient, achieving, for instance, 90% of fully-supervised performance using only 11 labeled samples. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.
Primary: unknown
All Institutions: unknown
This paper presents a novel semi-supervised learning framework for detecting medical conditions from speech, leveraging multi-level data modeling to enhance performance with minimal labeled data. The technical contributions and methodology are promising, but the paper could improve in clarity and rigor regarding experimental validation and reproducibility.
The proposed methodology introduces a novel semi-supervised learning framework that effectively addresses the challenges of weakly-supervised learning in medical speech analysis. By modeling speech data at three granularity levels (frame, segment, and session), the approach allows for a comprehensive understanding of the acoustic features that correlate with medical conditions. The end-to-end training process and the dynamic generation of pseudo-labels are significant contributions, as they enable the model to leverage unlabeled data efficiently. However, the methodology could benefit from clearer descriptions of the algorithms used for pseudo-label generation and the specific architectures of the models employed.
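As a rough illustration of the multi-granularity idea, the sketch below aggregates frame-level features into segment- and session-level representations and keeps only confident predictions as pseudo-labels; the mean pooling and the 0.9 confidence threshold are assumptions, since the paper's dynamic aggregation and pseudo-labeling rules are not spelled out here.

    # Illustrative multi-granularity aggregation and confidence-based pseudo-labeling.
    import torch

    def aggregate(frame_feats, seg_len=100):
        # frame_feats: (T, d) frame-level features from one unsegmented session
        T, d = frame_feats.shape
        n_seg = max(1, T // seg_len)
        segs = frame_feats[: n_seg * seg_len].view(n_seg, seg_len, d)
        segment_feats = segs.mean(dim=1)          # segment-level representations
        session_feat = segment_feats.mean(dim=0)  # session-level representation
        return segment_feats, session_feat

    def pseudo_label(session_logits, threshold=0.9):
        # session_logits: (C,) class logits; keep only confident predictions.
        probs = torch.softmax(session_logits, dim=-1)
        conf, label = probs.max(dim=-1)
        return label if conf.item() >= threshold else None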
The experiments conducted are extensive and demonstrate the robustness of the proposed framework across different languages and medical conditions. The use of two distinct datasets (EATD-Corpus and ADReSSo21) adds credibility to the results. The reported performance metrics, particularly achieving 90% of fully-supervised performance with only 11 labeled samples, highlight the effectiveness of the approach. However, the paper lacks detailed statistical analysis of the results, such as confidence intervals or significance testing, which would strengthen the claims made.
The paper provides a link to the code repository, which is a positive aspect for reproducibility. The training details are sufficiently described, including the optimizer settings and data augmentation techniques. However, the lack of specific hyperparameter tuning details and the absence of a clear reproducibility checklist may hinder other researchers from replicating the results accurately.
The primary limitation of the proposed method is its unimodal nature, which restricts the model from utilizing multimodal information that could enhance performance. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other medical conditions or languages not represented in the datasets used. The paper also does not address potential biases in the datasets, which could affect the model's applicability in real-world scenarios.
The implications of this work are significant, as it addresses a critical need for efficient medical diagnosis tools that can operate with limited labeled data. The framework has the potential to improve diagnostic accuracy in clinical settings, particularly in resource-limited environments. Moreover, the model's design could inspire further research into audio-based diagnostics and the application of semi-supervised learning in other healthcare domains.
Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore do not support continuous and reference-relative style control. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS. Our key insight is that effective style control requires first reducing the model's implicit dependence on reference style before introducing explicit control mechanisms. To this end, we introduce Decoupled Classifier-Free Guidance (DCFG), which independently controls text and reference guidance, reducing reliance on reference style while preserving text fidelity. On top of this, we apply style-specific LoRAs together with Orthogonal LoRA Fusion to enable continuous and disentangled multi-attribute control, and introduce a Timbre Consistency Optimization module to mitigate timbre drift caused by weakened reference guidance. Experiments show that ReStyle-TTS enables user-friendly, continuous, and relative control over pitch, energy, and multiple emotions while maintaining intelligibility and speaker timbre, and performs robustly in challenging mismatched reference-target style scenarios.
Primary: Zhejiang University
All Institutions: Ant Group, Shanghai Innovation Institute, Shanghai Jiao Tong University, Xiamen University, Zhejiang University
The paper presents ReStyle-TTS, a novel framework for zero-shot speech synthesis that enables continuous and relative style control while preserving speaker timbre. This work contributes significantly to the field of TTS by addressing limitations in existing methods and providing a robust solution for expressive speech synthesis.
The methodology proposed in ReStyle-TTS is innovative, particularly with the introduction of Decoupled Classifier-Free Guidance (DCFG) and Orthogonal LoRA Fusion (OLoRA). DCFG effectively separates the influences of text and reference audio, allowing for more flexible style control, which is a significant advancement over existing methods that rely on absolute style targets. The use of style-specific LoRAs enables continuous control over multiple attributes, which is a notable improvement in the field of TTS. The Timbre Consistency Optimization (TCO) module is also a valuable addition, addressing the common issue of timbre drift when reducing reference dependency. Overall, the proposed methods are well-structured and demonstrate a clear progression from existing techniques.
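For readers unfamiliar with decoupled guidance, the sketch below shows one common way to combine separately scaled text and reference guidance terms at sampling time; the model interface (text/ref keyword conditions) and the specific scales are assumptions, and ReStyle-TTS's exact DCFG formulation may differ.

    # Sketch of decoupled classifier-free guidance with separate scales for the
    # text and reference-audio conditions; names and scales are illustrative.
    def dcfg(model, x_t, t, text_cond, ref_cond, w_text=3.0, w_ref=1.0):
        eps_uncond = model(x_t, t, text=None, ref=None)
        eps_text = model(x_t, t, text=text_cond, ref=None)
        eps_ref = model(x_t, t, text=None, ref=ref_cond)
        # Scale the text and reference guidance directions independently;
        # lowering w_ref weakens reliance on the reference style.
        return (eps_uncond
                + w_text * (eps_text - eps_uncond)
                + w_ref * (eps_ref - eps_uncond))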
The experiments conducted are comprehensive, utilizing a variety of datasets and evaluation metrics, including Word Error Rate (WER) and timbre similarity. The subjective evaluations through Mean Opinion Score for Style Accuracy (MOS-SA) add depth to the assessment of the model's performance. The experiments effectively demonstrate the model's capabilities in handling contradictory-style generation and continuous control over multiple attributes, showcasing its robustness in challenging scenarios. However, the paper could benefit from a more extensive comparison with a broader range of existing methods to highlight its advantages more clearly.
The paper provides sufficient details on the experimental setup, including the training process, datasets used, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider releasing the code and models to facilitate further research and validation of the findings.
A significant limitation mentioned is the scalability of the model to new attributes, which requires additional dataset collection and fine-tuning of LoRAs. This could hinder the practical application of the model in real-world scenarios where rapid adaptation to new styles is necessary. Additionally, while the model performs well in controlled settings, its performance in more diverse and unpredictable real-world conditions remains to be evaluated.
The advancements presented in ReStyle-TTS have the potential to significantly impact the field of speech synthesis, particularly in applications requiring expressive and controllable speech, such as virtual assistants, audiobooks, and gaming. The ability to manipulate speech styles continuously and relative to reference audio could enhance user experiences across various domains, making TTS systems more versatile and user-friendly.
Temporal detection problems appear in many fields including time-series estimation, activity recognition and sound event detection (SED). In this work, we propose a new approach to temporal event modeling by explicitly modeling event onsets and offsets, and by introducing boundary-aware optimization and inference strategies that substantially enhance temporal event detection. The presented methodology incorporates new temporal modeling layers - Recurrent Event Detection (RED) and Event Proposal Network (EPN) - which, together with tailored loss functions, enable more effective and precise temporal event detection. We evaluate the proposed method in the SED domain using a subset of the temporally-strongly annotated portion of AudioSet. Experimental results show that our approach not only outperforms traditional frame-wise SED models with state-of-the-art post-processing, but also removes the need for post-processing hyperparameter tuning, and scales to achieve new state-of-the-art performance across all AudioSet Strong classes.
Primary: Institute of Computational Perception
All Institutions: Institute of Computational Perception, Linz Institute of Technology, Meta Reality Labs Research
The paper presents a novel method for sound event detection that enhances temporal localization through boundary-aware optimization and inference strategies. The technical contributions, including the RED layer and EPNs, provide a significant advancement in the field, addressing existing limitations in traditional models and demonstrating state-of-the-art performance on a challenging dataset.
The proposed methodology introduces significant advancements in sound event detection (SED) by focusing on the precise modeling of event boundaries through the Recurrent Event Detection (RED) layer and Event Proposal Networks (EPNs). The integration of these components allows for direct prediction of event regions without the need for post-processing, which is a common bottleneck in traditional SED approaches. The use of tailored loss functions, including focal loss for onset and offset probabilities, enhances the model's ability to learn from imbalanced datasets effectively. This end-to-end framework is innovative and addresses critical limitations in existing methods.
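To illustrate the boundary-aware supervision, the sketch below applies a focal binary cross-entropy to frame-level onset/offset maps, a standard choice for sparse boundary targets; the gamma and alpha values are illustrative and not taken from the paper.

    # Focal binary cross-entropy on frame-level onset (or offset) probabilities.
    import torch
    import torch.nn.functional as F

    def boundary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        # logits, targets: (B, T, C); targets are 0/1 floats marking boundaries.
        p = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()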
The experiments are well-structured, utilizing a substantial dataset (AudioSet Strong) for evaluation. The authors demonstrate clear improvements over baseline methods, including median filtering and state-of-the-art post-processing techniques. The results show that the proposed method achieves a new state-of-the-art PSDS1 score, indicating its effectiveness across various classes. The evaluation metrics are appropriate for the task, and the authors provide a thorough analysis of the contributions of each component of their method.
The paper provides sufficient details regarding the architecture, training procedures, and datasets used, which facilitates reproducibility. However, the absence of a publicly available code repository or demo limits the ability for others to directly replicate the results. The authors mention the use of specific hyperparameters and training configurations, which are crucial for reproducing the experiments.
A notable limitation is the lack of real-time inference capability, which the authors acknowledge and plan to address in future work. Additionally, while the method shows significant improvements, the performance on rare classes remains a concern, as the dataset is highly imbalanced. The reliance on frame-wise models may also limit the generalizability of the approach to other audio recognition tasks beyond SED.
This research has the potential to significantly impact various applications, including smart home systems, healthcare monitoring, and security surveillance, where accurate sound event detection is crucial. The ability to detect and localize sounds in real-time could enhance user experiences and enable more intelligent systems. The proposed method's elimination of post-processing hyperparameters could streamline the deployment of SED systems in real-world applications.
Multi-speaker automatic speech recognition (MASR) aims to predict "who spoke when and what" from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing "when" and "who": some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal dynamics within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to simultaneously attend to "when" and "who". Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach.
Primary: Inner Mongolia University
All Institutions: Inner Mongolia University
The paper presents TellWhisper, a unified framework for multi-speaker automatic speech recognition that effectively integrates temporal and speaker dynamics. This innovative approach has the potential to significantly advance the field of speech recognition, particularly in complex multi-speaker environments.
The methodology presented in this paper is innovative, particularly with the introduction of the TS-RoPE (time-speaker rotary positional encoding) that integrates temporal and speaker information directly into the speech encoder. This approach addresses the common pitfalls of existing MASR systems that treat speaker and temporal modeling separately, which often leads to performance degradation in overlapping speech scenarios. The use of hyperbolic space for speaker classification through Hyper-SD is also a novel contribution that enhances inter-class separation, thus providing a robust framework for speaker activity estimation. The overall architecture is well-structured, with clear delineation of components and their interactions.
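The sketch below gives a rough, heavily simplified picture of a rotary encoding driven by two coordinates, applying time-derived angles to one half of the features and speaker-derived angles to the other; the feature split and the speaker coordinate definition are assumptions, and the paper's region-specific rotation angles are more elaborate than this.

    # Simplified two-coordinate rotary encoding; assumes the feature dim is divisible by 4.
    import torch

    def rope(x, positions, base=10000.0):
        # x: (B, T, D) with D even; positions: (B, T) integer coordinate per frame.
        half = x.size(-1) // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float) / half)
        angles = positions.unsqueeze(-1).float() * freqs        # (B, T, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    def ts_rope(x, frame_idx, speaker_idx):
        # Time-based rotation on one half of the features, speaker-based on the other.
        D = x.size(-1)
        x_time, x_spk = x[..., : D // 2], x[..., D // 2:]
        return torch.cat([rope(x_time, frame_idx), rope(x_spk, speaker_idx)], dim=-1)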
The experiments conducted are extensive and well-designed, utilizing multiple datasets that reflect real-world conditions. The evaluation metrics are appropriate for the task, particularly given the complexities of speaker attribution and temporal alignment in MASR. The results indicate significant improvements over baseline models, showcasing the effectiveness of the proposed methods. The ablation studies provide valuable insights into the contribution of each component, reinforcing the robustness of the findings.
While the paper provides a detailed description of the methodology and experimental setup, the lack of a publicly available code repository or demo limits reproducibility. Clear hyperparameter settings and training strategies are discussed, but without access to the implementation, it would be challenging for other researchers to replicate the results.
One limitation is the absence of a demo or project URL, which would facilitate further exploration of the proposed methods. Additionally, while the paper demonstrates improvements over existing methods, it does not fully address potential scalability issues or performance in highly diverse real-world environments. The reliance on hyperbolic space may also introduce complexity in understanding and interpreting the model's behavior.
The implications of this research are significant for applications in multi-party dialogue systems, enhancing the capabilities of automatic transcription services, virtual assistants, and collaborative tools. Improved MASR can lead to better accessibility and usability in various domains, including education, business, and healthcare.
In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: we propose three explicit cross-lingual modeling strategies (boundary-aware alignment, token-level concatenation, and instruction-guided generation), establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish, and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and naturalness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28 times improvement in RTF while maintaining comparable WER and speaker similarity to IndexTTS 2.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the development of IndexTTS 2.5, which enhances zero-shot multilingual emotional text-to-speech synthesis through innovative architectural improvements and modeling strategies. This work represents a significant step forward in the field of TTS, particularly in its ability to handle multiple languages and emotional nuances effectively.
The methodology presented in IndexTTS 2.5 is robust, featuring a well-structured approach to enhancing a zero-shot multilingual text-to-speech (TTS) system. The authors introduce four significant improvements: semantic codec compression, architectural upgrades to the S2M module, multilingual modeling strategies, and reinforcement learning optimization. Each component is clearly defined and justified, showcasing a thoughtful integration of existing techniques with novel enhancements. The multilingual extension strategies, particularly the boundary-aware alignment and token-level concatenation, are innovative and address specific challenges in cross-lingual TTS, which is a notable contribution to the field.
The experimental evaluation is comprehensive, utilizing a substantial dataset of approximately 100K hours of speech data across multiple languages. The authors employ a variety of metrics, including WER, speaker similarity, emotional similarity, and MOS, to assess the performance of their model. The results demonstrate significant improvements over previous models, particularly in emotional expressiveness and multilingual capabilities. The comparative analysis of different modeling strategies provides valuable insights into their effectiveness, although the paper could benefit from additional qualitative assessments or user studies to further validate the findings.
The paper provides a detailed account of the training and evaluation datasets, methodologies, and metrics used, which supports reproducibility. However, the lack of specific implementation details, such as code availability or model weights, limits the ability for others to fully replicate the results. Including a link to a code repository or supplementary materials would enhance reproducibility.
While the paper presents significant advancements, it does not address potential limitations in handling languages with complex morphological structures or dialectal variations. Additionally, the reliance on a large dataset may pose challenges for deployment in resource-constrained environments. The performance on languages with less training data, such as Spanish, may also require further investigation.
The advancements in multilingual TTS systems have the potential to significantly impact various applications, including virtual assistants, audiobooks, and language learning tools. By improving emotional expressiveness and reducing inference latency, IndexTTS 2.5 could enhance user experiences in diverse linguistic contexts. However, ethical considerations regarding the use of synthetic voices, particularly in emotional contexts, should be addressed to prevent misuse.
Neural Audio Codecs (NACs) can reduce transmission overhead by performing compact compression and reconstruction, which also aims to bridge the gap between continuous and discrete signals. Existing NACs can be divided into two categories: multi-codebook and single-codebook codecs. Multi-codebook codecs face challenges such as structural complexity and difficulty in adapting to downstream tasks, while single-codebook codecs, though structurally simpler, suffer from low fidelity, ineffective modeling of unified audio, and an inability to support modeling of high-frequency audio. We propose UniSRCodec, a single-codebook codec that supports high sampling rates, low bandwidth, high fidelity, and unified audio modeling. We analyze the inefficiency of waveform-based compression, introduce a time and frequency compression method based on the Mel-spectrogram, and pair it with a vocoder to recover the phase information of the original audio. Moreover, we propose a sub-band reconstruction technique to achieve high-quality compression across both low- and high-frequency bands. Subjective and objective experimental results demonstrate that UniSRCodec achieves state-of-the-art (SOTA) performance among cross-domain single-codebook codecs with a token rate of only 40, and its reconstruction quality is comparable to that of certain multi-codebook methods. Our demo page is available at https://wxzyd123.github.io/unisrcodec.
Primary: Tsinghua University
All Institutions: Tsinghua University, ModelBest Inc.
The paper presents UniSRCodec, a unified and low-bitrate single-codebook neural audio codec that employs innovative techniques for high-fidelity audio modeling. Its contributions to the field of audio processing are substantial, addressing critical challenges in codec design and demonstrating impressive performance across diverse audio domains.
The paper introduces the UniSRCodec, a novel single-codebook neural audio codec that addresses the limitations of existing codecs by employing a time and frequency compression method using Mel-spectrograms and a sub-band reconstruction technique. The methodology is well-structured, with a clear explanation of the encoder-quantizer-decoder architecture and the rationale behind using Mel-spectrograms. The sub-band reconstruction approach is particularly innovative, allowing for improved modeling of both low and high-frequency audio signals. The paper also discusses the training procedure and the use of various loss functions effectively, contributing to the overall robustness of the proposed method.
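As a concrete, simplified view of the Mel-domain compression and sub-band split, the snippet below extracts a Mel-spectrogram and divides its bins into low and high bands before any tokenization; the file name, frame parameters, Mel-bin count, and split point are all illustrative assumptions.

    # Sketch of the Mel-domain view with a low/high sub-band split.
    import torchaudio

    wav, sr = torchaudio.load("example.wav")                          # hypothetical file
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr, n_fft=1024, hop_length=256, n_mels=128)(wav)  # (channels, 128, T)
    low_band, high_band = mel[:, :64, :], mel[:, 64:, :]              # sub-band views
    # Each band could be encoded and quantized separately, with phase
    # recovered by a vocoder at reconstruction time.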
The experimental evaluation is thorough, utilizing a diverse set of cross-domain datasets that include speech, music, and general sound. The results demonstrate that UniSRCodec achieves state-of-the-art performance among single-codebook codecs, with a low token rate of 40 and a bitrate of 0.52 kbps. The paper includes both subjective and objective evaluations, providing a comprehensive assessment of the codec's performance. The ablation studies further validate the effectiveness of the proposed components, such as the sub-band reconstruction and the discriminator.
The paper provides sufficient details regarding the architecture, training procedure, and datasets used, which aids in reproducibility. However, the lack of a public code repository or detailed implementation instructions may hinder complete reproducibility by other researchers.
One limitation noted is that while the codec performs well across various audio domains, its performance in the speech domain is somewhat lower compared to existing models like UniCodec, primarily due to differences in training data. Additionally, the reliance on a relatively small number of GPUs for training may limit accessibility for some researchers.
The UniSRCodec has significant implications for audio transmission and processing, particularly in low-bitrate scenarios. Its ability to maintain high fidelity while reducing bandwidth requirements could enhance applications in streaming services, telecommunications, and audio processing in resource-constrained environments. The codec's design could also facilitate advancements in audio understanding tasks, potentially benefiting various machine learning applications.
Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band (4→16 kHz) and full-band (16→48 kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
Primary: Tampere University
All Institutions: Tampere University
This paper makes a substantial contribution to the field of audio super-resolution by introducing a novel evaluation framework that reveals critical insights into the performance of generative models. The combination of embedding-based classifiers and traditional metrics offers a more nuanced understanding of how well these models replicate real audio distributions, paving the way for future advancements in the area.
The paper introduces a novel approach to evaluating audio super-resolution models by employing embedding-based classifiers to assess the separability of real and synthetic audio samples. The methodology is well-structured, utilizing both linear classifiers and various embedding spaces, including those derived from GAN discriminators and external models like OpenL3 and log-Mel spectrograms. This dual approach allows for a comprehensive analysis of the performance of different ADSR models across multiple tasks, which is a significant advancement over traditional evaluation metrics that often overlook perceptual fidelity.
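The separability test itself is simple to reproduce in spirit: train a linear probe on fixed embeddings of real versus super-resolved clips and measure how well it separates them, as in the sketch below. Embedding extraction is abstracted away, and the train/test split and probe settings are assumptions.

    # Linear-probe separability of real vs. super-resolved audio embeddings.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    def separability(real_emb, fake_emb):
        # real_emb, fake_emb: (N, d) arrays of fixed audio embeddings
        X = np.vstack([real_emb, fake_emb])
        y = np.concatenate([np.zeros(len(real_emb)), np.ones(len(fake_emb))])
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # near 1.0 = separable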
The experiments are robust, involving multiple datasets (VCTK and FMA-small) and a variety of models (AudioUNet, MU-GAN, HiFi-GAN, FlowHigh, and FlashSR). The use of both objective metrics (SNR, LSD) and subjective listening tests (MUSHRA) provides a well-rounded evaluation of the models' performance. The results indicate a clear distinction between real and synthetic audio, even when the latter achieves high perceptual quality, which underscores the effectiveness of the proposed evaluation framework.
The paper provides sufficient implementation details, including training protocols, hyperparameters, and the use of publicly available code for models like HiFi-GAN, FlowHigh, and FlashSR. This transparency enhances reproducibility, allowing other researchers to replicate the experiments and validate the findings. The availability of the code and demo further supports this aspect.
One limitation noted is the reliance on linear classifiers, which may not capture the full complexity of the audio data. Additionally, the study primarily focuses on specific datasets, which may limit the generalizability of the findings to other audio domains or conditions. The paper also acknowledges the persistent gap between perceptual quality and distributional fidelity, indicating that further research is needed to bridge this divide.
This research has significant implications for the field of audio processing, particularly in applications involving speech and music synthesis. By highlighting the limitations of current evaluation metrics, the study encourages the development of more comprehensive assessment frameworks for generative models. This could lead to improvements in audio quality and realism in various applications, including virtual assistants, music production, and entertainment.
Recent advances in audio large language models (ALLMs) have made high-quality synthetic audio widely accessible, increasing the risk of malicious audio deepfakes across speech, environmental sounds, singing voice, and music. Real-world audio deepfake detection (ADD) therefore requires all-type detectors that generalize across heterogeneous audio and provide interpretable decisions. Given the strong multi-task generalization ability of ALLMs, we first investigate their performance on all-type ADD under both supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). However, SFT using only binary real/fake labels tends to reduce the model to a black-box classifier, sacrificing interpretability. Meanwhile, vanilla RFT under sparse supervision is prone to reward hacking and can produce hallucinated, ungrounded rationales. To address this, we propose an automatic annotation and polishing pipeline that constructs Frequency-Time structured chain-of-thought (CoT) rationales, producing ~340K cold-start demonstrations. Building on CoT data, we propose Frequency Time-Group Relative Policy Optimization (FT-GRPO), a two-stage training paradigm that cold-starts ALLMs with SFT and then applies GRPO under rule-based frequency-time constraints. Experiments demonstrate that FT-GRPO achieves state-of-the-art performance on all-type ADD while producing interpretable, FT-grounded rationales. The data and code are available online.
Primary: Communication University of China
All Institutions: Communication University of China, Institute of Automation, Ant Group, Chinese Academy of Sciences
The main contribution of this paper is the introduction of a two-stage training paradigm for audio deepfake detection that enhances model interpretability and performance across diverse audio types. This work significantly advances the field by addressing the challenges of generalization and interpretability in deepfake detection, providing a robust framework that could be applied to future research and practical applications.
The paper presents a novel approach to audio deepfake detection using audio large language models (ALLMs) through a two-stage training paradigm, FT-GRPO. The methodology is well-structured, beginning with the construction of a large dataset of rationales via an automatic annotation and polishing pipeline. This is followed by a cold-start supervised fine-tuning (SFT) phase and a reinforcement fine-tuning (RFT) phase that incorporates frequency-time constraints. The use of chain-of-thought (CoT) rationales is innovative, as it aims to enhance interpretability while addressing the limitations of traditional SFT approaches that reduce models to black-box classifiers. The integration of non-think samples to improve model robustness is also a notable contribution.
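For context on the RFT stage, the sketch below shows the group-relative advantage computation that GRPO-style training typically uses, with the rule-based frequency-time reward abstracted into a reward list; this is a generic illustration rather than the paper's implementation.

    # Group-relative advantages as used in GRPO-style post-training.
    import numpy as np

    def group_relative_advantages(rewards, eps=1e-6):
        r = np.asarray(rewards, dtype=float)      # rewards for one prompt's sampled group
        return (r - r.mean()) / (r.std() + eps)   # per-sample advantages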
The experiments are comprehensive, evaluating the proposed method against multiple audio types (speech, sound, singing, music) and demonstrating state-of-the-art performance. The authors provide detailed accuracy results across different models and training types, showcasing the effectiveness of their approach. However, the paper could benefit from a more extensive discussion of the datasets used and the specific metrics employed for evaluation beyond accuracy, such as precision, recall, or F1-score.
The paper includes a URL for accessing the code and data, which is crucial for reproducibility. The implementation details are described, including hyperparameters and training configurations. However, the paper could improve by providing a more detailed explanation of the experimental setup, such as the specific hardware used and any software dependencies.
The authors acknowledge several limitations, including the focus on a limited set of ALLMs, potential suboptimal annotations, and the non-exhaustive nature of the testing conditions. These limitations may affect the generalizability of the findings and the robustness of the model in real-world scenarios.
The research addresses a critical issue in the realm of audio deepfake detection, which has significant implications for media integrity, cybersecurity, and misinformation. By improving the interpretability and accuracy of detection methods, this work could contribute to the development of more reliable systems for identifying synthetic audio, thereby enhancing trust in audio content across various applications.
Existing large audio-language models perceive the world as "mono" -- a single stream of audio that ignores the critical spatial dimension ("where") required for universal acoustic scene analysis. To bridge this gap, we first introduce a hierarchical framework for Auditory Scene Analysis (ASA). Guided by this framework, we introduce a system that enables models like Qwen2-Audio to understand and reason about the complex acoustic world. Our framework achieves this through three core contributions: First, we build a large-scale, synthesized binaural audio dataset to provide the rich spatial cues. Second, we design a hybrid feature projector, which leverages parallel semantic and spatial encoders to extract decoupled representations. These distinct streams are integrated via a dense fusion mechanism, ensuring the model receives a holistic view of the acoustic scene. Finally, we employ a progressive training curriculum, advancing from supervised fine-tuning (SFT) to reinforcement learning via Group Relative Policy Optimization (GRPO), to explicitly evolve the model's capabilities towards reasoning. On our comprehensive benchmark, the model demonstrates comparatively strong capability for spatial understanding. By enabling this spatial perception, our work provides a clear pathway for leveraging the powerful reasoning abilities of large models towards holistic acoustic scene analysis, advancing from "mono" semantic recognition to spatial intelligence.
Primary: Peking University
All Institutions: Peking University
This paper presents a novel framework that significantly enhances spatial understanding in large audio-language models. The combination of a structured methodology, comprehensive evaluation, and potential applications marks a meaningful contribution to the field of machine learning and audio processing.
The proposed methodology introduces a hierarchical framework for Auditory Scene Analysis (ASA), which is a significant advancement in the field of audio-language models. The authors effectively decouple semantic and spatial representations through a hybrid feature projector and employ a progressive training curriculum that transitions from supervised fine-tuning to reinforcement learning. This structured approach not only enhances the model's ability to understand spatial audio but also provides a clear pathway for integrating reasoning capabilities into large audio-language models. However, the reliance on synthetic data for training raises questions about the generalizability of the results to real-world scenarios.
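The sketch below gives one minimal reading of the hybrid feature projector: parallel semantic and spatial token streams are projected into the language model's embedding space and densely fused by concatenation; the encoder outputs and dimensions are assumptions, not the paper's exact design.

    # Illustrative hybrid projector fusing semantic and spatial token streams.
    import torch
    import torch.nn as nn

    class HybridProjector(nn.Module):
        def __init__(self, d_sem=1024, d_spa=256, d_llm=3584):
            super().__init__()
            self.proj_sem = nn.Linear(d_sem, d_llm)
            self.proj_spa = nn.Linear(d_spa, d_llm)
            self.fuse = nn.Linear(2 * d_llm, d_llm)

        def forward(self, sem_tokens, spa_tokens):
            # sem_tokens: (B, T, d_sem) from a semantic audio encoder
            # spa_tokens: (B, T, d_spa) from a binaural/spatial encoder
            s = self.proj_sem(sem_tokens)
            p = self.proj_spa(spa_tokens)
            return self.fuse(torch.cat([s, p], dim=-1))  # fused tokens for the LLM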
The experiments are well-structured, utilizing a comprehensive benchmark designed to evaluate spatial reasoning capabilities across three layers of the ASA framework. The results demonstrate a clear improvement over existing models, particularly in complex reasoning tasks. However, the paper could benefit from additional comparisons with a wider range of state-of-the-art models and a more detailed analysis of the failure cases, especially in the Perception Count task.
The authors commit to reproducibility by providing access to datasets and implementation details. However, the lack of a publicly available code repository or demo limits the ability of other researchers to fully replicate the experiments. The detailed training setup and hyperparameter configurations are commendable, but a more accessible format for sharing code would enhance reproducibility.
One notable limitation is the model's performance in the Perception Count task, which remains significantly lower than other tasks. This suggests that while the model excels in spatial semantics, it struggles with counting or quantifying sound sources. Additionally, the use of synthetic data may not capture the complexities of real-world audio environments, potentially limiting the model's applicability in practical scenarios.
The work has significant implications for various applications, including robotics, augmented reality, and assistive technologies, where spatial audio understanding is crucial. By advancing the capabilities of audio-language models to include spatial reasoning, this research paves the way for more sophisticated auditory scene analysis, which could enhance user experiences in immersive environments.
As audio deepfakes transition from research artifacts to widely available commercial tools, robust biometric authentication faces pressing security threats in high-stakes industries. This paper presents a systematic empirical evaluation of state-of-the-art speaker authentication systems based on a large-scale speech synthesis dataset, revealing two major security vulnerabilities: 1) modern voice cloning models trained on very small samples can easily bypass commercial speaker verification systems; and 2) anti-spoofing detectors struggle to generalize across different methods of audio synthesis, leading to a significant gap between in-domain performance and real-world robustness. These findings call for a reconsideration of security measures and stress the need for architectural innovations, adaptive defenses, and the transition towards multi-factor authentication.
Primary: Hong Kong Polytechnic University
All Institutions: Hong Kong Polytechnic University
The paper presents a systematic evaluation of audio-based biometric authentication systems against deepfake speech synthesis, revealing critical vulnerabilities that necessitate a reevaluation of current security measures. The comprehensive methodology and significant findings underscore the urgent need for enhanced defenses in the face of rapidly evolving audio synthesis technologies.
The paper employs a systematic empirical evaluation framework that integrates state-of-the-art speaker verification models and anti-spoofing detectors against various voice cloning systems. The methodology is robust, utilizing a large-scale benchmark dataset and multiple representative voice cloning systems to assess vulnerabilities. The authors detail the architectures used, including ECAPA-TDNN for speaker verification and XLS-R combined with AASIST for deepfake detection, demonstrating a well-thought-out approach to evaluate the effectiveness of these systems under realistic attack scenarios.
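Evaluations of this kind typically report the equal error rate (EER) over genuine and spoofed trials; the sketch below shows a standard way to compute it from raw verification scores, independent of the specific models used in the paper.

    # Equal error rate from genuine (target) and impostor (cloned) trial scores.
    import numpy as np

    def compute_eer(genuine, impostor):
        genuine, impostor = np.asarray(genuine), np.asarray(impostor)
        thresholds = np.sort(np.concatenate([genuine, impostor]))
        far = np.array([(impostor >= t).mean() for t in thresholds])  # false accept rate
        frr = np.array([(genuine < t).mean() for t in thresholds])    # false reject rate
        idx = np.argmin(np.abs(far - frr))
        return (far[idx] + frr[idx]) / 2.0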
The experiments are comprehensive, covering both in-domain and out-of-domain scenarios, which is crucial for understanding the generalization capabilities of the models. The results indicate significant vulnerabilities, particularly in the context of unseen synthesis methods, which is a critical finding for the field. The use of diverse synthesis paradigms in the evaluation enhances the relevance of the findings, although specific numerical results and tables referenced in the text are not provided in the excerpt.
The authors commit to releasing their code and dataset upon acceptance, which is a positive aspect for reproducibility. They provide detailed implementation specifics, including training epochs, batch sizes, and optimization techniques, which further support reproducibility. However, the actual performance metrics and results tables are not included in the provided text, which could hinder full reproducibility without access to the complete paper.
The paper acknowledges limitations, particularly regarding the scale of training data and the complexity of cross-lingual evaluations. While the study focuses on small sample sizes for voice cloning, it suggests that larger datasets could yield different insights. Additionally, the challenges of generalization across languages and the reliance on specific training conditions for deepfake detection are noted as areas for further exploration.
This research has significant implications for the security of audio-based biometric systems, especially in high-stakes industries. The findings highlight the urgent need for improved defenses against evolving deepfake technologies, suggesting that current systems may not be sufficient to protect against sophisticated attacks. The call for multi-factor authentication and architectural innovations could lead to advancements in the field, influencing both academic research and practical applications in security.
With the development of teleconferencing and in-vehicle voice assistants, far-field multi-speaker speech recognition has become a hot research topic. Recently, a multi-channel transformer (MCT) has been proposed, which demonstrates the ability of the transformer to model far-field acoustic environments. However, MCT cannot encode high-dimensional acoustic features for each speaker from mixed input audio because of the interference between speakers. Motivated by this limitation, we propose the multi-channel multi-speaker transformer (M2Former) for far-field multi-speaker ASR. Experiments on the SMS-WSJ benchmark show that the M2Former outperforms the neural beamformer, MCT, dual-path RNN with transform-average-concatenate, and multi-channel deep clustering based end-to-end systems by 9.2%, 14.3%, 24.9%, and 52.2%, respectively, in terms of relative word error rate reduction.
Primary: AI Engineering System
All Institutions: AI Engineering System
The paper presents a significant advancement in multi-speaker speech recognition by introducing the M2Former, which effectively decouples speaker-specific features and reduces interference, thereby improving performance in challenging acoustic environments.
The proposed M2Former introduces a novel architecture that combines multi-channel processing with a transformer-based approach, effectively addressing the challenge of interference in multi-speaker scenarios. The use of 2D CNN for feature decoupling, along with the innovative multi-channel multi-speaker attention (M2A) mechanism, allows for better contextual encoding of speaker-specific features. This methodology is well-justified, with clear explanations of how each component contributes to the overall performance, particularly in mitigating interference from overlapping speakers.
The experiments conducted on the SMS-WSJ benchmark are comprehensive, showcasing the effectiveness of M2Former against several baseline models. The reported improvements in word error rate reduction are significant, particularly the 52.2% improvement over multi-channel deep clustering systems. The ablation studies further validate the contributions of individual components, reinforcing the robustness of the proposed method.
While the paper provides a detailed description of the architecture and experimental setup, including configurations for baselines and the proposed model, it lacks specific implementation details such as code availability or links to datasets. This limits the reproducibility of the results, as external researchers may struggle to replicate the findings without access to the exact implementations.
One limitation is the reliance on the SMS-WSJ dataset, which may not fully represent the diversity of real-world multi-speaker environments. Additionally, while the model shows promise with varying numbers of speakers, its performance with a significantly larger number of speakers or in more complex acoustic environments remains untested.
The advancements in multi-speaker speech recognition have significant implications for applications in teleconferencing and voice assistants, enhancing user experience in noisy environments. The proposed model's ability to effectively separate and recognize speech from multiple sources could lead to more robust and efficient voice interaction systems in various domains.
Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. Code and dataset are publicly available at https://github.com/yfyeung/CLSP.
Primary: Tencent Hunyuan
All Institutions: Tencent Hunyuan
The main contribution of this paper is the introduction of FCaps, a large-scale dataset for fine-grained speech-text representation, and the CLSP model that leverages this dataset to achieve superior performance in various speech-related tasks. This work represents a meaningful step forward in bridging the gap between language and speech processing, addressing a critical challenge in the field.
The methodology presented in this paper is innovative, particularly in the development of the FCaps dataset, which utilizes a novel end-to-end pipeline for generating fine-grained style annotations directly from audio. This approach mitigates the common issues associated with cascaded pipelines, such as error propagation. The integration of global and fine-grained supervision in the CLSP model is a significant advancement, allowing for a more nuanced understanding of speech-text relationships. However, the paper could benefit from a more detailed description of the end-to-end pipeline and the specific techniques used for grounding captions in audio.
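A minimal sketch of how global and fine-grained contrastive terms could be combined is given below. The temperature, weighting, and symmetric InfoNCE formulation are assumptions for illustration; the paper's exact objective and how aligned spans are extracted are not reproduced here.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched rows of a and b, both of shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_granular_loss(speech_global, text_global, speech_spans, text_spans, w_fine: float = 1.0):
    """Global utterance-caption term plus a fine-grained span-level term."""
    return info_nce(speech_global, text_global) + w_fine * info_nce(speech_spans, text_spans)
```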
The experiments conducted are extensive and cover a range of tasks, including speech-text retrieval and paralinguistic classification. The use of LLM-as-a-judge for evaluating the quality of annotations is a novel approach that adds credibility to the results. The paper presents strong empirical evidence demonstrating the effectiveness of CLSP across various tasks, with results that align well with human judgments. However, the paper could improve by providing more comparative analyses against state-of-the-art models to better contextualize its contributions.
The authors have made the code and dataset publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics that would facilitate replication of the experiments. Including hyperparameters, training procedures, and evaluation metrics in a more structured manner would enhance reproducibility.
One limitation is the reliance on the quality of the fine-grained annotations, which, while improved, may still be subject to biases inherent in the dataset creation process. Additionally, the scalability of the proposed methods to other languages or dialects is not addressed, which could limit the applicability of the findings. The paper also does not discuss potential ethical implications of using large-scale datasets in this context.
The advancements in fine-grained language-speech representation have significant implications for various applications, including speech recognition, virtual assistants, and accessibility technologies. By improving the understanding of speaking styles, this research could enhance user interactions with AI systems, making them more natural and effective. The approach could also inspire future research in related areas, promoting further innovations in multimodal learning.
Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, it is well known that acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing research typically relies on a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization, as parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines with comparable computational costs. Furthermore, we will release the related code and models to facilitate future research.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the MoE-Adapter, a novel architecture that mitigates gradient conflicts in audio language models by employing a dynamic gating mechanism to specialize expert models for diverse acoustic features. This advancement is poised to significantly enhance the performance of multimodal perception systems in machine learning.
The proposed MoE-Adapter introduces a novel sparse Mixture-of-Experts architecture that effectively addresses the issue of gradient conflict in audio language models. The dynamic gating mechanism is particularly noteworthy as it allows for the specialization of experts in capturing diverse acoustic features, which is a significant advancement over traditional dense parameter-sharing approaches. The methodology is well-structured, and the focus on disentangling heterogeneous audio information is a critical contribution to the field.
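The core routing idea is easy to show in code. The sketch below is a generic top-k routed adapter with an always-on shared expert; the expert count, top-k value, linear experts, and the omission of a load-balancing loss are simplifying assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAdapter(nn.Module):
    """Sketch of a sparse MoE adapter: top-k routed experts plus a shared expert."""

    def __init__(self, d_in: int, d_out: int, n_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_in, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_in, d_out) for _ in range(n_experts)])
        self.shared = nn.Linear(d_in, d_out)  # always-on expert for global context

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, T, d_in)
        scores = self.gate(tokens)                              # (B, T, E)
        weights, idx = scores.topk(self.top_k, dim=-1)          # route each token to top-k experts
        weights = F.softmax(weights, dim=-1)
        # For clarity every expert is evaluated densely and then gathered; a real
        # sparse implementation would dispatch only the routed tokens to each expert.
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-2)      # (B, T, E, D)
        gathered = torch.gather(
            expert_out, -2, idx.unsqueeze(-1).expand(-1, -1, -1, expert_out.size(-1))
        )                                                       # (B, T, k, D)
        routed = (weights.unsqueeze(-1) * gathered).sum(dim=-2)
        return routed + self.shared(tokens)
```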
The experiments conducted are comprehensive, demonstrating the MoE-Adapter's superior performance on various audio semantic and paralinguistic tasks. The results are compelling, showing consistent improvements over dense linear baselines while maintaining comparable computational costs. However, further details on the datasets used and the specific metrics for evaluation would enhance the robustness of the experimental validation.
The paper mentions the intention to release code and models, which is a positive aspect for reproducibility. However, without access to the full implementation details in the paper itself, it is difficult to fully assess how easily others can replicate the results. Clear documentation and availability of the code will be crucial for future researchers.
One limitation noted is the potential complexity introduced by the dynamic gating mechanism, which may require careful tuning. Additionally, while the paper addresses gradient conflicts, it does not explore the trade-offs between model size and performance in depth, which could be an area for further investigation.
The MoE-Adapter has significant implications for the development of multimodal models that can process audio inputs effectively. By improving the handling of diverse acoustic information, this research could enhance applications in speech recognition, music analysis, and environmental sound classification, thereby broadening the scope of audio language models in real-world scenarios.
Advanced speech synthesis technologies have enabled highly realistic speech generation, posing security risks that motivate research into audio deepfake detection (ADD). While state space models (SSMs) offer linear complexity, pure causal SSM architectures often struggle with the content-based retrieval required to capture global frequency-domain artifacts. To address this, we explore the scaling properties of hybrid architectures by proposing XLSR-MamBo, a modular framework integrating an XLSR front-end with synergistic Mamba-Attention backbones. We systematically evaluate four topological designs using advanced SSM variants: Mamba, Mamba2, Hydra, and Gated DeltaNet. Experimental results demonstrate that the MamBo-3-Hydra-N3 configuration achieves competitive performance compared to other state-of-the-art systems on the ASVspoof 2021 LA, DF, and In-the-Wild benchmarks. This performance benefits from Hydra's native bidirectional modeling, which captures holistic temporal dependencies more efficiently than the heuristic dual-branch strategies employed in prior works. Furthermore, evaluations on the DFADD dataset demonstrate robust generalization to unseen diffusion- and flow-matching-based synthesis methods. Crucially, our analysis reveals that scaling backbone depth effectively mitigates the performance variance and instability observed in shallower models. These results demonstrate the hybrid framework's ability to capture artifacts in spoofed speech signals, providing an effective method for ADD.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the XLSR-MamBo framework, which effectively combines SSM and Attention mechanisms for improved audio deepfake detection. This work represents a significant advancement in the field, addressing critical challenges in detecting sophisticated spoofing techniques while providing a modular and scalable architecture for future research.
The proposed XLSR-MamBo framework innovatively integrates XLSR with hybrid SSM-Attention architectures, effectively addressing the limitations of traditional models in audio deepfake detection. The systematic evaluation of various topological designs and the exploration of advanced SSM variants demonstrate a thoughtful approach to enhancing model performance. The use of Hydra for bidirectional modeling is particularly noteworthy, as it allows for capturing complex dependencies without the redundancy of previous methods.
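The interleaving of sequence-modeling blocks with self-attention can be sketched as follows. A bidirectional GRU stands in for the SSM/Hydra layers, whose APIs differ across libraries, and the depth, ordering, and pooling head are assumptions rather than the MamBo-3-Hydra-N3 topology.

```python
import torch
import torch.nn as nn

class BiSeqBlock(nn.Module):
    """Stand-in for a bidirectional SSM (e.g. Hydra) block; here a BiGRU with a residual."""
    def __init__(self, d_model: int):
        super().__init__()
        self.rnn = nn.GRU(d_model, d_model // 2, batch_first=True, bidirectional=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y, _ = self.rnn(x)
        return self.norm(x + y)

class HybridBackbone(nn.Module):
    """Alternate bidirectional sequence blocks with self-attention blocks."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, depth: int = 3):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers.append(BiSeqBlock(d_model))
            layers.append(nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True))
        self.layers = nn.ModuleList(layers)
        self.head = nn.Linear(d_model, 2)        # bona fide vs. spoof

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # feats: (B, T, D), e.g. XLSR features
        for layer in self.layers:
            feats = layer(feats)
        return self.head(feats.mean(dim=1))      # utterance-level logits
```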
The experiments are comprehensive, utilizing multiple datasets (ASVspoof 2021 LA, DF, ITW, DFADD) to assess generalization and robustness. The results indicate that the MamBo-3-Hydra-N3 configuration achieves competitive performance, with detailed metrics reported, including EER and min t-DCF. However, the evaluation could benefit from comparisons against a broader range of state-of-the-art models to contextualize the findings further.
The paper provides a GitHub repository link for code access, which is essential for reproducibility. However, detailed implementation specifics, such as hyperparameter settings and training procedures, could be more thoroughly documented to facilitate replication of results.
The study acknowledges limitations, including reliance on a single-source training paradigm and potential linguistic bias due to the focus on English datasets. The rapid convergence observed during training raises questions about the generalization capabilities of the models, which should be explored in future work.
The implications of this research are significant, as it addresses the growing concern of audio deepfakes, which pose risks in misinformation and security. The proposed framework could enhance the reliability of voice biometric systems and contribute to the development of more robust anti-spoofing technologies.
While controllable Text-to-Speech (TTS) has achieved notable progress, most existing methods remain limited to inter-utterance-level control, making fine-grained intra-utterance expression challenging due to their reliance on non-public datasets or complex multi-stage training. In this paper, we propose a training-free controllable framework for pretrained zero-shot TTS to enable intra-utterance emotion and duration expression. Specifically, we propose a segment-aware emotion conditioning strategy that combines causal masking with monotonic stream alignment filtering to isolate emotion conditioning and schedule mask transitions, enabling smooth intra-utterance emotion shifts while preserving global semantic coherence. Based on this, we further propose a segment-aware duration steering strategy to combine local duration embedding steering with global EOS logit modulation, allowing local duration adjustment while ensuring globally consistent termination. To eliminate the need for segment-level manual prompt engineering, we construct a 30,000-sample multi-emotion and duration-annotated text dataset to enable LLM-based automatic prompt construction. Extensive experiments demonstrate that our training-free method not only achieves state-of-the-art intra-utterance consistency in multi-emotion and duration control, but also maintains baseline-level speech quality of the underlying TTS model. Audio samples are available at https://aclanonymous111.github.io/TED-TTS-DemoPage/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a training-free framework for intra-utterance emotion and duration control in TTS, which significantly advances the state of the art in controllable speech synthesis. The innovative methodologies and extensive evaluation demonstrate the potential for practical applications, although challenges in reproducibility and generalization remain.
The proposed methodology introduces a segment-aware emotion conditioning strategy that innovatively combines causal masking with monotonic stream alignment filtering. This approach effectively isolates emotion conditioning and manages mask transitions, which is a novel contribution to the field of TTS. The segment-aware duration steering strategy further enhances the model's capability by allowing local duration adjustments while ensuring global consistency. However, the reliance on a large annotated dataset for automatic prompt construction raises questions about the generalizability of the approach in low-resource scenarios.
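The shape of a segment-aware conditioning mask can be illustrated with a short helper: each frame is allowed to attend only to the emotion prompt of the segment it falls in. Boundary smoothing, the monotonic alignment filtering, and the exact attention form are omitted; the function name and layout are assumptions.

```python
import torch

def segment_emotion_mask(n_frames: int, seg_bounds: list, n_segments: int) -> torch.Tensor:
    """Boolean mask of shape (n_frames, n_segments): frame i attends only to the
    emotion condition of its own segment. seg_bounds are frame indices where a
    new segment starts (excluding 0); transition smoothing is not modeled here."""
    mask = torch.zeros(n_frames, n_segments, dtype=torch.bool)
    starts = [0] + list(seg_bounds)
    ends = list(seg_bounds) + [n_frames]
    for seg_id, (s, e) in enumerate(zip(starts, ends)):
        mask[s:e, seg_id] = True
    return mask

# Example: 100 frames, boundary at frame 60, two emotion prompts.
m = segment_emotion_mask(100, [60], 2)   # frames 0-59 -> emotion 0, frames 60-99 -> emotion 1
```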
The experiments are extensive and demonstrate the effectiveness of the proposed methods in achieving state-of-the-art performance in intra-utterance emotion and duration control. The authors provide audio samples that allow for qualitative assessment of the results, which is a strong point. However, the paper could benefit from more quantitative comparisons with existing methods to better highlight the improvements achieved.
The paper does not provide detailed implementation specifics, which could hinder reproducibility. While the dataset is mentioned, the lack of a public repository or code sharing limits the ability for other researchers to replicate the results. Clearer documentation and availability of code would enhance reproducibility.
The main limitations include the potential overfitting to the constructed dataset, as it may not generalize well to other datasets or real-world applications. Additionally, the complexity of the proposed methods may pose challenges in practical deployment, especially in low-latency scenarios.
This research has significant implications for the development of controllable TTS systems, particularly in applications requiring nuanced emotional expression, such as virtual assistants, audiobooks, and entertainment. The ability to control emotion and duration within utterances can enhance user experience and engagement, making this work relevant to both academic and commercial sectors.
The rapid advancement of speech synthesis technologies, including text-to-speech (TTS) and voice conversion (VC), has intensified security and privacy concerns related to voice cloning. Recent defenses attempt to prevent unauthorized cloning by embedding protective perturbations into speech to obscure speaker identity while maintaining intelligibility. However, adversaries can apply advanced purification techniques to remove these perturbations, recover authentic acoustic characteristics, and regenerate cloneable voices. Despite the growing realism of such attacks, the robustness of existing defenses under adaptive purification remains insufficiently studied. Most existing purification methods are designed to counter adversarial noise in automatic speech recognition (ASR) systems rather than speaker verification or voice cloning pipelines. As a result, they fail to suppress the fine-grained acoustic cues that define speaker identity and are often ineffective against speaker verification attacks (SVA). To address these limitations, we propose Diffusion-Bridge (VocalBridge), a purification framework that learns a latent mapping from perturbed to clean speech in the EnCodec latent space. Using a time-conditioned 1D U-Net with a cosine noise schedule, the model enables efficient, transcript-free purification while preserving speaker-discriminative structure. We further introduce a Whisper-guided phoneme variant that incorporates lightweight temporal guidance without requiring ground-truth transcripts. Experimental results show that our approach consistently outperforms existing purification methods in recovering cloneable voices from protected speech. Our findings demonstrate the fragility of current perturbation-based defenses and highlight the need for more robust protection mechanisms against evolving voice-cloning and speaker verification threats.
Primary: University of Texas at San Antonio
All Institutions: University of Texas at San Antonio
The main contribution of this paper is the introduction of VocalBridge, a novel purification framework that effectively addresses the challenges posed by perturbation-based defenses in voice cloning. This work significantly advances the understanding of vulnerabilities in voice synthesis technologies and proposes a sophisticated method to counteract these threats.
The paper introduces a novel purification framework called VocalBridge that utilizes a latent mapping approach to address the limitations of existing perturbation-based defenses against voice cloning. The use of a time-conditioned 1D U-Net with a cosine noise schedule is innovative, allowing for efficient purification without the need for transcripts. The incorporation of a Whisper-guided phoneme variant adds a layer of sophistication by leveraging temporal guidance, which is a notable advancement in the field.
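For readers unfamiliar with the noise-schedule component, the sketch below shows a standard cosine cumulative schedule and the corresponding forward-diffusion step on latent codes; the bridge direction (perturbed-to-clean mapping) and the time-conditioned 1D U-Net itself are only described in comments, and the assumed (B, C, T) latent layout is not taken from the paper.

```python
import math
import torch

def cosine_alpha_bar(t: torch.Tensor, s: float = 0.008) -> torch.Tensor:
    """Cosine cumulative noise schedule; t is a tensor of values in [0, 1]."""
    return torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2

def q_sample(z_clean: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Diffuse clean latents z_clean of shape (B, C, T) toward noise at timestep t."""
    a_bar = cosine_alpha_bar(t).view(-1, 1, 1)
    return a_bar.sqrt() * z_clean + (1 - a_bar).sqrt() * noise

# Training-step idea (assumed): a time-conditioned 1D U-Net sees the noised latent of the
# perturbed utterance and is regressed toward the paired clean latent, learning the
# perturbed-to-clean mapping in the codec latent space.
```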
The experimental results presented in the paper demonstrate that VocalBridge consistently outperforms existing purification methods. However, the paper could benefit from a more extensive evaluation across diverse datasets and real-world scenarios to validate the robustness of the proposed method. The metrics used for evaluation are not detailed in the abstract, which could limit the understanding of the comparative performance.
The paper lacks sufficient implementation details that would facilitate reproducibility. Key aspects such as hyperparameter settings, training procedures, and the datasets used for evaluation are not provided in the abstract. This omission could hinder other researchers from replicating the results.
One significant limitation is the focus on a specific type of perturbation-based defense, which may not encompass all possible variations in voice cloning defenses. Additionally, the reliance on a single architecture (1D U-Net) may limit the generalizability of the findings. The paper does not address potential computational costs or the scalability of the proposed method in real-time applications.
The implications of this research are substantial, as it highlights vulnerabilities in current voice cloning defenses and underscores the need for more robust security measures in voice authentication systems. The findings could influence future research directions in both voice synthesis technologies and security protocols.
Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of MM-Sonate, a unified framework for multimodal audio-video generation that achieves state-of-the-art performance and introduces innovative techniques for zero-shot voice cloning. This paper significantly advances the field by addressing critical challenges in audio-video synchronization and speaker identity preservation, paving the way for more sophisticated generative models.
The proposed methodology, MM-Sonate, introduces a unified multimodal flow-matching framework that integrates audio-video generation with zero-shot voice cloning capabilities. The use of a unified instruction-phoneme input format is innovative, allowing for precise synchronization and control over the generated outputs. The timbre injection mechanism effectively decouples speaker identity from linguistic content, which is a significant advancement over existing models that struggle with this aspect. The introduction of a noise-based negative conditioning strategy enhances acoustic fidelity, addressing limitations in traditional classifier-free guidance approaches. Overall, the methodology is well-structured and addresses key challenges in the field.
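One way to picture the noise-based negative conditioning is as a classifier-free-guidance-style combination in which the negative branch is conditioned on a natural-noise prior rather than a null embedding. The sketch below assumes a generic `model(x_t, t, cond)` interface returning a flow-matching velocity; this signature and the guidance weight are illustrative, not MM-Sonate's actual API.

```python
import torch

def guided_velocity(model, x_t: torch.Tensor, t: torch.Tensor,
                    cond: torch.Tensor, noise_cond: torch.Tensor, w: float = 3.0) -> torch.Tensor:
    """CFG-style combination with a noise-conditioned negative branch."""
    v_pos = model(x_t, t, cond)        # instruction-phoneme (+ timbre) condition
    v_neg = model(x_t, t, noise_cond)  # natural-noise negative condition
    return v_neg + w * (v_pos - v_neg)
```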
The empirical evaluations are robust, demonstrating that MM-Sonate achieves state-of-the-art performance across various benchmarks, particularly in lip synchronization and speech intelligibility. The paper provides comprehensive comparisons against existing models, showcasing significant improvements in both objective metrics and human preference evaluations. The extensive dataset preparation, including a high-fidelity synthetic dataset and a large-scale multimodal pre-training corpus, supports the model's generalization capabilities across diverse generation tasks.
The paper includes detailed implementation details, including architecture specifications, training strategies, and evaluation metrics. However, the lack of a publicly available code repository or demo limits reproducibility. The methodology is described in sufficient detail for other researchers to replicate the experiments, but access to the actual model and datasets would enhance reproducibility further.
The paper acknowledges several limitations, including challenges in generating long-form content and potential loss of high-frequency details due to compression. Additionally, the model may struggle with extreme head poses or occlusions, leading to synchronization issues. Ethical concerns regarding the misuse of voice cloning technology are also highlighted, emphasizing the need for responsible deployment.
MM-Sonate has significant potential applications in various domains, including entertainment, education, and virtual reality, where personalized audio-visual content is valuable. However, the ethical implications of voice cloning and the potential for misuse in creating deepfakes and misinformation must be carefully managed. The authors propose mitigation strategies, such as embedding watermarks in generated content, which is a proactive approach to addressing these concerns.
Existing fraud detection methods predominantly rely on transcribed text, suffering from ASR errors and missing crucial acoustic cues such as vocal tone and environmental context, which limits their effectiveness against complex deceptive strategies. To address these challenges, we propose SAFE-QAQ, a comprehensive end-to-end framework for audio-based slow-thinking fraud detection. First, the SAFE-QAQ framework eliminates the impact of transcription errors on detection performance. Second, we propose rule-based slow-thinking reward mechanisms that systematically guide the system to identify fraud-indicative patterns by accurately capturing fine-grained audio details through hierarchical reasoning processes. In addition, the framework introduces a dynamic risk assessment mechanism during live calls, enabling early detection and prevention of fraud. Experiments on TeleAntiFraud-Bench demonstrate that SAFE-QAQ achieves substantial improvements over existing methods across multiple key dimensions, including accuracy, inference efficiency, and real-time processing capability. Currently deployed and analyzing over 70,000 calls daily, SAFE-QAQ effectively automates complex fraud detection, reducing human workload and financial losses. Code: https://anonymous.4open.science/r/SAFE-QAQ.
Primary: Shanghai University of Electric Power
All Institutions: Northeastern University, China Mobile Internet Company Ltd, Shanghai University of Electric Power, Peking University
The main contribution of this paper is the development of the SAFE-QAQ framework, which significantly enhances audio-based fraud detection by integrating reinforcement learning with slow-thinking processes, thereby addressing the limitations of traditional ASR-based methods. The technical contributions are substantial, with a well-defined methodology and rigorous experimental validation that positions this work as a notable advancement in the field of machine learning for fraud detection.
The methodology is robust, introducing the SAFE-QAQ framework that integrates reinforcement learning with slow-thinking mechanisms to enhance audio-text fraud detection. The use of rule-based rewards and dynamic risk assessment during live calls is innovative, allowing the model to capture nuanced audio features that traditional ASR-based systems miss. The hierarchical reasoning process is well-structured, and the transition from raw audio processing to real-time fraud detection is a significant advancement in the field.
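A toy rule-based reward in the spirit of such schemes is sketched below: format adherence for a visible reasoning trace plus correctness of the final verdict. The tag names, weights, and checks are assumptions, since the paper's actual reward rules are not reproduced in the abstract.

```python
import re

def slow_thinking_reward(response: str, label: str) -> float:
    """Toy rule-based reward: reasoning-format adherence plus verdict correctness."""
    reward = 0.0
    # Reward a visible reasoning trace before the verdict (assumed <think> tags).
    if re.search(r"<think>.*</think>", response, flags=re.S):
        reward += 0.2
    verdict = re.search(r"<answer>(fraud|legitimate)</answer>", response)
    if verdict:
        reward += 0.2                  # well-formed answer tag
        if verdict.group(1) == label:
            reward += 1.0              # correct fraud decision
    return reward

print(slow_thinking_reward("<think>caller pressures for OTP</think><answer>fraud</answer>", "fraud"))
```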
The experiments are comprehensive, utilizing the TeleAntiFraud-Bench dataset to demonstrate the effectiveness of the SAFE-QAQ framework. The reported improvements in accuracy, efficiency, and real-time processing capabilities are substantial, with detailed comparisons against various baseline models. The results indicate a clear performance hierarchy, showcasing the advantages of the proposed approach over existing methods.
The paper provides detailed implementation details, including hyperparameter settings and the computational environment used for experiments. However, the lack of a publicly accessible dataset for broader testing may hinder full reproducibility. The code repository link is provided, which aids in replicating the results.
One notable limitation is the reliance on the TeleAntiFraud-28k dataset, which may not encompass the full diversity of real-world fraud scenarios. This could restrict the generalization of the model's effectiveness across varied acoustic conditions. Additionally, the paper acknowledges potential algorithmic biases and emphasizes the need for continuous monitoring in practical applications.
The SAFE-QAQ framework has significant implications for the telecom industry, offering a more effective tool for fraud detection that can reduce financial losses and human workload. Its deployment in real-time systems could enhance the security of telecommunications, making it a valuable contribution to both academia and industry.
Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized, human-like gestures. Existing methods often suffer from issues such as rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a dual-stream Diffusion Transformer (DiT) architecture to synthesize holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, and (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat consistency metric that is less sensitive to motion noise. Comprehensive experiments on the BEAT2 and SHOW datasets demonstrate SmoothSync's superiority: on BEAT2 it reduces FGD by 30.6% and improves Smooth-BC by 10.3% and Diversity by 8.4% over state-of-the-art methods, while cutting jitter and foot sliding by 62.9% and 17.1%, respectively. The code will be released to facilitate future research.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University
The main contribution of this paper is the introduction of the SmoothSync framework, which utilizes a dual-stream diffusion transformer architecture to generate high-quality, beat-synchronized co-speech gestures while addressing critical limitations in existing methods. The comprehensive evaluation of the proposed methodology demonstrates its effectiveness and potential impact on the field of gesture generation and human-computer interaction.
The proposed SmoothSync framework introduces a dual-stream diffusion transformer architecture that effectively synchronizes audio and motion features, addressing key limitations in existing gesture generation methods. The methodology is innovative, incorporating a jitter-suppression loss and probabilistic audio quantization to enhance gesture diversity and temporal smoothness. The dual-stream architecture allows for modality-specific processing, which is a significant advancement over previous methods that either concatenate or process modalities independently. The introduction of the Smooth-BC metric for evaluating rhythmic alignment further strengthens the methodology by providing a more reliable assessment of synchronization under jitter conditions.
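A plausible form of a jitter-suppression term is a penalty on second-order temporal differences (acceleration) of the generated motion; the sketch below shows that form only as an illustration, not necessarily the exact loss used in SmoothSync.

```python
import torch

def jitter_suppression_loss(motion: torch.Tensor) -> torch.Tensor:
    """Penalize acceleration of generated motion of shape (B, T, J)."""
    vel = motion[:, 1:] - motion[:, :-1]   # frame-to-frame velocity
    acc = vel[:, 1:] - vel[:, :-1]         # second-order difference (jitter proxy)
    return acc.pow(2).mean()

loss = jitter_suppression_loss(torch.randn(2, 120, 165))
```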
The experiments conducted on the BEAT2 and SHOW datasets demonstrate the effectiveness of SmoothSync, achieving state-of-the-art results across multiple metrics, including Fréchet Gesture Distance (FGD), Smooth-BC, and diversity measures. The quantitative results indicate substantial improvements in motion quality and rhythmic alignment, with detailed comparisons to existing methods. The paper provides comprehensive evaluations, including both qualitative and quantitative analyses, showcasing the model's robustness and generalization capabilities across different datasets.
The paper mentions that the code will be released to facilitate future research, which is a positive aspect for reproducibility. Moreover, specific implementation details, such as hyperparameters and training configurations, are provided, allowing for a clearer understanding of the model's training process. The use of standard datasets and metrics also supports reproducibility, although the absence of a direct URL for the code repository limits immediate access.
While the paper presents significant advancements, it does not thoroughly address potential limitations, such as the scalability of the model to larger datasets or real-time applications beyond the evaluated scenarios. Additionally, the reliance on specific datasets may limit the generalizability of the findings. The paper could benefit from a more explicit discussion of the computational requirements and potential trade-offs in model complexity versus performance.
The SmoothSync framework has the potential to significantly enhance applications in virtual avatars, embodied AI systems, and other areas requiring synchronized gesture generation. By improving the quality and diversity of generated gestures, this research could contribute to more natural human-computer interactions and improve user experience in various applications, including gaming, virtual reality, and telecommunication.
The development of audio foundation models has accelerated rapidly since the emergence of GPT-4o. However, the lack of comprehensive evaluation has become a critical bottleneck for further progress in the field, particularly in audio generation. Current audio evaluation faces three major challenges: (1) the field lacks a unified framework, with datasets and code scattered across various sources, hindering fair and efficient cross-model comparison; (2) audio codecs, as a key component of audio foundation models, lack a widely accepted and holistic evaluation methodology; (3) existing speech benchmarks are heavily reliant on English, making it challenging to objectively assess models' performance on Chinese. To address the first issue, we introduce UltraEval-Audio, a unified evaluation framework for audio foundation models, specifically designed for both audio understanding and generation tasks. UltraEval-Audio features a modular architecture, supporting 10 languages and 14 core task categories, while seamlessly integrating 24 mainstream models and 36 authoritative benchmarks. To enhance research efficiency, the framework provides a one-command evaluation feature, accompanied by real-time public leaderboards. For the second challenge, UltraEval-Audio adopts a novel comprehensive evaluation scheme for audio codecs, evaluating performance across three key dimensions: semantic accuracy, timbre fidelity, and acoustic quality. To address the third issue, we propose two new Chinese benchmarks, SpeechCMMLU and SpeechHSK, designed to assess Chinese knowledge proficiency and language fluency. We hope that UltraEval-Audio will provide both academia and industry with a transparent, efficient, and fair platform for comparison of audio models. Our code, benchmarks, and leaderboards are available at https://github.com/OpenBMB/UltraEval-Audio.
Primary: OpenBMB
All Institutions: OpenBMB, OpenAI, Alibaba, Google, Moonshot, Xiaomi
The main contribution of this paper is the introduction of UltraEval-Audio, a unified framework for evaluating audio foundation models that addresses critical evaluation challenges and enhances research efficiency. This work is significant as it not only fills existing gaps in the evaluation landscape but also sets a foundation for future advancements in audio processing and understanding.
The proposed methodology introduces a comprehensive and modular evaluation framework that addresses critical gaps in the evaluation of audio foundation models. It effectively integrates multiple languages and tasks while providing a systematic approach to codec evaluation. The framework's design allows for easy adaptation and extensibility, which is crucial for future research.
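To give a feel for the three-dimensional codec evaluation, the record layout below sketches how such scores might be collected and reported; the field names, the example metrics in the comments, and the placeholder values are all illustrative and are not taken from UltraEval-Audio's actual schema or CLI.

```python
from dataclasses import dataclass

@dataclass
class CodecReport:
    semantic_accuracy: float   # e.g. ASR accuracy on re-synthesized audio (assumed metric)
    timbre_fidelity: float     # e.g. speaker-embedding cosine similarity (assumed metric)
    acoustic_quality: float    # e.g. an objective MOS-predictor score (assumed metric)

    def summary(self) -> dict:
        return {
            "semantic": self.semantic_accuracy,
            "timbre": self.timbre_fidelity,
            "quality": self.acoustic_quality,
        }

# Placeholder values purely for illustration of the report structure.
print(CodecReport(semantic_accuracy=0.94, timbre_fidelity=0.88, acoustic_quality=4.1).summary())
```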
The experiments conducted demonstrate the framework's capability to evaluate a diverse set of audio models and codecs across multiple benchmarks. The inclusion of new Chinese benchmarks adds significant value, addressing a notable gap in existing evaluations. The results are presented clearly, allowing for meaningful comparisons across models.
The paper provides a clear description of the framework's architecture and evaluation processes, along with a publicly accessible code repository. This enhances reproducibility, as other researchers can replicate the evaluation setup and results. However, some specifics about the experimental setup could be elaborated for complete clarity.
While the framework is comprehensive, it may still face challenges in evaluating models that operate in less common languages or dialects. Additionally, the reliance on existing benchmarks may limit the novelty of some evaluation metrics.
The UltraEval-Audio framework has the potential to significantly advance the field of audio foundation models by providing a standardized evaluation methodology. This can lead to improved model development and performance assessment, particularly in multilingual contexts.
We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
Primary: unknown
All Institutions: unknown
The paper presents the LEMAS-Dataset, a large-scale multilingual speech corpus, and demonstrates its application in training advanced generative speech models. The contributions are significant, particularly in addressing multilingual synthesis challenges and advancing the state of the art in speech generation.
The methodology presented in the paper is robust, featuring a well-structured data processing pipeline that emphasizes high-quality annotations and extensive multilingual coverage. The use of a non-autoregressive flow-matching framework for LEMAS-TTS is innovative, particularly in its application to zero-shot multilingual synthesis. The accent-adversarial training and CTC loss techniques are commendable for addressing cross-lingual accent issues, showcasing a thoughtful approach to enhancing synthesis stability. LEMAS-Edit's formulation of speech editing as a masked token infilling task is a novel contribution, leveraging precise word-level alignments effectively.
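One common realization of accent-adversarial training is a gradient-reversal layer in front of an accent classifier, so the classifier is trained normally while the encoder receives reversed gradients and learns accent-invariant features. The sketch below shows that pattern under those assumptions; it is not necessarily the exact LEMAS-TTS formulation.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def accent_adversarial_loss(features: torch.Tensor, accent_labels: torch.Tensor,
                            classifier: nn.Module, lam: float = 1.0) -> torch.Tensor:
    """features: (B, T, D) encoder outputs; classifier maps pooled features to accent logits."""
    reversed_feats = GradReverse.apply(features, lam)
    logits = classifier(reversed_feats.mean(dim=1))   # mean-pool over time
    return nn.functional.cross_entropy(logits, accent_labels)
```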
The experimental results presented are convincing, demonstrating that models trained on the LEMAS-Dataset achieve high-quality synthesis and editing performance. The validation of the dataset's effectiveness across diverse generative paradigms strengthens the paper's claims. However, the paper could benefit from more extensive comparative analyses with existing datasets and models to contextualize its contributions better.
The paper lacks detailed implementation specifics, such as hyperparameter settings and training configurations, which are crucial for reproducibility. While the results are promising, the absence of a clear reproducibility framework may hinder other researchers from replicating the findings effectively.
One limitation is the potential bias in the dataset, given that it covers only 10 major languages, which may not represent the full linguistic diversity of the global population. Additionally, the scalability of the proposed methods to other languages or dialects remains untested. The focus on accent issues, while important, may not address other aspects of multilingual synthesis that could affect performance.
The LEMAS-Dataset has the potential to significantly advance the field of multilingual speech synthesis and editing, particularly in applications requiring high-quality, diverse linguistic outputs. Its open-source nature encourages further research and development in prompt-based speech generation systems, potentially leading to more inclusive and accessible technology.
Instruct Text-to-Speech (InstructTTS) leverages natural language descriptions as style prompts to guide speech synthesis. However, existing InstructTTS methods mainly rely on a direct combination of audio-related labels or their diverse rephrasings, making it difficult to handle flexible, high-level instructions. Such rigid control is insufficient for users such as content creators who wish to steer generation with descriptive instructions. To address these constraints, we introduce OV-InstructTTS, a new paradigm for open-vocabulary InstructTTS. We propose a comprehensive solution comprising a newly curated dataset, OV-Speech, and a novel reasoning-driven framework. The OV-Speech dataset pairs speech with open-vocabulary instructions, each augmented with a reasoning process that connects high-level instructions to acoustic features. The reasoning-driven framework infers emotional, acoustic, and paralinguistic information from open-vocabulary instructions before synthesizing speech. Evaluations show that this reasoning-driven approach significantly improves instruction-following fidelity and speech expressiveness. We believe this work can inspire the next user-friendly InstructTTS systems with stronger generalization and real-world applicability. The dataset and demos are publicly available on our project page.
Primary: unknown
All Institutions: unknown
The paper introduces OV-InstructTTS, a novel paradigm that enhances InstructTTS systems by enabling flexible speech synthesis from open-vocabulary instructions through a reasoning-driven framework. This work significantly advances the state of the art in TTS, providing a comprehensive solution that combines a new dataset and innovative methodology, with promising results that could influence future research and applications in the field.
The methodology presented in this paper is robust, introducing a novel reasoning-driven framework that effectively connects high-level open-vocabulary instructions to low-level acoustic features. The construction of the OV-Speech dataset is methodical, emphasizing the importance of contextual information and reasoning chains, which are critical for improving the expressiveness and fidelity of synthesized speech. The use of large language models (LLMs) to generate instructions and reasoning chains demonstrates an innovative approach to bridging the semantic gap in TTS systems.
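A simple way to picture the dataset's pairing of instructions, reasoning chains, and acoustic targets is as a structured record; the field names and example tags below are assumptions for illustration, not the OV-Speech schema.

```python
from dataclasses import dataclass, field

@dataclass
class OVInstructSample:
    """Illustrative record pairing an open-vocabulary instruction with a reasoning
    chain and the low-level acoustic attributes it implies (assumed layout)."""
    text: str                    # content to be spoken
    instruction: str             # e.g. "sound like a tired late-night radio host"
    reasoning: str               # chain linking the instruction to acoustic targets
    emotion: str = "neutral"
    pitch: str = "low"           # coarse acoustic tags inferred by the reasoning step
    speaking_rate: str = "slow"
    paralinguistics: list = field(default_factory=list)   # e.g. ["sigh"]
```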
The experimental evaluation is thorough, utilizing both objective and subjective metrics to assess the performance of the proposed model against several strong baselines. The results indicate significant improvements in instruction following and speech naturalness, which are well-supported by the ablation studies. The comprehensive evaluation methodology enhances the credibility of the findings.
The paper provides sufficient implementation details, including model architecture, training parameters, and dataset partitioning, which facilitate reproducibility. However, the lack of specific information about the datasets used for baseline comparisons could hinder full reproducibility for those models.
One limitation is the reliance on LLMs for generating reasoning chains and instructions, which may introduce biases or inaccuracies depending on the model's training data. Additionally, the dataset's quality and diversity may affect the generalizability of the results. The paper does not address potential ethical concerns related to the use of AI-generated speech in sensitive applications.
This research has significant implications for the field of TTS, particularly in enhancing user control and expressiveness in speech synthesis. The proposed system could benefit various applications, including content creation, virtual assistants, and accessibility tools, making TTS more intuitive and user-friendly. The open-vocabulary approach could lead to more personalized and engaging interactions between users and machines.
Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MOSS Transcribe Diarize, a novel end-to-end model for SATS that significantly enhances the accuracy and reliability of speaker-attributed, time-stamped transcriptions in long-form audio contexts. This work represents a meaningful advancement in the field of audio processing and machine learning, addressing critical limitations of existing systems.
The paper introduces MOSS Transcribe Diarize, a unified multimodal large language model that performs Speaker-Attributed, Time-Stamped Transcription (SATS) in an end-to-end manner. The methodology is robust, leveraging a 128k-token context window to process long-form audio without chunking, which is a significant improvement over existing systems. The architecture combines an audio encoder with a projection module to align speaker identities with lexical content effectively. The explicit representation of temporal information as formatted timestamp text is a novel approach that enhances the accuracy of timestamp generation.
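Representing timestamps as formatted text lets a decoder-only model emit them as ordinary tokens. The helper below illustrates one such serialization; the tag layout and precision are assumptions, not the model's actual target format.

```python
def format_sats_target(segments: list) -> str:
    """Serialize speaker-attributed, time-stamped segments into plain text."""
    lines = []
    for seg in segments:
        lines.append(f"[{seg['start']:.2f}-{seg['end']:.2f}] <spk{seg['speaker']}> {seg['text']}")
    return "\n".join(lines)

target = format_sats_target([
    {"start": 0.00, "end": 3.20, "speaker": 1, "text": "Let's get started."},
    {"start": 3.25, "end": 6.80, "speaker": 2, "text": "Sure, I'll share the agenda."},
])
print(target)
```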
The experimental setup is comprehensive, utilizing diverse datasets such as AISHELL-4, Podcast, and Movies to evaluate the model's performance across various real-world scenarios. The results indicate that MOSS Transcribe Diarize consistently outperforms state-of-the-art commercial systems in terms of Character Error Rate (CER) and concatenated minimum-permutation CER (cpCER), demonstrating the effectiveness of the proposed end-to-end SATS formulation.
The paper lacks specific details regarding the implementation and availability of the model, which may hinder reproducibility. While it mentions that datasets will be open-sourced, the absence of a clear project or code repository limits the ability for other researchers to replicate the findings.
One limitation is the reliance on simulated data to augment training, which may not fully capture the complexities of real-world audio. Additionally, the paper does not address potential biases in the training data or the model's performance across different languages and accents beyond those mentioned.
The implications of this research are significant for applications in meeting transcription, legal discovery, and assistive technologies, where accurate speaker attribution and timing are crucial. The advancements presented could lead to improved accessibility and efficiency in processing multi-speaker conversations.