Whisper generation is constrained by the difficulty of data collection. Because whispered speech has low acoustic amplitude, high-fidelity recording is challenging. In this paper, we introduce WhispSynth, a large-scale multilingual corpus constructed via a novel high-fidelity generative framework. Specifically, we propose a pipeline integrating a Differentiable Digital Signal Processing (DDSP)-based pitch-free method with Text-to-Speech (TTS) models. This framework refines a comprehensive collection of resources, including our newly constructed WhispNJU dataset, into 118 hours of high-fidelity whispered speech from 479 speakers. Unlike standard synthetic or noisy real data, our data engine faithfully preserves source vocal timbre and linguistic content while ensuring acoustic consistency, providing a robust foundation for text-to-whisper research. Experimental results demonstrate that WhispSynth exhibits significantly higher quality than existing corpora. Moreover, our CosyWhisper, tuned with WhispSynth, achieves speech naturalness on par with ground-truth samples. The official implementation and related resources are available at https://github.com/tan90xx/cosywhisper.
Primary: MIT
All Institutions: MIT
The paper introduces WhispSynth, a novel framework for generating high-fidelity whispered speech, addressing critical data scarcity issues in whisper research and significantly advancing the state of the art in TTS systems. The comprehensive methodology, rigorous experimental evaluation, and potential for broader applications underscore its importance in the field.
The methodology presented in the paper is innovative, particularly in the integration of Differentiable Digital Signal Processing (DDSP) with Text-to-Speech (TTS) models to create a pitch-free whisper generation framework. The authors effectively address the challenges of whisper synthesis by developing a robust pipeline that combines existing datasets with their newly constructed WhispNJU dataset. The use of adversarial training and semi-supervised dual-focus training strategies enhances the model's ability to generate high-fidelity whispered speech, demonstrating a thoughtful approach to overcoming limitations in existing TTS systems.
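The pitch-free idea behind this kind of whisper synthesis can be illustrated with a minimal sketch: whispered speech lacks periodic (F0-driven) excitation, so broadband noise shaped by a spectral envelope stands in for the voiced glottal source. The function and envelope below are hypothetical illustrations of that general principle, not the authors' DDSP implementation.

```python
import numpy as np

def pitch_free_frame(envelope: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """Synthesize one frame of whisper-like audio: white-noise excitation
    shaped by a target magnitude spectral envelope (no F0, no harmonics)."""
    noise = np.random.randn(n_fft)        # aperiodic source signal
    spec = np.fft.rfft(noise)
    spec *= envelope                      # impose the spectral envelope
    return np.fft.irfft(spec, n=n_fft)

# Toy usage: a gentle high-frequency tilt as a stand-in envelope.
n_fft = 512
freqs = np.fft.rfftfreq(n_fft, d=1 / 16000)
envelope = 1.0 / (1.0 + freqs / 4000.0)   # hypothetical shape
frame = pitch_free_frame(envelope, n_fft)
```

Because the excitation contains no periodic component, the output carries the envelope's timbre without any pitch, which is the defining acoustic property of whisper.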
The experimental evaluation is comprehensive, with a clear focus on both subjective and objective metrics to assess the quality of the synthesized whispers. The authors provide detailed comparisons with existing datasets and methods, showcasing significant improvements in naturalness and intelligibility. The use of metrics such as DNSMOS and UTMOS, along with rigorous testing across multiple languages, strengthens the validity of their findings. However, the paper could benefit from more extensive ablation studies to further clarify the contributions of each component in their proposed framework.
The paper includes a link to the official implementation on GitHub, which is crucial for reproducibility. The authors provide sufficient details about their training and evaluation processes, including dataset splits and training settings. However, the paper could improve by including more specific hyperparameter settings and a clearer description of the data preprocessing steps to facilitate easier replication of their results.
The authors acknowledge the limitations related to the influence of non-linguistic variations on model performance and the potential impact of the audio watermarking on synthesized audio quality. Additionally, the dataset's reliance on existing corpora may introduce biases that could affect the generalizability of the findings. The lack of a systematic assessment of hardware differences in real-world applications is another notable limitation.
The work has significant implications for the fields of speech synthesis and audio processing, particularly in applications requiring whisper generation, such as ASMR content creation and secure communication systems. By providing an open-source resource and framework for whisper synthesis, the authors contribute to advancing research in this niche area, potentially enabling further developments in multilingual speech technologies.
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project Page: https://omniforcing.com
Primary: Peking University
All Institutions: JD Explore Academy, Fudan University, Peking University, The University of Hong Kong
The paper presents OmniForcing, a novel framework for real-time joint audio-visual generation that effectively addresses the challenges of latency and training instability in existing models. Its innovative methodologies and comprehensive experimental evaluations position it as a significant contribution to the field of machine learning and multimedia generation.
The proposed OmniForcing framework is a significant advancement in real-time joint audio-visual generation, addressing the latency issues of existing models through innovative techniques such as Asymmetric Block-Causal Alignment and Audio Sink Tokens with Identity RoPE. The methodology is well-structured, with a clear focus on overcoming the challenges of temporal asymmetry and training instability in dual-stream architectures. The introduction of a Joint Self-Forcing Distillation paradigm is particularly noteworthy, as it allows the model to dynamically correct cross-modal errors, enhancing the robustness of the generation process.
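As a rough illustration of the block-causal idea (not the paper's implementation; the block size and the way the global prefix is handled here are assumptions), a mask that lets every token attend to a shared prefix while keeping attention across blocks causal can be built as follows:

```python
import numpy as np

def block_causal_mask(n_tokens: int, block: int, prefix: int) -> np.ndarray:
    """Boolean mask, True where attention is allowed. Tokens attend to
    (a) a global prefix and (b) tokens in their own or earlier blocks."""
    idx = np.arange(n_tokens)
    blk = np.maximum(idx - prefix, 0) // block   # block id per token
    mask = blk[:, None] >= blk[None, :]          # causal across blocks
    mask[:, :prefix] = True                      # everyone sees the prefix
    return mask

m = block_causal_mask(n_tokens=10, block=3, prefix=2)
# With prefix=2, block=3: tokens 0-4 fall in block 0, 5-7 in block 1.
assert m[2, 4]        # same block: bidirectional attention allowed
assert not m[2, 5]    # later block: masked
```

Within a block attention is bidirectional (cheap, parallel), while the causal structure across blocks is what permits streaming generation with a rolling KV-cache.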
The experiments are comprehensive, comparing OmniForcing against both bidirectional models and cascaded autoregressive baselines. The evaluation metrics are well-defined, focusing on visual quality, audio fidelity, and real-time inference efficiency. The results demonstrate that OmniForcing achieves state-of-the-art performance, significantly reducing latency while maintaining high-quality outputs, which is crucial for real-time applications.
The paper provides detailed implementation details, including training setups and hyperparameters, which enhances reproducibility. However, the lack of a publicly available code repository or demo may hinder independent verification of results.
One limitation is the inherent trade-off between streaming capability and the full-sequence attention of the original bidirectional model, which may lead to slight reductions in consistency and synchrony compared to the teacher model. Additionally, the reliance on a specific architecture (LTX-2) may limit the generalizability of the findings to other models.
The work has significant implications for real-time applications in multimedia content creation, gaming, and interactive media, where low-latency audio-visual generation is essential. By enabling efficient streaming of synchronized audio and video, OmniForcing could facilitate advancements in various fields, including virtual reality and live performance technologies.
The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a novel reward modeling framework, SDiaReward, which significantly improves the evaluation of spoken dialogue systems by addressing modality and colloquialness gaps through a data-driven approach. This work is pivotal in advancing the state-of-the-art in dialogue systems, providing a comprehensive methodology and robust experimental validation that could influence future research and applications in the field.
The paper introduces a novel reward modeling framework, SDiaReward, which effectively addresses the modality and colloquialness gaps in spoken dialogue systems. The methodology is well-structured, utilizing a pairwise preference learning approach to train on a specifically curated dataset (SDiaReward-Dataset) that captures the nuances of natural speech. The integration of multimodal LLMs for reward prediction, along with the establishment of ESDR-Bench for benchmarking, showcases a comprehensive approach to improving dialogue evaluation metrics.
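Pairwise preference supervision of this kind is commonly optimized with a Bradley-Terry style objective; the paper's exact loss is not reproduced here, but a minimal sketch, assuming the model emits a scalar reward per episode, looks like this:

```python
import math

def pairwise_preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).
    Minimized when the preferred episode's reward exceeds the other's."""
    margin = r_chosen - r_rejected
    return math.log(1.0 + math.exp(-margin))

# The loss shrinks as the preferred episode's reward pulls ahead,
# and equals log 2 when the evaluator cannot tell the pair apart.
assert pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.5, 0.0)
```

A reward model trained this way never needs absolute quality labels, only which of two episodes annotators preferred, which matches the episode-level preference pairs described above.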
The experimental evaluation is robust, demonstrating the superiority of SDiaReward over existing general-purpose audio LLMs through a series of well-defined metrics, including pairwise preference accuracy across various datasets. The results indicate significant improvements in capturing paralinguistic features and conversational spontaneity, which are critical for realistic spoken dialogue systems. The use of both micro and macro averages for accuracy assessment provides a nuanced understanding of model performance across different data regimes.
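The micro/macro distinction referenced here can be made concrete with a small sketch (the subset names and counts are hypothetical):

```python
def micro_macro_accuracy(correct_by_set: dict, total_by_set: dict):
    """correct_by_set/total_by_set: dicts keyed by evaluation subset.
    Micro pools all pairs together; macro averages each subset's
    accuracy with equal weight, regardless of subset size."""
    micro = sum(correct_by_set.values()) / sum(total_by_set.values())
    macro = sum(correct_by_set[k] / total_by_set[k]
                for k in total_by_set) / len(total_by_set)
    return micro, macro

# A large, easy subset inflates micro accuracy relative to macro.
micro, macro = micro_macro_accuracy({"a": 90, "b": 5}, {"a": 100, "b": 10})
```

Reporting both therefore separates "how many pairs were judged correctly overall" from "how evenly the model performs across data regimes".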
The paper includes detailed implementation specifics, including model architecture, training procedures, and dataset construction methods. The availability of code and data enhances reproducibility, although the actual primary institution is not specified, which could limit the ability to verify institutional affiliations and resources.
The paper acknowledges limitations related to the dataset's focus on "in-the-wild" recordings, which may affect the model's robustness in more controlled environments. Additionally, the potential for domain-specific biases in reward scoring is noted, suggesting that further refinements are necessary for broader applicability.
The work has significant implications for the development of more natural and effective spoken dialogue systems, which can enhance human-AI interactions across various applications, including customer service, education, and entertainment. By addressing critical gaps in existing models, this research paves the way for future advancements in dialogue systems that can better understand and generate human-like speech.
Target speech extraction (TSE) aims to recover a target speaker's voice from a mixture. While recent text-prompted approaches have shown promise, most approaches assume fully overlapped mixtures, limiting insight into behavior across realistic overlap ratios. We introduce VorTEX (Various overlap ratio for Target speech EXtraction), a text-prompted TSE architecture with a Decoupled Adaptive Multi-branch (DAM) Fusion block that separates primary extraction from auxiliary regularization pathways. To enable controlled analysis, we construct PORTE, a two-speaker dataset spanning overlap ratios from 0% to 100%. We further propose Suppression Ratio on Energy (SuRE), a diagnostic metric that detects suppression behavior not captured by conventional measures. Experiments show that existing models exhibit suppression or residual interference under overlap, whereas VorTEX achieves the highest separation fidelity across 20-100% overlap (e.g., 5.50 dB at 20% and 2.04 dB at 100%) while maintaining zero SuRE, indicating robust extraction without suppression-driven artifacts.
Primary: Chung-Ang University
All Institutions: Chung-Ang University
The main contribution of this work is the introduction of VorTEX, a robust text-prompted TSE model that effectively addresses the challenges of varying overlap ratios in speech mixtures, alongside the creation of the PORTE dataset for evaluating TSE performance. This research significantly advances the field of audio processing by providing a novel architecture and evaluation framework that can lead to improved speech extraction in practical applications.
The paper introduces VorTEX, a novel architecture for target speech extraction (TSE) that utilizes a Decoupled Adaptive Multi-branch (DAM) Fusion block to separate extraction and regularization pathways. This approach is innovative as it addresses the limitations of existing models that primarily focus on fully overlapped mixtures, thus enhancing the robustness of TSE across various overlap ratios. The proposed methodology is well-structured, with a clear explanation of the DAM architecture and its components, including Multi-Scale Fusion, Adaptive Fusion, and Dual Projection Fusion. The introduction of the PORTE dataset is a significant contribution, providing a controlled environment for evaluating TSE models under realistic conditions.
The experiments conducted are thorough, comparing VorTEX against established models in the field, such as AudioSep and DGMO, as well as other text-prompted TSE models like StyleTSE and LLM-TSE. The use of multiple evaluation metrics, including SISDR, PESQ, and the newly proposed SuRE metric, allows for a comprehensive assessment of model performance. The results demonstrate VorTEX's superior extraction fidelity and robustness, particularly in high-overlap scenarios, validating the effectiveness of the proposed architecture.
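SISDR (scale-invariant signal-to-distortion ratio), one of the evaluation metrics named above, has a standard closed form: project the estimate onto the reference and compare target energy to residual energy. A minimal NumPy sketch:

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray) -> float:
    """Scale-invariant SDR in dB between an estimate and a reference."""
    ref = ref - ref.mean()
    est = est - est.mean()
    s_target = (est @ ref) / (ref @ ref) * ref   # optimal scaling of ref
    e_noise = est - s_target                     # everything not explained
    return 10.0 * np.log10((s_target @ s_target) / (e_noise @ e_noise))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
# Scale invariance: an amplified copy of the reference scores near-perfectly,
# while an estimate buried in independent noise scores far lower.
assert si_sdr(3.0 * ref, ref) > 60.0
assert si_sdr(ref + rng.standard_normal(16000), ref) < 10.0
```

Note that SISDR alone cannot flag over-suppression (an output that simply goes quiet), which is the gap the proposed SuRE metric is designed to fill.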
The paper provides sufficient details regarding the architecture, training configuration, and evaluation metrics, which would allow for reproducibility. However, the lack of a public repository or demo URL limits the ease of access for other researchers to replicate the results.
While the study presents significant advancements, it relies on a synthetic dataset (PORTE), which may not fully capture the complexities of real-world conversational audio. Additionally, the prompts used for TSE are limited to observable attributes, suggesting that future work could explore more complex prompt structures. The paper also acknowledges the need for further research to develop metrics that comprehensively assess extraction fidelity, perceptual quality, and speaker preservation.
The findings of this research have the potential to improve applications in speech recognition, assistive technologies, and audio processing systems where clear target speech extraction is crucial. By addressing the challenges of overlapping speech, VorTEX could enhance user experiences in various real-world scenarios, such as in crowded environments or during multi-speaker conversations.
Existing video-to-audio (V2A) generation methods predominantly rely on text prompts alongside visual information to synthesize audio. However, two critical bottlenecks persist: semantic granularity gaps in training data, such as conflating acoustically distinct sounds under coarse labels, and textual ambiguity in describing micro-acoustic features. These bottlenecks make it difficult to perform fine-grained sound synthesis using text-controlled modes. To address these limitations, we propose AC-Foley, an audio-conditioned V2A model that directly leverages reference audio to achieve precise and fine-grained control over generated sounds. This approach enables fine-grained sound synthesis, timbre transfer, zero-shot sound generation, and improved audio quality. By directly conditioning on audio signals, our approach bypasses the semantic ambiguities of text descriptions while enabling precise manipulation of acoustic attributes. Empirically, AC-Foley achieves state-of-the-art performance for Foley generation when conditioned on reference audio, while remaining competitive with state-of-the-art video-to-audio methods even without audio conditioning.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of AC-Foley, a novel audio-conditioned framework for video-to-audio generation that enables precise acoustic control through direct audio conditioning. This work significantly advances the state of the art in audio synthesis by addressing key challenges in fine-grained sound generation and multimodal integration, paving the way for innovative applications in creative sound design.
The methodology presented in AC-Foley is robust, leveraging a two-stage training framework that effectively addresses the challenges of temporal alignment and acoustic fidelity in video-to-audio synthesis. The integration of reference audio as a conditioning mechanism is a significant innovation, allowing for precise control over generated sounds and overcoming the limitations of text-based prompts. The use of multimodal transformers to unify video, audio, and text modalities is well-justified and enhances the model's performance. The paper also provides a clear explanation of the conditional flow matching objective and the audio control module, which are critical to the success of the proposed method.
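Conditional flow matching, in its common linear-path form (one standard instantiation; the paper's exact probability path and conditioning are not reproduced here), trains a network to regress the velocity along an interpolation between noise and data:

```python
import numpy as np

def cfm_loss(model, x0: np.ndarray, x1: np.ndarray, t: float) -> float:
    """Linear-path conditional flow matching: x_t = (1-t)*x0 + t*x1,
    target velocity u = x1 - x0; loss is the MSE between model(x_t, t)
    and u. At sampling time, integrating the learned velocity field
    carries noise x0 to a data sample x1."""
    x_t = (1.0 - t) * x0 + t * x1
    u = x1 - x0
    v = model(x_t, t)
    return float(np.mean((v - u) ** 2))

# An oracle returning the true velocity drives the loss to zero.
x0, x1 = np.zeros(4), np.ones(4)
oracle = lambda x_t, t: x1 - x0
assert cfm_loss(oracle, x0, x1, t=0.3) == 0.0
```

In a system like the one described, the conditioning (video, text, and reference audio embeddings) would enter through the model's inputs; only the regression target above is the flow matching objective itself.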
The experimental evaluation is comprehensive, utilizing a variety of metrics to assess performance across several dimensions, including distribution matching, semantic alignment, temporal synchronization, and spectral fidelity. The results demonstrate that AC-Foley outperforms existing methods in multiple aspects, showcasing its effectiveness in generating high-quality audio that is temporally and semantically aligned with video content. The inclusion of human studies adds a valuable subjective evaluation component, further validating the model's performance.
The paper provides detailed implementation details, including training strategies, dataset descriptions, and evaluation metrics, which enhance reproducibility. However, the lack of a publicly available code repository or demo URL limits the ability of other researchers to replicate the findings directly.
The paper acknowledges limitations in handling complex auditory environments, particularly when multiple sound sources overlap or when there are extreme temporal mismatches between reference sounds and visual content. These factors may hinder the model's ability to generate optimal audio in certain scenarios.
The proposed AC-Foley framework has significant implications for the fields of sound design and multimedia content creation, enabling artists and creators to achieve precise audio synthesis that aligns closely with visual elements. Its potential applications extend to film, gaming, and virtual reality, where high-quality audio generation is crucial for immersive experiences.
Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break the traditional symmetric fusion paradigm. Instead, we leverage textual transcripts, which inherently encode a stable, time-independent semantic prior, as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that utilizes these robust textual queries to actively filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by inevitably missing data in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and maintains robust predictive resilience under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a robust multimodal framework for estimating emotional mimicry intensity that effectively integrates textual anchors to mitigate the impact of noisy signals. This work represents a meaningful advancement in the field of affective computing, particularly in its approach to handling real-world challenges in multimodal data processing.
The proposed TAEMI framework introduces a novel approach to emotional mimicry intensity estimation by leveraging a Text-Anchored Dual Cross-Attention mechanism. This method effectively addresses the challenges posed by noisy and missing data in multimodal inputs, which is a significant improvement over traditional symmetric fusion methods. The integration of Learnable Missing-Modality Tokens and Modality Dropout during training is particularly innovative, as it enhances the model's robustness in real-world scenarios. However, the paper could benefit from a more detailed explanation of the attention mechanism and how it specifically interacts with the different modalities.
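In outline, the text-anchored cross-attention the abstract describes can be sketched as below: stable textual queries attend over noisy frame-level audio or visual features and pool them into a text-aligned representation. This is a single-head sketch with projection matrices omitted, an assumed simplification rather than the paper's implementation.

```python
import numpy as np

def cross_attention(text_q, modality_kv):
    """Text-anchored cross-attention sketch: textual queries (Tq, d) attend
    over frame-level modality features (Tk, d) and return one pooled
    vector per query. Single head, no learned projections."""
    d = text_q.shape[-1]
    scores = text_q @ modality_kv.T / np.sqrt(d)          # (Tq, Tk)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                    # softmax over frames
    return w @ modality_kv                                # (Tq, d)
```

Because attention weights are normalized over frames, redundant or noisy frames receive low weight relative to frames that match the textual query, which is the filtering behavior the abstract attributes to the mechanism.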
The experiments conducted on the Hume-Vidmimic2 dataset are comprehensive, showcasing the framework's ability to capture fine-grained emotional variations. The reported state-of-the-art mean Pearson correlation coefficient across six emotional dimensions indicates strong performance. However, the paper lacks a thorough comparison with a wider range of baseline methods, which could provide a clearer context for the claimed improvements. Additionally, the absence of subjective evaluations or qualitative assessments of the model's outputs limits the understanding of its practical effectiveness.
The paper does not provide sufficient details regarding the implementation, hyperparameter settings, or the training process, which raises concerns about reproducibility. Including a clear methodology section with code availability or supplementary materials would significantly enhance the reproducibility of the results.
One notable limitation is the reliance on textual transcripts as anchors, which may not always be available or accurate in real-world applications. Additionally, while the model performs well under controlled conditions, its effectiveness in highly variable environments remains to be fully validated. The potential for overfitting to the training dataset is also a concern, particularly given the complexity of the model.
The implications of this research are significant for affective computing and applications in human-computer interaction, where understanding emotional states is crucial. The framework could be applied in various domains, including mental health monitoring, social robotics, and interactive entertainment. However, ethical considerations regarding data privacy and the potential for misuse in surveillance or manipulation should be addressed.
Extracting a target source from underdetermined mixtures is challenging for beamforming approaches. Recently proposed time-frequency-bin-wise switching (TFS) and linear combination (TFLC) strategies mitigate this by combining multiple beamformers in each time-frequency (TF) bin and choosing combination weights that minimize the output power. However, making this decision independently for each TF bin can weaken temporal-spectral coherence, causing discontinuities and consequently degrading extraction performance. In this paper, we propose a novel neural network-based time-frequency-bin-wise linear combination (NN-TFLC) framework that constructs minimum power distortionless response (MPDR) beamformers without explicit noise covariance estimation. The network encodes the mixture and beamformer outputs, and predicts temporally and spectrally coherent linear combination weights via a cross-attention mechanism. On dual-microphone mixtures with multiple interferers, NN-TFLC-MPDR consistently outperforms TFS/TFLC-MPDR and achieves competitive performance with TFS/TFLC built on the minimum variance distortionless response (MVDR) beamformers that require noise priors.
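The per-bin minimum-power decision the abstract describes can be illustrated, in its simplest switching (TFS) form, as follows. This is a hedged sketch, not the authors' implementation; TFLC would instead solve for combination weights per bin, and NN-TFLC replaces this independent per-bin decision with learned, temporally and spectrally coherent weights.

```python
import numpy as np

def tfs_select(beamformer_outputs):
    """Time-frequency-bin-wise switching (TFS): for each TF bin, keep the
    output of whichever beamformer has the minimum power there.
    beamformer_outputs: complex array of shape (n_beamformers, F, T)."""
    power = np.abs(beamformer_outputs) ** 2            # (B, F, T)
    best = np.argmin(power, axis=0)                    # quietest beamformer per bin
    return np.take_along_axis(beamformer_outputs, best[None], axis=0)[0]
```

Because each bin is decided independently, adjacent bins may switch between beamformers, which is exactly the source of the discontinuities the paper sets out to fix.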
Primary: unknown
All Institutions: unknown
This paper presents a novel neural network-based framework for target source extraction from underdetermined mixtures, significantly advancing the field of audio signal processing. The methodology effectively combines traditional beamforming concepts with modern neural network techniques, yielding promising results that could enhance various audio applications.
The proposed NN-TFLC framework introduces a neural network-based approach to time-frequency-bin-wise linear combination of beamformers, addressing the limitations of traditional methods by utilizing a cross-attention mechanism to maintain temporal-spectral coherence. The methodology is well-structured, leveraging existing concepts in beamforming while innovatively applying neural networks to enhance performance in underdetermined scenarios. The use of inplace convolutional gated linear units and Bi-LSTM for temporal context modeling is particularly noteworthy, as it allows for effective feature extraction without losing time-frequency resolution.
The experiments are robust, employing a comprehensive dataset synthesized from clean utterances and simulating realistic acoustic environments. The paper provides a thorough comparison with baseline methods, demonstrating consistent improvements in SI-SDR and SI-SIR metrics across various scenarios. The results are well-presented, with clear tables and visualizations that effectively illustrate the advantages of the proposed method over existing techniques.
While the paper outlines the methodology and experimental setup in detail, it lacks specific implementation details such as code availability or links to datasets, which could hinder reproducibility. The absence of a demo or project URL further limits the ability for others to validate the findings.
One limitation is the reliance on dual-microphone setups, which may not generalize well to more complex array configurations. Additionally, the performance in real-world scenarios with varying noise conditions and dynamic environments remains to be evaluated. The model's scalability to larger numbers of microphones or sources also warrants further investigation.
The proposed method has significant implications for real-time audio processing applications, such as telecommunications, hearing aids, and assistive listening devices, where effective source separation is crucial. Its ability to operate without explicit noise covariance estimation could simplify deployment in practical scenarios. The framework's adaptability to various input configurations also suggests potential for broader applications in multi-source audio environments.
We present the CodecMOS-Accent dataset, a mean opinion score (MOS) benchmark designed to evaluate neural audio codec (NAC) models and the large language model (LLM)-based text-to-speech (TTS) models trained upon them, especially across non-standard speech like accented speech. The dataset comprises 4,000 codec resynthesis and TTS samples from 24 systems, featuring 32 speakers spanning ten accents. A large-scale subjective test was conducted to collect 19,600 annotations from 25 listeners across three dimensions: naturalness, speaker similarity, and accent similarity. This dataset not only represents an up-to-date study of recent speech synthesis system performance but also reveals insights including a tight relationship between speaker and accent similarity, the predictive power of objective metrics, and a perceptual bias when listeners share the same accent with the speaker. The dataset is expected to foster research on more human-centric evaluation for NAC and accented TTS.
Primary: National Institute Of Information And Communications Technology
All Institutions: Nagoya University, National Institute Of Information And Communications Technology, University of Edinburgh
The main contribution of this paper is the introduction of the CodecMOS-Accent dataset, which provides a comprehensive benchmark for evaluating neural audio codecs and TTS systems across various English accents. This work significantly advances the understanding of how these systems perform with non-standard speech, paving the way for more inclusive and effective speech synthesis technologies.
The methodology is robust, focusing on the creation of the CodecMOS-Accent dataset, which includes a large-scale subjective evaluation of neural audio codecs and TTS systems across various English accents. The authors employed a well-structured listening test design, ensuring a diverse representation of accents and a significant number of annotations, which enhances the credibility of their findings. The inclusion of both subjective and objective evaluation metrics is commendable, providing a comprehensive understanding of the performance of the evaluated systems.
The experiments are thorough, with a clear focus on evaluating the performance of various TTS and NAC systems using a well-defined dataset. The analysis of subjective scores against objective metrics offers valuable insights into the relationship between human perception and automated evaluations. The findings regarding the correlation between accent similarity and speaker identity are particularly noteworthy, indicating a deeper understanding of the nuances in speech synthesis.
The paper lacks specific implementation details or links to the dataset and models used, which could hinder reproducibility. While the methodology is described in detail, providing access to the dataset and models would significantly enhance the ability of other researchers to replicate the study.
One limitation is the potential bias introduced by the listener demographics, as most listeners were from the US, which may affect the generalizability of the results. Additionally, the reliance on subjective evaluations may introduce variability based on listener preferences and experiences. The authors acknowledge that the dataset will be made public, but the timeline for this release is not specified.
This work has significant implications for the development of more human-centric evaluation methods in speech synthesis, particularly for accented speech. The findings could influence future research directions in TTS and NAC systems, promoting the need for diverse training data and evaluation metrics that account for cultural and linguistic variations. The dataset itself is expected to serve as a valuable resource for researchers aiming to improve the quality and naturalness of synthesized speech across different accents.
Existing accent normalization methods do not typically offer control over accent strength, yet many applications--such as language learning and dubbing--require tunable accent retention. We propose DLM-AN, a controllable accent normalization system built on masked discrete diffusion over self-supervised speech tokens. A Common Token Predictor identifies source tokens that likely encode native pronunciation; these tokens are selectively reused to initialize the reverse diffusion process. This provides a simple yet effective mechanism for controlling accent strength: reusing more tokens preserves more of the original accent. DLM-AN further incorporates a flow-matching Duration Ratio Predictor that automatically adjusts the total duration to better match the native rhythm. Experiments on multi-accent English data show that DLM-AN achieves the lowest word error rate among all compared systems while delivering competitive accent reduction and smooth, interpretable accent strength control.
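The token-reuse control knob can be illustrated with a minimal sketch. The top-k thresholding rule and the score source here are assumptions for illustration, not the paper's exact procedure; in DLM-AN the scores would come from the Common Token Predictor.

```python
import numpy as np

def init_with_reuse(source_tokens, predictor_scores, keep_ratio, mask_id=-1):
    """Initialize the reverse-diffusion sequence: reuse the source tokens
    scored most likely to be reusable by a predictor, and mask the rest
    for regeneration. keep_ratio in [0, 1] tunes accent retention:
    reusing more source tokens preserves more of the original accent."""
    n_keep = int(round(keep_ratio * len(source_tokens)))
    order = np.argsort(predictor_scores)[::-1]        # highest-scoring first
    out = np.full_like(source_tokens, mask_id)
    out[order[:n_keep]] = source_tokens[order[:n_keep]]
    return out
```

Setting `keep_ratio=0` regenerates every token (full normalization), while `keep_ratio=1` returns the source sequence unchanged, giving the smooth, interpretable control the abstract claims.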
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Nanjing University, School of Intelligence Science and Technology, Shenzhen Loop Area Institute, Tencent Ethereal Audio Lab
The paper presents DLM-AN, a controllable accent normalization system that effectively balances accent retention and content preservation through innovative methodologies. This contribution is significant for advancing the field of speech processing, particularly in applications requiring nuanced accent control.
The proposed DLM-AN system introduces a novel approach to accent normalization by leveraging masked discrete diffusion and a Common Token Predictor (CTP) to control accent strength. The methodology effectively combines self-supervised speech tokens with a flow-matching Duration Ratio Predictor, allowing for nuanced control over both accent retention and speech rhythm. The use of a bidirectional Transformer for token prediction and the iterative generation process enhances the model's ability to produce high-quality outputs while maintaining phonetic integrity. However, the reliance on a recognition-based token encoder may introduce errors that could affect performance on heavily accented inputs.
The experiments conducted on multi-accent English data demonstrate that DLM-AN achieves the lowest word error rate (WER) among competing systems while maintaining competitive naturalness and accent reduction. The evaluation metrics include both subjective assessments (MUSHRA tests for naturalness and accentedness) and objective measures (WER, Speaker Encoding Cosine Similarity, and phonetic posteriorgram distance), providing a comprehensive view of the system's performance. The results indicate that the proposed method effectively balances accent normalization and content preservation.
The paper provides a detailed description of the experimental setup, including datasets, training procedures, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly available code repository limits the ability of other researchers to replicate the results fully. The authors mention using specific models and datasets, but without access to the exact implementations, some aspects may be challenging to reproduce.
The paper acknowledges several limitations, including the potential for recognition errors in the token encoder, which can degrade conversion quality for heavily accented inputs. Additionally, the current system relies on a K-Means tokenizer, which may not capture the full phonetic richness necessary for optimal performance. Future work could explore incorporating L2-accented data and improving the tokenizer to enhance robustness.
The DLM-AN system has significant implications for applications in language learning, dubbing, and personalized text-to-speech systems, where controllable accent normalization is crucial. By enabling users to adjust accent strength, the system can facilitate better communication and understanding in multilingual contexts. The research contributes to the broader field of speech processing and accent conversion, paving the way for more sophisticated and user-friendly audio technologies.
Chain-of-thought (CoT) prompting has been extended to large audio-language models (LALMs) to elicit reasoning, yet enhancing its effectiveness without training remains challenging. We study inference-time model steering as a training-free approach to improve LALM reasoning. We introduce three strategies using diverse information sources and evaluate them across four LALMs and four benchmarks. Results show general accuracy gains up to 4.4% over CoT prompting. Notably, we identify a cross-modal transfer where steering vectors derived from few text samples effectively guide speech-based reasoning, demonstrating high data efficiency. We also examine hyperparameter sensitivity to understand the robustness of these approaches. Our findings position model steering as a practical direction for strengthening LALM reasoning.
Primary: National Taiwan University
All Institutions: National Taiwan University
The paper presents a training-free model steering framework that enhances Chain-of-Thought reasoning in large audio-language models. Its innovative approach, comprehensive experimental evaluation, and potential for broader applications position it as a significant contribution to the field of machine learning and audio processing.
The paper introduces a novel approach to enhance Chain-of-Thought (CoT) reasoning in large audio-language models (LALMs) through a training-free model steering method. The methodology is well-structured, comprising two phases: extraction of steering vectors and their injection during inference. The three proposed strategies--Vanilla Steering, Speech-derived Generalized Steering (SGS), and Text-derived Generalized Steering (TGS)--are innovative in their approach to leverage existing data without requiring additional training. The use of generalized steering vectors for improving reasoning across different modalities is particularly noteworthy, showcasing a solid understanding of the limitations of current LALMs.
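The extract-then-inject recipe underlying all three strategies can be sketched as the standard activation-steering pattern below. The extraction site, layer choice, and scaling are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Derive a steering vector as the mean hidden-state difference between
    examples that elicit the target behavior (e.g. CoT-style reasoning)
    and examples that do not. Shapes: (n_examples, hidden_dim)."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(hidden, vec, alpha=1.0):
    """At inference time, nudge a layer's hidden state along the vector.
    alpha controls steering strength -- the hyperparameter whose
    sensitivity the paper examines."""
    return hidden + alpha * vec
```

The cross-modal transfer result corresponds to computing `steering_vector` from text-prompt activations and applying it during speech-based inference.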
The experiments are comprehensive, involving four advanced LALMs and multiple benchmarks, which provide a robust evaluation of the proposed methods. The results demonstrate consistent improvements in accuracy over CoT prompting, with a maximum gain of 4.4%. The comparison with baselines, including self-consistency, is well-executed, highlighting the efficiency of the proposed methods. The analysis of hyperparameter sensitivity and data efficiency adds depth to the experimental evaluation, indicating a thorough investigation of the methods' robustness.
While the paper provides a clear description of the methodology and experimental setup, it lacks specific implementation details that would facilitate reproducibility, such as code availability or links to datasets used. Providing such resources would significantly enhance the reproducibility of the findings.
One limitation noted is the sensitivity of Vanilla Steering to hyperparameters, which can lead to instability in predictions. Additionally, the reliance on auxiliary datasets for SGS and TGS may limit the applicability of these methods in scenarios where such data is not readily available. The paper could benefit from a more detailed discussion on the potential trade-offs between accuracy and computational efficiency.
The proposed methods have significant implications for the development of more efficient and effective reasoning capabilities in LALMs, which could enhance their applicability in various real-world tasks, such as interactive auditory intelligence and spoken reasoning applications. The findings could lead to advancements in multimodal AI systems, improving their ability to understand and process complex auditory information.
Despite the strong performance of large audio language models (LALMs) in various tasks, exactly how and where they integrate acoustic features with textual context remains unclear. We adapt causal tracing to investigate the internal information flow of LALMs during audio comprehension. By conducting layer-wise and token-wise analyses across DeSTA, Qwen, and Voxtral, we evaluate the causal effects of individual hidden states. Layer-wise analysis identifies different fusion strategies, from progressive integration in DeSTA to abrupt late-stage fusion in Qwen. Token-wise analysis shows that the final sequence token acts as an informational bottleneck where the network decisively retrieves relevant information from the audio. We also observe an attention-like query mechanism at intermediate token positions that triggers the model to pull task-relevant audio context. These findings provide a clear characterization of when and where multi-modal integration occurs within LALMs.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, National Taiwan University
The main contribution of this paper is the introduction of causal tracing to analyze audio-text fusion in LALMs, revealing critical insights into the integration mechanisms of these models. This work stands out for its innovative approach and potential to influence future research directions in multimodal machine learning.
The paper employs causal tracing to analyze the internal workings of large audio language models (LALMs), which is a novel approach in the context of audio-text fusion. The methodology includes both layer-wise and token-wise analyses, allowing for a comprehensive understanding of how acoustic features and textual context are integrated. This dual analysis is well-justified and provides a clear framework for understanding the model's behavior, making it a significant methodological contribution.
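The core patching operation of causal tracing can be sketched as below. Here `model_fn` is a hypothetical stand-in for a full forward pass from hidden states to an output score, and using a simple output difference as the restoration metric is an assumed simplification.

```python
import numpy as np

def causal_effect(model_fn, clean_states, corrupt_states, layer, token):
    """Causal tracing sketch: rerun the corrupted forward pass, but patch in
    the clean hidden state at one (layer, token) position, and report how
    much the output changes. A large effect marks that hidden state as
    causally important for the prediction.
    states: list of arrays, one per layer, each of shape (n_tokens, dim).
    model_fn(states) -> scalar output (e.g. probability of the answer)."""
    patched = [s.copy() for s in corrupt_states]
    patched[layer][token] = clean_states[layer][token]
    return model_fn(patched) - model_fn(corrupt_states)
```

Sweeping this measurement over all layers and token positions yields the layer-wise and token-wise effect maps from which the fusion-strategy and bottleneck observations are read off.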
The experiments are well-structured, utilizing multiple LALMs (DeSTA, Qwen, and Voxtral) to validate the findings. The results indicate varying fusion strategies among models, which is a critical insight for future research. However, the paper could benefit from more extensive datasets or benchmarks to further substantiate the findings, particularly in real-world applications.
The paper lacks detailed implementation specifics, which may hinder reproducibility. While the methodology is sound, the absence of code or datasets makes it challenging for other researchers to replicate the study. Providing a GitHub repository or supplementary materials would enhance reproducibility.
One limitation is the focus on only three models, which may not represent the full spectrum of LALMs. Additionally, the analysis primarily focuses on the internal mechanisms without extensive evaluation of the models' performance on downstream tasks, which could provide a more holistic view of their utility.
The findings have significant implications for the development of more effective multimodal models that can better integrate audio and textual information. This research could inform future designs of LALMs, leading to advancements in applications such as audio understanding, question answering, and other areas where audio and text converge.
Recent advances in text-to-audio generation enable models to translate natural-language descriptions into diverse musical output. However, the robustness of these systems under semantically equivalent prompt variations remains largely unexplored. Small linguistic changes may lead to substantial variation in generated audio, raising concerns about reliability in practical use. In this study, we evaluate the semantic fragility of text-to-audio systems under controlled prompt perturbations. We selected MusicGen-small, MusicGen-large, and Stable Audio 2.5 as representative models, and we evaluated them under Minimal Lexical Substitution (MLS), Intensity Shifts (IS), and Structural Rephrasing (SR). The proposed dataset contains 75 prompt groups designed to preserve semantic intent while introducing localized linguistic variation. Generated outputs are compared through complementary spectral, temporal, and semantic similarity measures, enabling robustness analysis across multiple representational levels. Experimental results show that larger models achieve improved semantic consistency, with MusicGen-large reaching cosine similarities of 0.77 under MLS and 0.82 under IS. However, acoustic and temporal analyses reveal persistent divergence across all models, even when embedding similarity remains high. These findings indicate that fragility arises primarily during semantic-to-acoustic realization rather than multi-modal embedding alignment. Our study introduces a controlled framework for evaluating robustness in text-to-audio generation and highlights the need for multi-level stability assessment in generative audio systems.
Primary: Northwestern University
All Institutions: Northwestern University
This paper makes a meaningful contribution by systematically evaluating the robustness of text-to-audio generation systems under controlled prompt variations, revealing critical insights into model performance and the need for improved stability in generative audio systems. The comprehensive methodology and rigorous experimental design underscore its significance in advancing the field of machine learning and audio generation.
The paper introduces a novel framework for evaluating semantic fragility in text-to-audio generation systems, employing a systematic approach to assess model robustness under controlled prompt perturbations. The methodology is well-structured, utilizing three distinct perturbation categories (Minimal Lexical Substitution, Intensity Shifts, and Structural Rephrasing) to evaluate the models. The use of a dataset designed to maintain semantic intent while varying linguistic structure is a strong point, as it allows for a focused analysis of model sensitivity. The combination of multiple evaluation metrics (log-Mel spectrogram distance, MFCC-based Dynamic Time Warping, and CLAP embedding similarity) provides a comprehensive view of the models' performance across different dimensions of audio generation.
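The semantic-consistency numbers reported (e.g. 0.77 under MLS) can be read as a cosine similarity between embeddings of the outputs generated from the original and perturbed prompts. The sketch below assumes the embedding model (e.g. CLAP) is given; it illustrates the metric, not the authors' evaluation pipeline.

```python
import numpy as np

def semantic_consistency(emb_original, emb_perturbed):
    """Cosine similarity between embeddings (e.g. CLAP) of audio generated
    from a prompt and from its semantically equivalent perturbation.
    1.0 means identical direction; lower values indicate fragility."""
    num = float(np.dot(emb_original, emb_perturbed))
    den = float(np.linalg.norm(emb_original) * np.linalg.norm(emb_perturbed))
    return num / den
```

The paper's key observation is that this embedding-level score can stay high while spectral and temporal distances diverge, which is why it argues for multi-level stability assessment.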
The experiments are rigorously conducted, with clear definitions of the perturbation types and a well-defined dataset. The results indicate that larger models tend to show improved robustness, which is consistent with existing literature on model scaling. The statistical analysis, including paired-sample t-tests and effect sizes, adds credibility to the findings. However, the paper could benefit from more extensive qualitative evaluations involving multiple listeners to complement the quantitative metrics.
The paper provides a detailed description of the experimental setup, including the dataset construction process and evaluation metrics. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider sharing the dataset and evaluation framework to facilitate further research in this area.
One limitation of the study is the relatively small size of the dataset, which may not capture the full range of linguistic variations encountered in real-world applications. Additionally, while the paper highlights the importance of multi-level evaluation, it does not explore potential methods for improving robustness in audio generation systems, which could be a valuable direction for future research.
The findings of this study have significant implications for the development of more reliable text-to-audio generation systems, particularly in creative industries where semantic fidelity is crucial. By identifying the fragility of current models, the research encourages the exploration of more robust architectures and evaluation frameworks, potentially leading to advancements in AI-assisted music production and interactive media.
In existing Audio-Visual Speech Enhancement (AVSE) methods, objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE) are widely used; however, they often correlate poorly with perceptual quality and provide limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with a Large Language Model (LLM)-based interpretable reward model. An audio LLM generates natural language descriptions of enhanced speech, which are converted by a sentiment analysis model into a 1-5 rating score serving as the PPO reward for fine-tuning a pretrained AVSE model. Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes improvements in speech quality. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms a supervised baseline and a DNSMOS-based RL baseline in PESQ, STOI, neural quality metrics, and subjective listening tests.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of California Irvine
The main contribution of this paper is the introduction of an LLM-guided reinforcement learning framework for audio-visual speech enhancement, which enhances interpretability and aligns model training with human perceptual quality assessments. This work represents a significant advancement in the field, combining innovative methodologies with rigorous experimental validation to address longstanding challenges in speech enhancement.
The proposed methodology introduces a novel reinforcement learning framework that leverages Large Language Models (LLMs) to generate interpretable reward signals for audio-visual speech enhancement. The integration of natural language descriptions as rewards represents a significant departure from traditional scalar metrics, enhancing the interpretability and alignment of the training process with human perception. The use of sentiment analysis to convert these descriptions into numerical scores for reinforcement learning is innovative and adds depth to the reward model design. However, the methodology could benefit from a more detailed exploration of the LLM's limitations and the potential for more advanced models to enhance the reward generation process.
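The description-to-reward step can be illustrated with a minimal sketch. The linear normalization below is an assumption; the paper's exact mapping from the 1-5 sentiment rating to the PPO reward is not reproduced here.

```python
def rating_to_reward(rating, lo=1.0, hi=5.0):
    """Map a 1-5 quality rating (parsed by the sentiment-analysis model from
    the audio LLM's natural-language description of the enhanced speech)
    to a reward in [0, 1] usable by PPO. Out-of-scale ratings are clipped."""
    rating = max(lo, min(hi, float(rating)))
    return (rating - lo) / (hi - lo)
```

The scalar reward drives the PPO update, while the underlying natural-language description remains available for interpretability, which is the framework's central selling point over metrics like SI-SNR.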
The experimental evaluation is robust, utilizing the 4th COG-MHEAR AVSE Challenge dataset, which provides a comprehensive benchmark for assessing the proposed method's performance. The results demonstrate significant improvements over both a supervised baseline and a DNSMOS-based RL baseline across multiple objective and subjective metrics, including PESQ and STOI. The subjective listening tests further validate the effectiveness of the LLM-based reward model, showcasing its ability to enhance perceived speech quality. The experiments are well-structured, but additional comparisons with other state-of-the-art methods could strengthen the claims of superiority.
The paper provides a clear description of the methodology, including the architecture of the AVSE model and the reinforcement learning framework. However, the lack of publicly available code or a demo URL limits reproducibility. Detailed hyperparameter settings and training procedures are mentioned, but sharing the model weights and training scripts would facilitate independent verification of the results.
The primary limitation identified is the reliance on a specific LLM (SALMONN) for generating reward signals, which may not generalize well across different speech enhancement tasks or datasets. Additionally, the repetitive nature of the generated natural language descriptions could hinder the model's ability to capture subtle differences in speech quality. Future work should address these limitations by exploring more advanced LLMs and refining prompt engineering strategies.
The proposed framework has the potential to significantly impact the field of audio-visual speech enhancement by providing a more interpretable and human-aligned approach to model training. The integration of LLMs into the reinforcement learning paradigm could pave the way for advancements in other multimodal applications, enhancing the interpretability and effectiveness of AI systems in various domains, including assistive technologies for hearing-impaired individuals.
Full-duplex voice agents--systems that listen and speak simultaneously--are rapidly moving from research to production. However, existing evaluations address conversational dynamics and task completion in isolation. We introduce $\tau$-voice, a benchmark for evaluating voice agents on grounded tasks with real-world complexity: agents must navigate complex multi-turn conversations, adhere to domain policies, and interact with the environment. The framework extends $\tau^2$-bench into a novel voice agent benchmark combining verifiable completion of complex grounded tasks, full-duplex interaction, and realistic audio--enabling direct comparison between voice and text performance. A controllable and realistic voice user simulator provides diverse accents, realistic audio environments, and rich turn-taking dynamics; by decoupling simulation from wall-clock time, the user simulator can use the most capable LLM without real-time constraints. We evaluate task completion (pass@1) and voice interaction quality across 278 tasks: while GPT-5 (reasoning) achieves 85%, voice agents reach only 31--51% under clean conditions and 26--38% under realistic conditions with noise and diverse accents--retaining only 30--45% of text capability. Qualitative analysis confirms that 79--90% of failures stem from agent behavior rather than the simulator, at least under our evaluation setup. $\tau$-voice provides a reproducible testbed for measuring progress toward voice agents that are natural, conversational, and reliable.
Primary: Sierra.ai
All Institutions: Sierra.ai, Princeton University
The main contribution of this paper is the introduction of the $\tau$-voice benchmark, which provides a comprehensive framework for evaluating full-duplex voice agents in real-world scenarios. This work is significant as it highlights the current limitations of voice agents compared to text-based models and sets the stage for future research aimed at enhancing the capabilities of voice technology in complex environments.
The methodology proposed in this paper is robust, introducing the $\tau$-voice benchmark, which effectively combines multi-turn conversational dynamics with grounded task completion. The authors utilize a controllable voice user simulator that enhances realism by incorporating diverse accents and audio environments, a significant advancement over previous benchmarks. Decoupling simulation from wall-clock time allows the use of advanced LLMs without real-time constraints, a clever design that strengthens the evaluation process. However, the paper could benefit from a more detailed explanation of the simulation parameters and how they affect the results.
The experimental setup is comprehensive, evaluating task completion across 278 tasks under varying conditions. The results indicate a significant performance gap between voice agents and text-based models, which is critical for understanding the current limitations of voice technology. The quantitative results are backed by qualitative analysis, which adds depth to the findings. However, the paper could improve by providing more detailed statistics on the types of tasks where voice agents struggle the most.
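The pass@1 metric used across the 278 tasks is simply the fraction of tasks verifiably completed on a single attempt. A minimal sketch, with a made-up outcome vector chosen to land near the low end reported for voice agents under clean conditions:

```python
def pass_at_1(results: list[bool]) -> float:
    """Fraction of tasks whose single attempt passed the verifiable success check."""
    return sum(results) / len(results)

# Hypothetical example: 86 of 278 tasks completed -> roughly 31%
outcomes = [True] * 86 + [False] * (278 - 86)
print(f"pass@1 = {pass_at_1(outcomes):.2%}")
```

The benchmark's actual checks are task-specific (policy adherence, environment state), but they all reduce to a boolean per task before this aggregation.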
The paper includes a link to the project repository, which is a positive aspect for reproducibility. However, the details regarding the implementation of the voice user simulator and the specific configurations used in experiments are somewhat sparse. More thorough documentation and guidelines would enhance reproducibility for future researchers looking to build upon this work.
The primary limitation identified is the performance disparity between voice agents and text-based models, which raises questions about the current capabilities of voice technology in real-world applications. Additionally, the focus on clean and realistic conditions may not fully capture the variability encountered in everyday use, potentially skewing the results. The authors also acknowledge that a significant portion of failures stems from agent behavior, indicating a need for further research into improving agent design.
The $\tau$-voice benchmark has the potential to significantly influence the development of voice agents, particularly in applications requiring natural and reliable interactions. By providing a structured evaluation framework, it encourages researchers to focus on improving conversational dynamics and task completion in voice systems, which could lead to advancements in various domains, including customer service, education, and accessibility.
Diffusion- and flow-matching-based TTS face a tension between discrete temporal structure and continuous spectral modeling. Two-stage models diffuse on fixed alignments, often collapsing to mean prosody; single-stage models avoid explicit durations but suffer alignment instability. We propose a jump-diffusion framework where discrete jumps model temporal structure and continuous diffusion refines spectral content within one process. Even in its one-shot degenerate form, our framework achieves 3.37% WER vs. 4.38% for Grad-TTS with improved UTMOSv2 on LJSpeech. The full iterative UDD variant further enables adaptive prosody, autonomously inserting natural pauses in out-of-distribution slow speech rather than stretching uniformly. Audio samples are available at https://anonymousinterpseech.github.io/TTS_Demo/.
Primary: School of Computer Science
All Institutions: School of Computer Science
The paper presents a novel jump-diffusion framework that unifies discrete temporal structure modeling and continuous spectral refinement for TTS. This comprehensive analysis highlights the technical contributions, innovative methodology, and significant implications for the field of speech synthesis.
The proposed jump-diffusion framework effectively integrates discrete temporal structure modeling with continuous spectral refinement, addressing the limitations of existing two-stage and single-stage TTS models. The Upsample-Diffuse-Downsample (UDD) strategy is particularly innovative, allowing for efficient reuse of pretrained networks while maintaining performance. The methodology is well-structured, with clear definitions of the forward and reverse processes, and the use of a Location Predictor and Content Predictor enhances the model's flexibility in generating speech.
The experiments conducted on the LJSpeech dataset are thorough, comparing the proposed model against established baselines like Grad-TTS. The reported results, including a significant reduction in word error rate (WER) and improvements in naturalness metrics, demonstrate the effectiveness of the jump-diffusion framework. The adaptive prosody feature, particularly in out-of-distribution scenarios, showcases the model's practical applicability in real-world speech synthesis.
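The word error rate (WER) figures cited here follow the standard definition: word-level Levenshtein distance normalized by reference length. A self-contained sketch (the example sentences are made up, not from the paper's evaluation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion in six words
```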
The paper provides sufficient implementation details, including the architecture choices and training procedures, which facilitate reproducibility. However, the reliance on pretrained models and the absence of a public code repository may hinder some aspects of reproducibility for the broader research community.
While the jump-diffusion framework shows promise, it currently focuses on temporal structure without addressing potential improvements in spectral content refinement. The model's performance in more complex scenarios, such as multi-speaker synthesis or spontaneous speech, remains to be fully evaluated. Additionally, the lack of a comprehensive comparison with more recent TTS models may limit the contextual understanding of its advantages.
This research has the potential to significantly advance the field of text-to-speech synthesis by improving the naturalness and intelligibility of generated speech. The ability to adaptively insert pauses and handle varying speech rates could enhance user experiences in applications such as virtual assistants, audiobooks, and accessibility tools. The findings may also inspire further research into integrating discrete and continuous modeling approaches in other generative tasks.
Objective estimators of multimedia quality are often judged by comparing estimates with subjective "truth data," most often via Pearson correlation coefficient (PCC) or mean-squared error (MSE). But subjective test results contain noise, so striving for a PCC of 1.0 or an MSE of 0.0 is neither realistic nor repeatable. Numerous efforts have been made to acknowledge and appropriately accommodate subjective test noise in objective-subjective comparisons, typically resulting in new analysis frameworks and figures-of-merit. We take a different approach. By making only basic assumptions, we derive bounds on PCC and MSE that can be expected for a subjective test. Consistent with intuition, these bounds are functions of subjective vote variance. When a subjective test includes vote variance information, the calculation of the bounds is easy, and in this case we say the resulting bounds are "fully data-driven." We provide two options for calculating bounds in cases where vote variance information is not available. One option is to use vote variance information from other subjective tests that do provide such information, and the second option is to use a model for subjective votes. Thus we introduce a binomial-based model for subjective votes (BinoVotes) that naturally leads to a mean opinion score (MOS) model, named BinoMOS, with several desirable properties. BinoMOS reproduces the discrete nature of MOS values and its dependence on the number of votes per file. This modeling provides the vote variance information required by the PCC and MSE bounds, and we validate it against data from 18 subjective tests. The modeling yields PCC and MSE bounds that agree very well with those found from the data directly. These results allow one to set expectations for the PCC and MSE that might be achieved for any subjective test, even those where vote variance information is not available.
Primary: Institute for Telecommunication Sciences
All Institutions: Institute for Telecommunication Sciences, National Telecommunications and Information Administration
This paper makes a meaningful contribution to the field of multimedia quality assessment by deriving actionable bounds for the performance of objective estimators based on subjective test results. The introduction of the BinoVotes model presents a novel perspective that enhances the understanding of the relationship between subjective and objective measurements, offering valuable insights for future research and applications in the area.
The paper presents a novel approach to deriving bounds on the Pearson correlation coefficient (PCC) and mean-squared error (MSE) for subjective tests in multimedia quality assessment. The authors introduce the BinoVotes model, which is based on the binomial distribution, to effectively capture the discrete nature of subjective ratings. This model allows for the derivation of bounds that are grounded in the intrinsic properties of the voting process, making the methodology both intuitive and mathematically sound. The approach is distinct from previous work that often relies on more complex models or assumptions about the distribution of votes.
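The binomial vote model described above can be sketched concretely. The exact parameterization in the paper may differ; this sketch assumes 5-point absolute-category-rating votes drawn as 1 + Binomial(4, p), which reproduces the discrete, bounded nature of subjective votes and gives a per-file vote variance for the bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

def binovotes(p: float, n_votes: int, rng) -> np.ndarray:
    # 5-point votes modeled as 1 + Binomial(4, p): discrete, bounded in 1..5
    return 1 + rng.binomial(4, p, size=n_votes)

# One "file" with underlying quality parameter p, rated by 24 subjects
votes = binovotes(p=0.7, n_votes=24, rng=rng)
mos = votes.mean()              # BinoMOS-style MOS estimate for this file
vote_var = votes.var(ddof=1)    # sample vote variance feeding the PCC/MSE bounds

# Under this model the vote variance is 4*p*(1-p), so the variance of the MOS
# shrinks as 1/n_votes -- the dependence on votes-per-file the paper models.
print(mos, vote_var, 4 * 0.7 * 0.3)
```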
The authors validate their theoretical bounds using data from 18 subjective tests, demonstrating that their BinoVotes and BinoMOS models yield results that align well with empirical data. The experiments are comprehensive, covering a range of multimedia types and subjective testing conditions, which strengthens the credibility of their findings. However, the paper could benefit from additional experiments that explore the performance of objective estimators across diverse datasets beyond the 18 tests analyzed.
The paper provides a GitHub repository for the implementation of the BinoVotes model and the associated bounds, which enhances reproducibility. However, detailed descriptions of the datasets used in the experiments, including their characteristics and how they were processed, are somewhat limited. More thorough documentation would facilitate better reproducibility of the results.
One limitation of the study is the reliance on the BinoVotes model for cases where vote variance information is not available, which may lead to overestimation of vote variance in some scenarios. Additionally, while the bounds derived are informative, they do not account for the potential biases introduced by the subjective nature of the voting process. Future work could explore incorporating individual subject biases into the model for a more nuanced understanding of vote variance.
The findings of this paper have significant implications for the development of objective quality estimators in multimedia applications. By providing realistic performance bounds, researchers can better understand the limitations of their models and set achievable goals for improvement. The methodology could be applied to various domains beyond multimedia, including any field that relies on subjective assessments, thus broadening its impact.
Voice anonymization masks vocal traits while preserving linguistic content, which may still leak speaker-specific patterns. To assess and strengthen privacy evaluation, we propose a dual-stream attacker that fuses spectral and self-supervised learning features via parallel encoders with a three-stage training strategy. Stage I establishes foundational speaker-discriminative representations. Stage II leverages the shared identity-transformation characteristics of voice conversion and anonymization, exposing the model to diverse converted speech to build cross-system robustness. Stage III provides lightweight adaptation to target anonymized data. Results on the VoicePrivacy Attacker Challenge (VPAC) dataset demonstrate that Stage II is the primary driver of generalization, enabling strong attack performance on unseen anonymization datasets. With Stage III, fine-tuning on only 10% of the target anonymization dataset surpasses current state-of-the-art attackers in terms of EER.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, NVIDIA AI Technology Centre, Singapore Institute of Technology
The main contribution of this paper is the introduction of DAST, a dual-stream voice anonymization attacker that utilizes a novel three-stage training strategy to enhance the robustness of voice anonymization systems against re-identification attacks. This work represents a meaningful advancement in the field of voice privacy, combining innovative methodology with rigorous experimental validation to address a critical issue in audio machine learning.
The proposed DAST architecture employs a dual-stream model that effectively combines spectral and self-supervised learning features through a three-stage training strategy. This approach is innovative as it not only leverages the strengths of both feature types but also introduces a structured training curriculum that progressively builds the model's capabilities. The staged training allows for robust generalization across different anonymization systems, which is a significant improvement over existing methods that typically focus on single-system training. The methodology is well-justified with theoretical backing and empirical validation through ablation studies.
The experiments are rigorously designed, utilizing the VoicePrivacy Attacker Challenge (VPAC) dataset to evaluate the model's performance. The results demonstrate that the DAST model outperforms existing state-of-the-art attackers, particularly in terms of equal error rates (EER). The systematic evaluation of each training stage provides clear insights into the contributions of the dual-stream architecture and the effectiveness of the three-stage training approach. The use of diverse datasets for training and testing further strengthens the validity of the results.
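The equal error rate (EER) used to compare attackers is the operating point where the false-accept and false-reject rates meet. A minimal sketch with made-up similarity scores (real evaluations sweep thousands of trials):

```python
import numpy as np

def eer(target_scores, nontarget_scores) -> float:
    """EER: where false-accept rate (nontargets accepted) meets false-reject rate."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.unique(np.concatenate([target, nontarget]))
    far = np.array([(nontarget >= t).mean() for t in thresholds])  # false accepts
    frr = np.array([(target < t).mean() for t in thresholds])      # false rejects
    i = int(np.argmin(np.abs(far - frr)))
    return (far[i] + frr[i]) / 2

# Well-separated target/nontarget scores give a low EER;
# heavy overlap pushes it toward 0.5 (chance).
print(eer([0.9, 0.8, 0.7], [0.2, 0.1, 0.3]))
```

For an attacker, a *higher* EER on anonymized speech means better privacy protection, so a strong attacker drives the measured EER down.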
The paper includes detailed descriptions of the experimental setup, including the datasets used, training configurations, and evaluation metrics. However, it lacks a direct link to code or models, which could hinder reproducibility. The authors mention plans to release pre-trained models upon acceptance, which is a positive step towards facilitating reproducibility.
One limitation is the reliance on the specific datasets used for training and evaluation, which may not fully capture the diversity of real-world anonymization scenarios. Additionally, while the model shows strong performance on the VPAC dataset, its effectiveness on other anonymization systems or in practical applications remains to be fully assessed. The paper does not address potential ethical concerns related to the misuse of voice anonymization attacks.
The research has significant implications for privacy protection in voice communication, particularly as voice technologies become more prevalent. By improving the robustness of voice anonymization systems against re-identification attacks, the work contributes to enhancing user privacy and security. The findings could influence future designs of anonymization systems and inform policy discussions around voice data privacy.
Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate control that can be updated mid-utterance on the fly. VoXtream2 combines a distribution matching mechanism over duration states with classifier-free guidance across conditioning signals to improve controllability and synthesis quality. Prompt-text masking enables textless audio prompting, removing the need for prompt transcription. Across standard zero-shot benchmarks and a dedicated speaking-rate test set, VoXtream2 achieves competitive objective and subjective results against public baselines despite a smaller model and less training data. In full-stream mode, it runs 4 times faster than real time with 74 ms first-packet latency on a consumer GPU.
Primary: KTH Royal Institute of Technology
All Institutions: KTH Royal Institute of Technology
VoXtream2 presents a significant advancement in full-stream TTS systems by introducing dynamic speaking rate control and improving synthesis quality through innovative methodologies. The combination of low latency, high intelligibility, and adaptability positions this work as a valuable contribution to the field of machine learning and audio synthesis, with potential applications in various interactive systems.
The methodology presented in VoXtream2 is innovative, particularly in its approach to dynamic speaking rate control and the integration of classifier-free guidance (CFG) for improved speech synthesis. The use of distribution matching over duration states and prompt text masking to enable textless audio prompting demonstrates a significant advancement in TTS systems, addressing key limitations of previous models. The architecture builds on the earlier VoXtream model but introduces critical enhancements that allow for real-time, low-latency speech generation while maintaining high intelligibility and voice cloning capabilities. The detailed description of the model architecture, including the use of the International Phonetic Alphabet (IPA) and the autoregressive Temporal Transformer (TT), illustrates a thoughtful design process aimed at achieving both speed and quality.
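Classifier-free guidance is described here only at a high level. As a point of reference, the standard single-signal CFG combination is sketched below; VoXtream2 applies guidance across multiple conditioning signals, so this is the textbook form, not the paper's exact formulation.

```python
import numpy as np

def cfg(pred_uncond: np.ndarray, pred_cond: np.ndarray, scale: float) -> np.ndarray:
    # Extrapolate from the unconditional prediction toward the conditional one;
    # scale > 1 over-emphasizes the conditioning signal.
    return pred_uncond + scale * (pred_cond - pred_uncond)

u = np.array([0.0, 1.0])   # unconditional model output (toy values)
c = np.array([1.0, 1.0])   # conditional model output
print(cfg(u, c, 2.0))
```

With several conditioning signals (text, prompt audio, speaking rate), a multi-signal variant applies a separate guidance term per dropped condition; the single-term form above shows the underlying mechanism.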
The experimental evaluation is robust, utilizing a variety of datasets, including the Emilia spontaneous speech dataset and the HiFiTTS-2 dataset. The results are compared against several state-of-the-art models, showcasing competitive performance in both objective and subjective metrics. The paper includes comprehensive evaluations of static and dynamic speaking rate control, with clear metrics for intelligibility, speaker similarity, and naturalness. The use of both human evaluations and objective metrics like WER, SPK-SIM, and UTMOS adds credibility to the findings. However, the reliance on specific datasets may limit the generalizability of the results.
The paper provides sufficient detail regarding the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. The authors mention using a specific implementation of the Llama-3.2 transformer, which is beneficial, but further details on hyperparameters and training conditions would enhance reproducibility.
The paper acknowledges several limitations, including the influence of the acoustic prompt's speaking rate on the generated speech rate and the complexity of the data preprocessing pipeline. The model's performance may degrade under certain conditions, particularly when generating speech at extreme speaking rates. Additionally, while the dynamic speaking rate control is a significant advancement, the model still exhibits some dependency on the prompt rate, which could be a barrier to achieving fully independent control.
VoXtream2 has the potential to significantly impact the development of conversational agents and real-time TTS applications, particularly in scenarios requiring low latency and high adaptability. The ability to generate speech that closely mimics human-like dynamics in speaking rate could enhance user experience in voice-driven interfaces, making interactions more natural and engaging. Furthermore, the advancements in voice cloning and intelligibility could have applications in accessibility technologies and personalized voice synthesis.
Audio deepfake model attribution aims to mitigate the misuse of synthetic speech by identifying the source model responsible for generating a given audio sample, enabling accountability and informing vendors. The task is challenging, but self-supervised learning (SSL)-derived acoustic features have demonstrated state-of-the-art attribution capabilities; the factors driving their success and the limits of their discriminative power nonetheless remain unclear. In this paper, we systematically investigate how SSL-derived features capture architectural signatures in audio deepfakes. By controlling multiple dimensions of the audio generation process, we reveal how subtle perturbations in model checkpoints, text prompts, vocoders, or speaker identity influence attribution. Our results provide new insights into the robustness, biases, and limitations of SSL-based deepfake attribution, highlighting both its strengths and vulnerabilities in realistic scenarios.
Primary: Technical University of Cluj-Napoca
All Institutions: Technical University of Cluj-Napoca, POLITEHNICA Bucharest
The main contribution of this paper is a systematic investigation into the strengths and weaknesses of SSL-derived features for audio deepfake model attribution, revealing critical insights into the robustness and biases of these models. This work is significant as it addresses a pressing societal challenge by enhancing the understanding of audio deepfake attribution, thus contributing to the broader field of audio forensics and accountability in AI-generated content.
The paper employs a systematic approach to investigate the effectiveness of self-supervised learning (SSL) features in audio deepfake model attribution. The authors meticulously control various factors such as model checkpoints, text prompts, vocoders, and speaker identity, which is a significant methodological strength. The use of multiple architectures and the kNN-based attribution system allows for a clear analysis of how SSL features perform under different conditions. However, the reliance on a single dataset (LJSpeech) may limit the generalizability of the findings.
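The kNN-based attribution system mentioned above can be sketched as follows. The embeddings here are random placeholders standing in for SSL features (e.g., from a pretrained encoder), and the model names are hypothetical; the point is the shape of nearest-neighbor attribution over a labeled reference set.

```python
import numpy as np

rng = np.random.default_rng(0)
ref_embs = rng.normal(size=(6, 8))   # 6 reference clips, 8-dim placeholder embeddings
ref_labels = ["modelA", "modelA", "modelB", "modelB", "modelC", "modelC"]

def attribute(query: np.ndarray, k: int = 3) -> str:
    """Attribute a clip to the source model of its k nearest reference embeddings."""
    # Cosine similarity between the query and every reference embedding
    sims = ref_embs @ query / (np.linalg.norm(ref_embs, axis=1) * np.linalg.norm(query))
    top = np.argsort(sims)[-k:]                 # indices of the k most similar refs
    votes = [ref_labels[i] for i in top]
    return max(set(votes), key=votes.count)     # majority vote over neighbor labels

# A query near a modelA reference should be attributed to modelA
query = ref_embs[0] + 0.01 * rng.normal(size=8)
print(attribute(query, k=1))
```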
The experiments are well-structured, utilizing a comprehensive evaluation protocol that includes both in-domain and out-of-domain scenarios. The results are presented with clarity, showcasing the performance of the models across various conditions. The use of macro F1-scores for evaluation is appropriate, but the paper could benefit from additional metrics to provide a more nuanced understanding of the models' performance. The confusion matrices and detailed analysis of the results enhance the robustness of the findings.
The paper provides sufficient details regarding the training protocols, architectures, and evaluation methods, which supports reproducibility. The authors mention that all trained models and generated audio samples are available upon request, which is a positive aspect for researchers looking to replicate or build upon this work. However, the lack of a public repository for the code and models limits immediate accessibility.
One notable limitation is the focus on a single dataset, which may not capture the full variability present in real-world audio deepfakes. Additionally, while the authors explore various perturbations, the study does not address the potential impact of more complex factors such as background noise or emotional tone in the audio samples. The paper also acknowledges the need for further investigation into zero-shot voice cloning and other architectures, indicating that the current findings may not be exhaustive.
The implications of this research are significant, particularly in the context of combating the misuse of synthetic speech in fraud and misinformation. By improving model attribution capabilities, the work contributes to the development of accountability measures in AI-generated content. The insights gained from this study could inform future research and practical applications in audio forensics, security, and ethical AI deployment.
SpeechLLMs typically combine ASR-trained encoders with text-based LLM backbones, leading them to inherit written-style output patterns unsuitable for text-to-speech synthesis. This mismatch is particularly pronounced in Japanese, where spoken and written registers differ substantially in politeness markers, sentence-final particles, and syntactic complexity. We propose a preference-based alignment approach to adapt Japanese SpeechLLMs for speech-worthy outputs: text that is concise, conversational, and readily synthesized as natural speech. To rigorously evaluate this task, we introduce SpokenElyza, a benchmark for Japanese speech-worthiness derived from ELYZA-tasks-100 with auditory verification by native experts. Experiments show that our approach achieves substantial improvement on SpokenElyza while largely preserving performance on the original written-style evaluation. We will release SpokenElyza to support future research on Japanese spoken dialog systems.
Primary: SB Intuitions
All Institutions: SB Intuitions
The paper presents a novel approach to adapting Japanese SpeechLLMs for speech-worthy outputs, introducing the SpokenElyza benchmark and demonstrating the effectiveness of preference-based alignment. This contribution is significant for advancing the field of audio processing and spoken dialog systems, particularly in languages with distinct spoken and written forms.
The paper introduces a novel preference-based alignment approach for adapting Japanese SpeechLLMs to produce speech-worthy outputs, addressing a significant gap in the existing literature. The methodology is well-structured, utilizing Direct Preference Optimization (DPO) combined with supervised fine-tuning (SFT) to balance the generation of conversationally natural outputs while maintaining instruction-following capabilities. The construction of the SpokenElyza benchmark is a notable strength, as it incorporates human verification to ensure the quality of the generated speech-worthy text.
The experiments are comprehensive, comparing the performance of the proposed method against both the original written-style outputs and the newly created SpokenElyza dataset. The results demonstrate a clear improvement in speech-worthiness while preserving performance on traditional benchmarks, indicating the effectiveness of the proposed approach. The use of LLM-as-Judge for evaluation is a sound choice, although the paper could benefit from more detailed statistical analysis of the results.
The paper provides a clear description of the model architecture and training procedures, which aids in reproducibility. However, the lack of publicly available code or a demo limits the ability for other researchers to replicate the experiments fully. The paper mentions the use of in-house datasets and models, which may not be accessible to the broader community.
One limitation is the focus on the Japanese language, which may restrict the generalizability of the findings to other languages with different spoken and written divergences. Additionally, while the paper addresses the speech-worthiness of responses, it does not explore the potential impact of cultural nuances in conversational styles across different regions in Japan.
The proposed methods and the SpokenElyza benchmark have significant implications for the development of more effective Japanese spoken dialog systems, which can enhance user interactions in various applications, such as virtual assistants and customer service bots. The approach could also inspire similar methodologies in other languages, potentially leading to advancements in multilingual speech synthesis systems.
Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme temporal asymmetry between modalities and the resulting token sparsity. We address the inherent information density gap by introducing an Asymmetric Block-Causal Alignment with a zero-truncation Global Prefix that prevents multi-modal synchronization drift. The gradient explosion caused by extreme audio token sparsity during the causal shift is further resolved through an Audio Sink Token mechanism equipped with an Identity RoPE constraint. Finally, a Joint Self-Forcing Distillation paradigm enables the model to dynamically self-correct cumulative cross-modal errors from exposure bias during long rollouts. Empowered by a modality-independent rolling KV-cache inference scheme, OmniForcing achieves state-of-the-art streaming generation at ~25 FPS on a single GPU, maintaining multi-modal synchronization and visual quality on par with the bidirectional teacher. Project page: https://omniforcing.com
Primary: Peking University
All Institutions: JD Explore Academy, Fudan University, Peking University, The University of Hong Kong
The paper presents OmniForcing, a novel framework for real-time joint audio-visual generation that effectively addresses the challenges of latency and training instability in existing models. Its innovative methodologies and comprehensive experimental evaluations position it as a significant contribution to the field of machine learning and multimedia generation.
The proposed OmniForcing framework is a significant advancement in real-time joint audio-visual generation, addressing the latency issues of existing models through innovative techniques such as Asymmetric Block-Causal Alignment and Audio Sink Tokens with Identity RoPE. The methodology is well-structured, with a clear focus on overcoming the challenges of temporal asymmetry and training instability in dual-stream architectures. The introduction of a Joint Self-Forcing Distillation paradigm is particularly noteworthy, as it allows the model to dynamically correct cross-modal errors, enhancing the robustness of the generation process.
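The modality-independent rolling KV-cache the abstract mentions is not specified in detail; a minimal sketch of the idea, assuming a fixed sliding window of cached blocks kept separately per modality (class and parameter names are hypothetical, and real caches hold attention key/value tensors rather than strings):

```python
from collections import deque

class RollingKVCache:
    """Fixed-window cache kept separately per modality, so the audio and
    video streams can be evicted on independent schedules."""

    def __init__(self, window_blocks):
        self.window_blocks = window_blocks  # modality -> max cached blocks
        self.caches = {}                    # modality -> deque of blocks

    def append(self, modality, block):
        cache = self.caches.setdefault(
            modality, deque(maxlen=self.window_blocks[modality]))
        cache.append(block)  # oldest block is dropped automatically

    def context(self, modality):
        return list(self.caches.get(modality, ()))

# Illustrative asymmetric windows: video keeps fewer blocks than audio.
kv = RollingKVCache({"video": 2, "audio": 4})
for t in range(5):
    kv.append("video", f"v{t}")
    kv.append("audio", f"a{t}")
```

The per-modality window is what keeps memory bounded during long rollouts, which is how a streaming generator can sustain a constant frame rate.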
The experiments are comprehensive, comparing OmniForcing against both bidirectional models and cascaded autoregressive baselines. The evaluation metrics are well-defined, focusing on visual quality, audio fidelity, and real-time inference efficiency. The results demonstrate that OmniForcing achieves state-of-the-art performance, significantly reducing latency while maintaining high-quality outputs, which is crucial for real-time applications.
The paper provides detailed implementation details, including training setups and hyperparameters, which enhances reproducibility. However, the lack of a publicly available code repository or demo may hinder independent verification of results.
One limitation is the inherent trade-off between streaming capability and the full-sequence attention of the original bidirectional model, which may lead to slight reductions in consistency and synchrony compared to the teacher model. Additionally, the reliance on a specific architecture (LTX-2) may limit the generalizability of the findings to other models.
The work has significant implications for real-time applications in multimedia content creation, gaming, and interactive media, where low-latency audio-visual generation is essential. By enabling efficient streaming of synchronized audio and video, OmniForcing could facilitate advancements in various fields, including virtual reality and live performance technologies.
Human-computer interaction has traditionally relied on the acoustic channel, a dependency that introduces systemic vulnerabilities to environmental noise, privacy constraints, and physiological speech impairments. Silent Speech Interfaces (SSIs) emerge as a transformative paradigm that bypasses the acoustic stage by decoding linguistic intent directly from the neuro-muscular-articulatory continuum. This review provides a high-level synthesis of the SSI landscape, transitioning from traditional transducer-centric analysis to a holistic intent-to-execution taxonomy. We systematically evaluate sensing modalities across four critical physiological interception points: neural oscillations, neuromuscular activation, articulatory kinematics (ultrasound/magnetometry), and pervasive active probing via acoustic or radio-frequency sensing. Critically, we analyze the current paradigm shift from heuristic signal processing to Latent Semantic Alignment. In this new era, Large Language Models (LLMs) and deep generative architectures serve as high-level linguistic priors to resolve the "informational sparsity" and non-stationarity of biosignals. By mapping fragmented physiological gestures into structured semantic latent spaces, modern SSI frameworks have, for the first time, approached the Word Error Rate usability threshold required for real-world deployment. We further examine the transition of SSIs from bulky laboratory instrumentation to "invisible interfaces" integrated into commodity-grade wearables, such as earables and smart glasses. Finally, we outline a strategic roadmap addressing the "user-dependency paradox" through self-supervised foundation models and define the ethical boundaries of "neuro-security" to protect cognitive liberty in an increasingly interfaced world.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology, Hunan Normal University, Hunan University
The paper provides a comprehensive taxonomy and systematic review of Silent Speech Interfaces, highlighting the transition from traditional acoustic-based systems to innovative modalities leveraging Large Language Models. This work is significant as it outlines the potential of SSIs to enhance communication for diverse populations while addressing critical ethical considerations in the field.
The paper presents a comprehensive taxonomy of Silent Speech Interfaces (SSIs), detailing various sensing modalities and their physiological interception points. It transitions from traditional signal processing to modern approaches utilizing Large Language Models (LLMs) for semantic alignment. The methodology is robust, integrating diverse sensing techniques and advanced machine learning architectures, including deep generative models and self-supervised learning. However, the paper could benefit from more empirical validation of the proposed frameworks and a clearer delineation of the methodologies employed in existing studies.
While the paper provides a thorough review of existing literature and benchmarks, it lacks original experimental results or novel datasets. The comprehensive analysis of existing benchmarks, including performance metrics and comparison across modalities, is commendable. However, the absence of new empirical data limits the ability to assess the practical effectiveness of the proposed frameworks.
The paper does not provide specific implementation details or code repositories, which raises concerns about reproducibility. Although it discusses various methodologies and benchmarks, the lack of a clear path for others to replicate the findings diminishes the overall impact of the work.
The review primarily synthesizes existing literature without introducing new experimental findings. Additionally, while it addresses the ethical implications of SSIs, the discussion could be expanded to include more concrete examples of potential misuse or societal impacts. The paper also does not provide a detailed roadmap for future research, which could guide subsequent studies in this rapidly evolving field.
The potential applications of SSIs are significant, ranging from assistive technologies for individuals with speech impairments to secure communication in sensitive environments. The integration of LLMs into SSI frameworks could revolutionize human-computer interaction, making it more inclusive and privacy-preserving. However, the ethical considerations surrounding neuro-security and cognitive liberty must be carefully addressed to prevent misuse.
Audio-Visual Speech Recognition (AVSR) leverages both acoustic and visual information for robust recognition under noise. However, how models balance these modalities remains unclear. We present Dr. SHAP-AV, a framework using Shapley values to analyze modality contributions in AVSR. Through experiments on six models across two benchmarks and varying SNR levels, we introduce three analyses: Global SHAP for overall modality balance, Generative SHAP for contribution dynamics during decoding, and Temporal Alignment SHAP for input-output correspondence. Our findings reveal that models shift toward visual reliance under noise yet maintain high audio contributions even under severe degradation. Modality balance evolves during generation, temporal alignment holds under noise, and SNR is the dominant factor driving modality weighting. These findings expose a persistent audio bias, motivating ad-hoc modality-weighting mechanisms and Shapley-based attribution as a standard AVSR diagnostic.
Primary: Imperial College London
All Institutions: Imperial College London, NatWest AI Research
The main contribution of this paper is the introduction of Dr. SHAP-AV, a framework that utilizes Shapley values to analyze and decode the contributions of audio and visual modalities in speech recognition under varying noise conditions. This work significantly advances the understanding of modality interactions in AVSR, providing a foundation for future research and practical applications in the field.
The paper introduces Dr. SHAP-AV, a framework that utilizes Shapley values to analyze the contributions of audio and visual modalities in AVSR. The methodology is well-structured, with three distinct analyses (Global SHAP, Generative SHAP, and Temporal Alignment SHAP) that provide a comprehensive understanding of modality contributions. The use of Shapley values is innovative in this context, as it allows for a nuanced exploration of how models adapt their reliance on audio and visual inputs under varying noise conditions. The framework is theoretically sound and leverages established concepts in game theory to address a practical problem in AVSR.
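With only two modalities, Shapley values have a simple closed form: each modality's contribution averages its marginal effect over both orders of arrival. A minimal sketch, where the coalition payoffs (e.g. recognition accuracy with only those inputs present) are toy numbers, not the paper's results:

```python
def shapley_two_modalities(v):
    """Exact Shapley values for a two-player game over {audio, visual}.
    `v` maps each coalition (a frozenset) to its payoff."""
    A, V = frozenset({"audio"}), frozenset({"visual"})
    empty, full = frozenset(), A | V
    # Average the marginal contribution over both join orders.
    phi_audio = 0.5 * (v[A] - v[empty]) + 0.5 * (v[full] - v[V])
    phi_visual = 0.5 * (v[V] - v[empty]) + 0.5 * (v[full] - v[A])
    return phi_audio, phi_visual

# Toy payoffs: 0 with no input, 0.70 audio-only, 0.40 visual-only,
# 0.90 with both. By efficiency, the Shapley values sum to v(full).
v = {frozenset(): 0.0,
     frozenset({"audio"}): 0.70,
     frozenset({"visual"}): 0.40,
     frozenset({"audio", "visual"}): 0.90}
phi_a, phi_v = shapley_two_modalities(v)
```

The efficiency property (contributions summing to the full-coalition payoff) is what makes these attributions comparable across models and SNR levels.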
The experiments conducted on six models across two benchmarks and varying signal-to-noise ratios (SNR) are robust and provide valuable insights into the dynamics of modality contributions. The results demonstrate a clear shift towards visual reliance in noisy environments while maintaining audio contributions, which is a significant finding. The experiments are well-designed, and the analysis of results is thorough, providing a solid basis for the conclusions drawn.
The paper lacks detailed implementation specifics that would facilitate reproducibility, such as hyperparameters, data preprocessing steps, and model architectures. While the project website may provide additional resources, the absence of a code repository or detailed methodological appendices limits the ability of other researchers to replicate the findings.
One limitation is the potential bias towards audio contributions that the authors identify, which may not be fully addressed within the framework. Additionally, the reliance on specific benchmarks may limit the generalizability of the findings to other AVSR tasks or datasets. The paper could also benefit from a broader exploration of how these findings could influence future AVSR model designs.
The findings have significant implications for the development of more robust AVSR systems, particularly in real-world applications where noise is prevalent. By highlighting the modality contributions and their dynamics, this research can inform the design of models that better balance audio and visual inputs, potentially leading to improved performance in challenging environments.
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
Primary: Korea Advanced Institute of Science and Technology (KAIST)
All Institutions: Korea Advanced Institute of Science and Technology (KAIST)
The paper presents a novel training framework, RAF, that enhances GAN vocoders' performance in speech synthesis by leveraging self-supervised learning and relativistic pairing. This work significantly contributes to the field by addressing the limitations of existing training objectives and demonstrating broad applicability across various datasets and vocoder architectures.
The paper introduces a novel training framework called Relativistic Adversarial Feedback (RAF) for GAN-based vocoders, which significantly enhances both in-domain fidelity and generalization to unseen scenarios. The methodology effectively integrates self-supervised learning models to assist discriminators in evaluating sample quality, promoting richer representations in the generator. The use of relativistic pairing for real and fake waveforms is a key innovation that allows for improved modeling of the training data distribution. The framework is well-structured, with clear definitions of the quality gap and discriminator gap, and the adversarial training objective is robustly formulated.
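RAF's exact objective is not given in the abstract; the standard relativistic pairing it builds on compares each real waveform's critic score against its paired fake rather than scoring either in isolation. A minimal sketch with scalar critic outputs (function names and values are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relativistic_d_loss(c_real, c_fake):
    """Discriminator wants the real waveform's critic score to exceed
    the paired fake's score (relativistic real/fake pairing)."""
    return -math.log(sigmoid(c_real - c_fake))

def relativistic_g_loss(c_real, c_fake):
    """Generator wants its fake to out-score the paired real sample."""
    return -math.log(sigmoid(c_fake - c_real))

# When the critic already separates the pair (c_real > c_fake), the
# discriminator loss is small and the generator loss is large.
d_loss = relativistic_d_loss(c_real=2.0, c_fake=-1.0)
g_loss = relativistic_g_loss(c_real=2.0, c_fake=-1.0)
```

Because only the score difference within each pair matters, the critic cannot "win" by inflating scores globally, which is one intuition for the improved modeling of the data distribution.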
The experiments are comprehensive, utilizing multiple datasets to validate the effectiveness of RAF across various GAN-based vocoders. The results demonstrate consistent performance improvements in both objective and subjective metrics, with RAF-trained models outperforming traditional LSGAN models in perceptual quality while using fewer parameters. The inclusion of ablation studies strengthens the evaluation, providing insights into the contributions of different components of the RAF framework.
The authors provide a link to the source code for reproducing results, which is a positive aspect for ensuring reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameter choices and training configurations, to facilitate easier replication by other researchers.
The paper acknowledges limitations regarding the computational costs associated with training RAF due to the use of long segments and heavy SSL models. Additionally, the authors do not explore lightweight alternatives or provide rigorous theoretical explanations for the convergence of RAF. There are also ethical considerations regarding the potential misuse of realistic audio deepfakes generated by the framework.
The proposed RAF framework has significant implications for the field of speech synthesis and neural vocoding, particularly in enhancing the quality and generalization capabilities of GAN-based models. The integration of self-supervised learning models opens avenues for further research in resource-efficient settings and could contribute to advancements in applications such as text-to-speech and voice conversion systems.
Reinforcement Learning (RL) has become an effective paradigm for enhancing Large Language Models (LLMs) and visual generative models. However, its application in text-to-audio (TTA) generation remains largely under-explored. Prior work typically employs offline methods like Direct Preference Optimization (DPO) and leverages Contrastive Language-Audio Pretraining (CLAP) models as reward functions. In this study, we investigate the integration of online Group Relative Policy Optimization (GRPO) into TTA generation. We adapt the algorithm for Flow Matching-based audio models and demonstrate that online RL significantly outperforms its offline counterparts. Furthermore, we incorporate rewards derived from Large Audio Language Models (LALMs), which can provide fine-grained scoring signals that are better aligned with human perception. With only 470M parameters, our final model, Resonate, establishes a new SOTA on TTA-Bench in terms of both audio quality and semantic alignment.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, SJTU Paris Elite Institute of Technology, X-LANCE Lab
The main contribution of this paper is the introduction of Resonate, a novel text-to-audio generator that employs online reinforcement learning and LALMs to achieve state-of-the-art performance in audio quality and semantic alignment. This work represents a significant step forward in the field of generative audio models, combining innovative methodologies with rigorous experimental validation to address existing limitations in TTA generation.
The paper presents a novel integration of online reinforcement learning (RL) into text-to-audio (TTA) generation, specifically through the Group Relative Policy Optimization (GRPO) framework. This approach addresses the limitations of offline RL methods by enabling more dynamic and responsive training that aligns better with human preferences. The use of Large Audio Language Models (LALMs) as reward models is particularly innovative, as it allows for fine-grained feedback that enhances the model's performance. The architecture leverages a Flux-style flow Transformer, which is well-suited for the generative tasks at hand. Overall, the methodology is robust, well-structured, and demonstrates a clear advancement over previous techniques.
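The group-relative step at the heart of GRPO replaces a learned value model with within-group reward normalization: each sampled generation's advantage is its reward standardized against the other samples for the same prompt. A minimal sketch (the reward values are toy numbers standing in for LALM scores):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each generation's reward against
    the mean and (population) std of its own sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four audio samples drawn for one caption, scored by the reward model:
adv = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

The advantages are then used to weight the policy-gradient update, so above-average samples within a group are reinforced and below-average ones suppressed.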
The experiments are comprehensive, utilizing a large-scale audio-text dataset for pre-training and a well-defined evaluation benchmark (TTA-Bench) for assessing model performance. The results indicate that the proposed Resonate model achieves state-of-the-art performance in both audio quality and semantic alignment, outperforming existing models across various metrics. The inclusion of both objective and subjective evaluation methods strengthens the findings, providing a balanced view of the model's capabilities. The ablation studies further validate the effectiveness of the proposed methods, highlighting the advantages of online RL and LALM-based rewards.
The authors have provided clear details regarding the model architecture, training procedures, and evaluation metrics, which supports reproducibility. The availability of code and model weights on GitHub enhances this aspect, allowing other researchers to replicate the study and build upon the work. However, specific hyperparameter settings and the rationale behind certain design choices could be elaborated further to aid in complete reproducibility.
One limitation is the reliance on the quality of the datasets used for training and evaluation, which may affect the generalizability of the results. Additionally, while the model achieves state-of-the-art performance, the computational efficiency and scalability of the approach in real-world applications could be further explored. The paper does not address potential biases in the training data or the implications of using LALMs as reward models.
The advancements presented in this paper have significant implications for various applications, including automated content creation in filmmaking, gaming, and virtual reality. The integration of RL and LALMs in TTA generation could lead to more intuitive and human-aligned audio synthesis, enhancing user experiences across multimedia platforms. Furthermore, the open-sourcing of the model and code promotes collaboration and innovation within the research community.
Neural vocoders have recently advanced waveform generation, yielding natural and expressive audio. Among these approaches, iSTFT-based vocoders have gained attention. They predict a complex-valued spectrogram and then synthesize the waveform via iSTFT, thereby avoiding learned upsampling stages that can increase computational cost. However, current approaches use real-valued networks that process the real and imaginary parts independently. This separation limits their ability to capture the inherent structure of complex spectrograms. We present ComVo, a Complex-valued neural Vocoder whose generator and discriminator use native complex arithmetic. This enables an adversarial training framework that provides structured feedback in complex-valued representations. To guide phase transformations in a structured manner, we introduce phase quantization, which discretizes phase values and regularizes the training process. Finally, we propose a block-matrix computation scheme to improve training efficiency by reducing redundant operations. Experiments demonstrate that ComVo achieves higher synthesis quality than comparable real-valued baselines, and that its block-matrix scheme reduces training time by 25%. Audio samples and code are available at https://hs-oh-prml.github.io/ComVo/.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of ComVo, a complex-valued neural vocoder that enhances waveform generation through a novel adversarial framework and efficient computational techniques. This work represents a significant advancement in the modeling of complex spectrograms, with the potential to improve audio synthesis quality and efficiency in various applications.
The paper presents a novel approach to waveform generation using complex-valued neural networks (CVNNs) in an adversarial framework. The introduction of phase quantization as a structured nonlinearity and the block-matrix computation scheme for efficiency are significant contributions. The methodology effectively integrates complex arithmetic into both the generator and discriminator, allowing for better modeling of the inherent structure of complex spectrograms. The design choices are well-justified, and the paper provides a clear rationale for the benefits of using CVNNs over traditional real-valued networks.
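The paper's specific block-matrix scheme is not spelled out in the abstract, but complex-valued layers are conventionally realized on real hardware via the identity that multiplying by a + bi equals applying the real 2x2 block [[a, -b], [b, a]]; a naive implementation repeats the four real products, which is the kind of redundancy such a scheme can reduce. A minimal sketch of the identity itself:

```python
def complex_mul_as_blocks(a_re, a_im, b_re, b_im):
    """Multiply (a_re + i*a_im)(b_re + i*b_im) via the equivalent real
    block form [[Re, -Im], [Im, Re]] acting on the vector [b_re, b_im]."""
    out_re = a_re * b_re - a_im * b_im
    out_im = a_im * b_re + a_re * b_im
    return out_re, out_im

# (1 + 2i)(3 + 4i) = 3 + 4i + 6i + 8i^2 = -5 + 10i
result = complex_mul_as_blocks(1, 2, 3, 4)  # -> (-5, 10)
```

The same block structure extends elementwise to matrices, so a complex linear layer can be expressed as one structured real matrix multiply instead of four separate real ones.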
The experiments are thorough, comparing the proposed ComVo model against several strong baselines using both subjective (MOS, SMOS, CMOS) and objective metrics (PESQ, MR-STFT error). The results demonstrate that ComVo consistently outperforms real-valued vocoders, achieving higher synthesis quality and reduced training time. The use of diverse datasets and evaluation metrics strengthens the validity of the findings, and the inclusion of qualitative analyses through Grad-CAM visualizations adds depth to the evaluation.
The paper provides sufficient implementation details, including architecture specifications and training setups, which facilitate reproducibility. The availability of audio samples and code on the provided demo URL further supports this aspect. However, the paper could benefit from a more detailed description of hyperparameter tuning and the specific configurations used for each baseline model.
While the paper acknowledges the higher memory footprint associated with complex-valued parameters, it does not explore potential optimizations for multi-GPU training setups, which could enhance scalability. Additionally, the reliance on split-style designs may limit the flexibility of the model, and future work is needed to explore more advanced architectures.
The integration of CVNNs into waveform generation has the potential to significantly advance the field of speech synthesis and audio processing. By improving the quality of generated audio and reducing computational costs, this work could facilitate the development of more efficient and effective neural vocoders, impacting applications in text-to-speech systems, music generation, and other audio-related technologies.
Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments from 187 evaluators with free-form descriptions, and acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of AnimeScore, a novel preference-based framework for evaluating anime-like speech, which combines extensive data collection, acoustic analysis, and advanced ranking models to provide a practical and objective metric for a previously subjective evaluation task. This work significantly advances the field of audio processing and speech synthesis by addressing a unique challenge in evaluating stylistic voice attributes.
The methodology is robust, employing a preference-based framework that collects a substantial dataset of 15,000 pairwise judgments to evaluate 'anime-like' speech. The authors effectively address the challenges of subjective evaluation by utilizing pairwise comparisons, which is more reliable for style-centric attributes. The acoustic analysis is comprehensive, identifying key features that contribute to perceived anime-likeness. The integration of self-supervised learning (SSL) models for ranking demonstrates a forward-thinking approach to automatic evaluation, enhancing the practicality of the framework.
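Pairwise-ranking models of this kind are typically fit with a Bradley-Terry (logistic) likelihood: each clip gets a scalar score, and the probability that one clip beats another depends only on the score difference. A minimal sketch, where the function names and scores are illustrative rather than the paper's implementation:

```python
import math

def bradley_terry_prob(score_i, score_j):
    """Probability that clip i is judged more anime-like than clip j,
    given scalar ranking scores (standard Bradley-Terry model)."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

def pairwise_nll(pairs, scores):
    """Negative log-likelihood of observed (winner, loser) judgments;
    minimizing this over `scores` fits the ranker to the preference data."""
    return -sum(math.log(bradley_terry_prob(scores[w], scores[l]))
                for w, l in pairs)

scores = {"clip_a": 1.5, "clip_b": 0.0}
p = bradley_terry_prob(scores["clip_a"], scores["clip_b"])  # ~0.82
```

The same score difference doubles as a reward signal: in preference-based optimization of a generative model, the ranker's score for a generated clip can be maximized directly.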
The experiments are well-structured, with a clear focus on validating the proposed framework through rigorous testing of various SSL backbones. The results indicate a significant improvement over traditional handcrafted features, with SSL models achieving up to 90.8% AUC. The paper provides detailed statistical analyses, including pairwise accuracy and ROC-AUC metrics, which lend credibility to the findings. However, the paper could benefit from a more extensive exploration of the implications of these results in real-world applications.
The paper mentions that the dataset and implementation are publicly available, which is a positive aspect for reproducibility. However, specific details regarding the training setups, hyperparameters, and model architectures could be elaborated further to enhance replicability. The absence of a clear code repository link may hinder some researchers from fully reproducing the results.
The study acknowledges limitations such as demographic imbalances in the evaluator pool and the moderate scale of the dataset. Additionally, the lack of ablation studies on model structure limits the understanding of how different components contribute to performance. Future work should address these limitations to strengthen the findings.
The proposed framework has significant implications for the anime industry and speech generation systems, providing a standardized metric for evaluating anime-like speech. This could streamline the development process for generative models, allowing for more efficient iteration and optimization. Furthermore, the insights gained from the acoustic analysis could inform future research on voice synthesis and style transfer in audio applications.
Speech Emotion Captioning (SEC) leverages large audio-language models to generate rich, context-aware affective descriptions from speech. However, real-world deployment remains challenging due to the substantial computational demands on resource-constrained edge devices and the privacy risks of transmitting biometric audio. While smaller audio-language models enable efficient on-device SEC, their limited capacity often weakens subtle paralinguistic modeling and fine-grained affective grounding. We propose an edge-cloud collaborative framework based on Uncertainty-Guided Speculative Decoding (UGSD). A lightweight edge model drafts captions locally, and only high-uncertainty token blocks are selectively escalated to a stronger cloud verifier for validation. Experiments on the MER2024 benchmark demonstrate substantial BLEU improvements up to 62.7%. UGSD further achieves 1.4x lower latency and 8.5x higher token throughput compared to an edge-only model. These results empirically characterize the quality-efficiency-privacy trade-off in deployable SEC systems.
Primary: unknown
All Institutions: unknown
The paper presents an innovative edge-cloud collaborative framework for Speech Emotion Captioning that effectively addresses computational and privacy challenges. The methodology and experimental results indicate a strong contribution to the field, although improvements in reproducibility and detailed methodological descriptions are needed for broader adoption and validation.
The proposed methodology introduces an edge-cloud collaborative framework that utilizes Uncertainty-Guided Speculative Decoding (UGSD) to enhance Speech Emotion Captioning (SEC). This approach is innovative as it balances the computational load between edge devices and cloud resources, allowing for efficient processing while maintaining privacy. The method's reliance on uncertainty to determine when to escalate processing to the cloud is a notable contribution, as it addresses both efficiency and privacy concerns effectively. However, the paper could benefit from a more detailed description of the UGSD algorithm and its implementation specifics.
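Since the review notes that UGSD's implementation specifics are left open, the routing rule can only be sketched under assumptions: here, mean token entropy is used as the uncertainty measure and a fixed threshold decides which draft blocks are escalated to the cloud verifier. Function names and the threshold are illustrative.

```python
import math

def block_entropy(token_probs):
    """Mean per-token entropy of a block of draft-model distributions."""
    h = 0.0
    for dist in token_probs:
        h += -sum(p * math.log(p) for p in dist if p > 0)
    return h / len(token_probs)

def route_blocks(blocks, threshold):
    """Keep confident blocks on the edge; flag uncertain ones for the
    cloud verifier. Returns ('edge' | 'cloud', block) decisions."""
    return [("cloud" if block_entropy(b) > threshold else "edge", b)
            for b in blocks]

confident = [[0.97, 0.01, 0.01, 0.01]] * 4   # peaked distributions
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 4   # flat distributions
routes = route_blocks([confident, uncertain], threshold=0.5)
```

Under this rule only the uncertain block leaves the device, which is how the framework can trade a small amount of cloud traffic for caption quality while keeping most audio-derived tokens local.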
The experiments conducted on the MER2024 benchmark are robust, demonstrating significant improvements in BLEU scores and efficiency metrics such as latency and token throughput. The reported BLEU score improvement of up to 62.7% is impressive and indicates a strong performance of the proposed method compared to traditional edge-only models. However, the paper lacks a comprehensive analysis of the datasets used, including the size, diversity, and how they were annotated, which is crucial for understanding the generalizability of the results.
The paper does not provide sufficient details regarding the implementation of the proposed framework, including hyperparameters, model architectures, or training procedures. This lack of transparency hinders reproducibility, as other researchers may struggle to replicate the results without access to the code or detailed methodological descriptions.
One limitation of the proposed approach is its dependency on the quality of the lightweight edge model. If the edge model's performance is subpar, the overall system may not achieve the desired quality in captioning. Additionally, the reliance on cloud resources, while mitigated by the uncertainty-based approach, still poses potential latency issues in real-world applications where immediate responses are required.
The implications of this research are significant, particularly in the context of deploying SEC systems in privacy-sensitive environments. The proposed framework could facilitate the integration of emotion recognition in various applications, such as virtual assistants, mental health monitoring, and interactive entertainment. By addressing the computational and privacy challenges, this work paves the way for more widespread adoption of SEC technologies.
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised finetuning. With 20,000 labeled samples, we achieve 3.24% WER on Common Voice Swahili, an 82% relative improvement over the baseline. This result surpasses the best previously reported academic system (8.3% WER from XLS-R) by 61% relative improvement. We provide concrete data requirements and a replicable methodology applicable to other low-resource languages.
Primary: Harvard University
All Institutions: Harvard University, Thiomi-Lugha NLP
The paper presents a systematic evaluation of continued pretraining for Swahili ASR, achieving state-of-the-art performance with minimal labeled data. This work is significant in demonstrating the potential of leveraging unlabeled audio in low-resource language settings, providing a replicable methodology that could be applied to other underserved languages.
The methodology presented in the paper is robust and systematic, focusing on the application of continued pretraining (CPT) for low-resource Swahili ASR. The authors clearly outline their three-stage pipeline, which includes training a labeling model, generating pseudo-labels, and performing supervised finetuning. This structured approach is well-justified and effectively demonstrates the potential of CPT in leveraging unlabeled data. The use of a strong baseline model for pseudo-labeling is particularly commendable, as it ensures the quality of the training data.
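The three-stage pipeline described above has a simple dataflow, sketched below as a runnable skeleton. The stage functions are hypothetical stand-ins, not the paper's code; each stub only records what the real stage would consume and produce.

```python
# Hypothetical stage functions; names are illustrative, not from the paper.
def train_labeling_model(labeled_pairs):
    """Stage 1: supervised baseline used only to generate pseudo-labels."""
    return {"kind": "labeler", "n_train": len(labeled_pairs)}

def pseudo_label(model, unlabeled_audio):
    """Stage 2: transcribe unlabeled audio with the baseline model."""
    return [(clip, f"<hypothesis for {clip}>") for clip in unlabeled_audio]

def continued_pretrain_then_finetune(pseudo, labeled_pairs):
    """Stage 3: CPT on pseudo-labeled audio, then supervised finetuning
    on the original labeled set."""
    return {"cpt_samples": len(pseudo), "ft_samples": len(labeled_pairs)}

labeled = [("clip_a.wav", "habari"), ("clip_b.wav", "asante")]
unlabeled = ["clip_c.wav", "clip_d.wav", "clip_e.wav"]

labeler = train_labeling_model(labeled)
pseudo = pseudo_label(labeler, unlabeled)
final = continued_pretrain_then_finetune(pseudo, labeled)
```

The key design point the review highlights is that Stage 1's quality gates everything downstream: the pseudo-labels are the only supervision the CPT stage sees.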
The experimental evaluation is thorough, with clear comparisons between models trained with and without CPT. The authors provide detailed results across different configurations, showcasing significant improvements in word error rate (WER). The benchmarks established are concrete and relevant, demonstrating a clear advancement over previous systems. The results are well-supported by the experimental design, which isolates the effects of CPT.
The paper provides sufficient details regarding the experimental setup, including hyperparameters, dataset descriptions, and training procedures. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. While the methodology is replicable, the absence of shared resources may hinder broader adoption.
One limitation is the reliance on the quality of the pseudo-labels generated by the baseline model. If the labeling model does not perform well, it could adversely affect the continued pretraining process. Additionally, while the study focuses on Swahili, the generalizability of the findings to other low-resource languages remains to be validated.
The implications of this research are significant, particularly for the over 100 million Swahili speakers who could benefit from improved ASR technology. The methodology could pave the way for advancements in educational technology, accessibility tools, and the preservation of oral traditions in various African languages. The findings underscore the feasibility of developing high-quality ASR systems in low-resource settings, which could have a transformative effect on technology access in these communities.
Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker's gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University
The main contribution of this work is the development of a novel scoring protocol and augmentation technique that significantly enhances the speaker verification capabilities of speech-aware LLMs. This research addresses a critical gap in the ability of LLMs to process and discriminate speaker identity, which is essential for advancing applications in voice recognition and personalized AI systems.
The paper introduces a model-agnostic scoring protocol for evaluating speaker verification capabilities in speech-aware LLMs, which is a significant advancement in the field. The methodology is sound, utilizing both confidence scoring and log-likelihood ratios to derive continuous verification scores. The augmentation of LLMs with ECAPA-TDNN speaker embeddings through LoRA adapters is innovative, allowing for the integration of speaker verification capabilities while maintaining the natural language interface of LLMs. The approach is well-structured and addresses a critical gap in existing LLM capabilities regarding speaker identity discrimination.
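The log-likelihood-ratio variant of the scoring protocol reduces to a one-line computation over the Yes/No token probabilities. The sketch below is a minimal reading of that idea under the assumption that both probabilities are available (i.e., the open-weight case); the function name is illustrative.

```python
import math

def verification_score(p_yes, p_no):
    """Continuous verification score from the model's Yes/No token
    probabilities: log-likelihood ratio of 'same speaker' vs.
    'different speaker'. Renormalizes in case probability mass
    leaked onto other tokens."""
    total = p_yes + p_no
    return math.log(p_yes / total) - math.log(p_no / total)

same = verification_score(0.9, 0.1)   # strongly positive score
diff = verification_score(0.2, 0.8)   # negative score
```

Because the score is continuous rather than a hard Yes/No decision, standard ASV metrics such as EER can be computed by sweeping a threshold over it, which is what makes the protocol comparable across API-only and open-weight models.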
The experiments are comprehensive, benchmarking various speech-aware LLMs against the VoxCeleb1 dataset. The results demonstrate the weak performance of off-the-shelf models in speaker discrimination, with EERs exceeding 20%. The introduction of the ECAPA-LLM shows a significant improvement, achieving an EER of 1.03%, which is commendable. The evaluation metrics are appropriate, and the use of multiple trials and datasets enhances the robustness of the findings. However, the paper could benefit from more extensive comparisons with state-of-the-art ASV systems.
The paper provides sufficient details regarding the datasets used, the training procedures, and the evaluation metrics, which facilitates reproducibility. However, the absence of a publicly accessible code repository limits the ability for others to replicate the results fully. Including a GitHub link or similar would enhance reproducibility significantly.
The paper acknowledges limitations in the scoring methods, particularly the coarse nature of confidence-based scoring for closed systems and the high failure rates observed in some models. Additionally, the performance of larger models like Ministral3-3B was unexpectedly poor, suggesting potential issues with embedding space alignment that require further investigation. The reliance on specific datasets may also limit the generalizability of the findings.
The implications of this research are substantial, as it paves the way for integrating speaker verification capabilities into general-purpose LLMs. This could enhance applications in biometric authentication, personalized assistants, and dialogue analysis. The findings suggest a promising direction for future research in multimodal AI systems, where both linguistic and speaker identity information can be processed jointly.
In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).
Primary: Xiamen University
All Institutions: Xiamen University, Hong Kong SAR, Nanjing University, School of Artificial Intelligence, School of Electronic Science and Engineering, School of Informatics, School of Intelligence Science and Technology, Shenzhen Loop Area Institute, The Chinese University of Hong Kong
The main contribution of this paper is the introduction of AlphaFlowTSE, a novel one-step generative model for target speaker extraction that significantly improves extraction quality and generalization while maintaining low latency. This work represents a meaningful advancement in the field of audio processing, particularly in enhancing the fidelity and efficiency of speaker extraction systems.
The methodology proposed in AlphaFlowTSE is innovative, as it introduces a one-step generative model for target speaker extraction that utilizes a Jacobian-vector product (JVP)-free AlphaFlow objective. The combination of trajectory matching and interval-consistency teacher-student supervision is a significant advancement in the field, addressing the challenges of latency and accuracy in real-world applications. The use of mean-velocity transport in the complex STFT domain is particularly noteworthy, as it aligns training with inference, making the model more efficient.
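On a generic flow-matching reading (not necessarily the paper's exact notation), the mixture-to-target trajectory and its one-step shortcut can be written compactly; here \(y\) denotes the observed mixture, \(x\) the target speech, and \(u_\theta\) the learned mean-velocity field:

```latex
\[
x_t = (1 - t)\, y + t\, x, \qquad \dot{x}_t = x - y,
\]
\[
\hat{x} = y + u_\theta(y,\, 0,\, 1).
\]
```

Because the trajectory starts at the mixture itself, inference needs no auxiliary mixing-ratio prediction: a single evaluation of the mean velocity over the whole interval \([0, 1]\) transports \(y\) to the target estimate, which is the source of the model's one-step latency advantage.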
The experiments conducted on the Libri2Mix and REAL-T datasets are comprehensive and demonstrate the effectiveness of AlphaFlowTSE. The results show improvements in target-speaker similarity and generalization to real conversational mixtures, with strong performance metrics such as PESQ, ESTOI, and SI-SDR. The ablation studies regarding the MR predictor further validate the robustness of the proposed model.
The paper provides detailed implementation details, including training protocols, model architecture, and evaluation metrics, which contribute to reproducibility. However, the lack of a public repository or demo URL limits the ease of access for other researchers to replicate the results.
One limitation is the reliance on the enrollment utterance, which may not always be available in practical scenarios. Additionally, while the model shows strong performance in controlled environments, its effectiveness in highly variable real-world conditions remains to be fully validated.
The advancements in target speaker extraction have significant implications for applications in personal communication systems, such as virtual assistants and conference call technologies. The ability to accurately extract a target speaker's voice in real-time can enhance user experiences in various audio processing tasks, including automatic speech recognition and noise suppression.
We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks:
- FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialects and accents benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR.
- FireRedVAD: An ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.
- FireRedLID: An Encoder-Decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain.
- FireRedPunc: A BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%).
To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.
Primary: Super Intelligence Team
All Institutions: Super Intelligence Team
The paper introduces FireRedASR2S, a state-of-the-art all-in-one ASR system that integrates multiple modules to enhance speech recognition accuracy and robustness across various languages and dialects. The technical contributions are substantial, particularly in the context of modular design and comprehensive evaluation, making it a valuable addition to the field of automatic speech recognition.
The paper presents a comprehensive and modular architecture for an automatic speech recognition system that integrates multiple components (ASR, VAD, LID, Punc) into a unified pipeline. The methodology is well-structured, leveraging state-of-the-art techniques such as the Encoder-Decoder architecture for LID and BERT-style models for punctuation prediction. The use of a large and diverse training corpus enhances the model's generalization capabilities across various dialects and languages, which is a significant methodological strength.
The experimental evaluation is robust, with extensive benchmarking against multiple public datasets, demonstrating superior performance across all components. The results are clearly presented, showing improvements in character error rates and other relevant metrics compared to existing systems. The use of human-annotated data for training VAD is particularly noteworthy, as it enhances the reliability of segmentation in real-world applications.
The authors have committed to open-sourcing their model weights and code, which is a positive step towards reproducibility. However, the paper could benefit from more detailed descriptions of the training processes and hyperparameter settings to facilitate replication of results by other researchers.
While the system shows impressive performance, it may still face challenges in extremely noisy environments or with highly accented speech that diverges from the training data. Additionally, the reliance on large-scale training data may limit accessibility for smaller institutions or researchers with fewer resources.
The FireRedASR2S system has significant implications for various applications, including real-time transcription services, multilingual communication tools, and accessibility technologies. Its modular design allows for flexible deployment in diverse settings, potentially advancing the state of the art in speech recognition technology.
We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs and hierarchical objectives.
Primary: Nanjing University
All Institutions: Central Media Technology Institute, Nanjing University, Shanghai Jiao Tong University, Shenzhen Research Institute of Big Data, ETH Zürich
G-STAR presents a novel approach to timestamped speaker-attributed ASR, significantly advancing the field of audio processing by integrating speaker tracking with LLMs. The methodology and potential applications underscore its relevance and impact in improving speech recognition systems.
The methodology proposed in G-STAR is innovative, combining a time-aware speaker-tracking module with a Speech-LLM transcription backbone. This dual approach addresses the limitations of existing systems that either focus on local diarization or global labeling, thus enhancing the ability to maintain speaker identity consistency across long-form, multi-party speech. The flexibility of supporting both component-wise optimization and joint end-to-end training is a significant strength, allowing for adaptability in various training scenarios. However, the paper could benefit from a more detailed explanation of the cue fusion process and how it integrates with the LLM.
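One plausible form of the cue conditioning the review asks to see spelled out is simply rendering the tracker's time-grounded speaker spans as structured text prepended to the LLM's input. The sketch below is an assumption about that interface, not G-STAR's actual format; field names are illustrative.

```python
def format_cues(cues):
    """Render tracker output (speaker id plus time span in seconds) as
    structured text for the transcription LLM to condition on."""
    lines = [f"[{c['start']:.2f}-{c['end']:.2f}] speaker={c['spk']}"
             for c in cues]
    return "Speaker cues:\n" + "\n".join(lines)

# Overlapping spans (S2 starts before S1 ends) survive this encoding,
# which matters for the overlapped-speech setting the paper targets.
cues = [{"spk": "S1", "start": 0.00, "end": 2.35},
        {"spk": "S2", "start": 2.10, "end": 4.80}]
prompt = format_cues(cues)
```

Keeping the cues as explicit text (rather than fusing them into hidden states) is one way to let the same backbone train either component-wise or end-to-end, matching the flexibility the paper claims.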
The experiments conducted are comprehensive, analyzing various aspects such as cue fusion, local versus long-context trade-offs, and hierarchical objectives. However, the paper lacks a detailed description of the datasets used, which is crucial for evaluating the robustness and generalizability of the proposed system. The absence of comparisons with state-of-the-art methods also limits the clarity of G-STAR's performance improvements.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Clearer guidelines on how to replicate the experiments, including hyperparameters and training procedures, would enhance the paper's credibility and utility for the research community.
One of the primary limitations is the potential complexity of the system, which may hinder real-time applications. Additionally, the paper does not address how well the system performs in highly noisy environments or with overlapping speech from multiple speakers, which are common challenges in practical scenarios.
The implications of G-STAR are significant, particularly in applications such as virtual meetings, automated transcription services, and assistive technologies for the hearing impaired. By improving speaker attribution and temporal accuracy in transcripts, this research could enhance communication accessibility and efficiency in multi-party interactions.
Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to separate from waveforms alone. In such cases, disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data, e.g., points of interest (POI), provides location-tied environmental priors that can help reduce this ambiguity. A systematic study of this direction is enabled through the proposed geospatial audio tagging (Geo-AT) task, which conditions multi-label sound event tagging on GSC alongside audio. To benchmark Geo-AT, Geo-ATBench is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of audio across 28 event categories; each clip is paired with a GSC representation from 11 semantic context categories. GeoFusion-AT is proposed as a unified geo-audio fusion framework that evaluates feature-, representation-, and decision-level fusion on representative audio backbones, with audio- and GSC-only baselines. Results show that incorporating GSC improves AT performance, especially on acoustically confounded labels, indicating geospatial semantics provide effective priors beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples shows that there is no significant difference in performance between models on Geo-ATBench labels and aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. The Geo-AT task, benchmark Geo-ATBench, and reproducible geo-audio fusion framework GeoFusion-AT provide a foundation for studying AT with geospatial semantic context within the CASA community. Dataset, code, models are on homepage (https://github.com/WuYanru2002/Geo-ATBench).
Primary: University of Oxford
All Institutions: University of Oxford, Xi'an Jiaotong-Liverpool University, KTH Royal Institute of Technology, Ghent University
The paper presents a novel approach to multi-label audio tagging by integrating geospatial semantic context, significantly enhancing the understanding of environmental sounds in complex auditory scenes. The comprehensive methodology and rigorous experimental evaluation contribute to the advancement of the field, establishing a foundation for future research in audio tagging and multimodal learning.
The paper introduces the Geo-AT task, which integrates geospatial semantic context (GSC) with audio for multi-label audio tagging (AT). The methodology is well-structured, presenting a clear framework (GeoFusion-AT) that evaluates various fusion strategies (feature-, representation-, and decision-level). The systematic approach to dataset creation (Geo-ATBench) and the inclusion of human-aligned evaluations enhance the robustness of the proposed methods.
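Of the three fusion levels, decision-level fusion is the simplest to illustrate: each branch scores every label independently and the per-label probabilities are blended. The sketch below is a generic instance of that idea, not GeoFusion-AT's implementation; the logits and the equal weighting are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decision_fusion(audio_logits, gsc_logits, w=0.5):
    """Decision-level fusion for multi-label tagging: blend per-label
    probabilities from the audio branch and the geospatial-semantic-
    context (GSC) branch, with weight w on audio."""
    return [w * sigmoid(a) + (1 - w) * sigmoid(g)
            for a, g in zip(audio_logits, gsc_logits)]

# Label 0 is acoustically ambiguous (audio logit near zero) but the
# location prior is confident; label 1 is rejected by both branches.
probs = decision_fusion([0.1, -2.0], [3.0, -3.0])
```

This is exactly the mechanism by which a location-tied prior can break ties on acoustically confounded labels: the audio branch alone sits at chance, while the fused probability moves decisively.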
The experiments are comprehensive, utilizing multiple audio backbones and evaluating the performance of models under different fusion strategies. The results demonstrate significant improvements in performance when incorporating GSC, particularly for acoustically similar events. The use of statistical tests to validate improvements adds rigor to the findings.
The authors provide access to the dataset, code, and models, which supports reproducibility. The detailed descriptions of the experimental setup, including data collection and annotation processes, contribute to the transparency of the research.
The study relies on a relatively small sample size for the human evaluation (10 participants), which may limit the generalizability of the findings. Additionally, the dataset is constrained to specific geographic contexts, which might affect the applicability of the results in diverse environments.
The proposed framework and benchmark have the potential to advance research in computational auditory scene analysis (CASA) and multimodal learning by providing a standardized task that incorporates geospatial context. This could lead to improvements in applications such as urban sound monitoring, smart city technologies, and assistive listening devices.
The Mean Opinion Score (MOS) serves as the standard metric for speech quality assessment, yet biases in human annotations remain underexplored. We conduct the first systematic analysis of gender bias in MOS, revealing that male listeners consistently assign higher scores than female listeners, a gap that is most pronounced in low-quality speech and gradually diminishes as quality improves. This quality-dependent structure proves difficult to eliminate through simple calibration. We further demonstrate that automated MOS models trained on aggregated labels exhibit predictions skewed toward male standards of perception. To address this, we propose a gender-aware model that learns gender-specific scoring patterns through abstracting binary group embeddings, thereby improving overall and gender-specific prediction accuracy. This study establishes that gender bias in MOS constitutes a systematic, learnable pattern demanding attention in equitable speech evaluation.
Primary: National Institute of Information and Communications Technology
All Institutions: National Institute of Information and Communications Technology, Nagoya University, National Taiwan University
The main contribution of this paper is the identification and analysis of gender bias in MOS ratings, along with the introduction of a gender-aware model that addresses these biases. This work is significant as it not only uncovers a previously underexplored issue in speech quality assessment but also proposes a novel solution that could enhance fairness in automated evaluations.
The paper presents a systematic analysis of gender bias in Mean Opinion Scores (MOS) for speech quality assessment. It employs a novel approach by abstracting binary group embeddings to create a gender-aware model that learns gender-specific scoring patterns. This methodology is well-structured and addresses a significant gap in the literature regarding the biases in human annotations. The use of gender-specific embeddings is innovative and adds depth to the existing methodologies in speech quality assessment.
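In its simplest form, a group-conditioned predictor of this kind adds a learned per-group component to a shared score. The toy sketch below collapses the paper's group embeddings into a scalar offset to show the mechanism; all names and values are illustrative, not the paper's model.

```python
def predict_mos(features, gender, base_w, gender_bias):
    """Toy gender-aware MOS predictor: a shared linear score plus a
    learned per-group offset, standing in for the paper's binary
    group embeddings."""
    shared = sum(w * f for w, f in zip(base_w, features))
    return shared + gender_bias[gender]

base_w = [0.8, 0.2]                              # shared weights
gender_bias = {"male": 0.15, "female": -0.15}    # illustrative offsets
m = predict_mos([3.0, 4.0], "male", base_w, gender_bias)
f = predict_mos([3.0, 4.0], "female", base_w, gender_bias)
```

Conditioning on the listener group lets the model fit both rating distributions instead of regressing to an aggregate that, as the paper shows, skews toward male perception; a richer model would also let the offset depend on quality, matching the quality-dependent gap the analysis reports.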
The experiments conducted are thorough, revealing a clear disparity in MOS ratings between male and female listeners, particularly in low-quality speech. The authors provide a compelling analysis of how automated MOS models inherit these biases. However, the paper could benefit from more extensive datasets and a broader range of speech quality scenarios to validate the findings further. The results indicate that the proposed model improves prediction accuracy, which is a strong point.
The paper lacks detailed implementation specifics that would allow for full reproducibility of the experiments. While the methodology is sound, the absence of code or a clear description of the experimental setup limits the ability of other researchers to replicate the findings. Including a supplementary material section or a link to a code repository would enhance reproducibility.
One limitation is the focus on gender bias without considering other potential biases (e.g., age, ethnicity) that could also impact MOS ratings. Additionally, the study's reliance on aggregated labels for training automated models may overlook individual listener variability. The authors acknowledge the difficulty in eliminating bias through calibration, suggesting a need for further research in this area.
This research has significant implications for the field of speech quality assessment and machine learning. By highlighting the systematic nature of gender bias in MOS, it calls for more equitable evaluation practices in audio processing systems. The findings could influence future research directions and the development of fairer algorithms in speech technology, ultimately contributing to more inclusive applications.
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.
Primary: City University of Hong Kong
All Institutions: City University of Hong Kong, Applied AI Institute, Central University, Trusted AI Research Center
The main contribution of this paper is the introduction of PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models, which addresses critical gaps in existing methodologies. This work is significant as it provides a systematic approach to robustness verification, crucial for the safe deployment of voice anti-spoofing technologies in real-world applications.
The proposed PV-VASM framework presents a novel approach to verifying the robustness of voice anti-spoofing models through a probabilistic framework. It effectively estimates misclassification probabilities under various transformations, including those from generative models. The methodology is well-structured, utilizing probabilistic concentration inequalities and providing a theoretical upper bound on error probabilities. The model-agnostic nature of PV-VASM enhances its applicability across different voice anti-spoofing models, which is a significant advancement in the field.
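The exact bound in the paper is not reproduced here, but the general recipe — Monte Carlo estimation of the misclassification probability over sampled transformations, tightened into a high-confidence upper bound via a concentration inequality — can be sketched with a Hoeffding-style bound (the classifier and transformation below are toy stand-ins, not the paper's models):

```python
import math
import random

def misclassification_bound(classifier, sample_transform, x, n=1000, delta=0.05, seed=0):
    """Estimate P(classifier's decision flips on a transformed x) from n
    sampled transformations, and return a one-sided Hoeffding upper bound
    that holds with probability at least 1 - delta."""
    rng = random.Random(seed)
    errors = sum(
        classifier(sample_transform(x, rng)) != classifier(x) for _ in range(n)
    )
    p_hat = errors / n
    return p_hat, p_hat + math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Toy stand-ins: a threshold "detector" score and additive-noise "transformation".
classifier = lambda v: v > 0.5
sample_transform = lambda v, rng: v + rng.gauss(0.0, 0.1)

p_hat, upper = misclassification_bound(classifier, sample_transform, x=0.7)
```

The model-agnostic character noted above comes through even in this sketch: the routine only queries the classifier as a black box, so the same certificate applies to any anti-spoofing model.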
The experimental validation of PV-VASM is comprehensive, covering a variety of transformations and generative models. The authors provide detailed results on the performance of their method against both parametric transformations and synthetic speech generation, showcasing its effectiveness. The use of multiple datasets and the inclusion of real-world conditions strengthen the credibility of the findings.
While the paper includes a thorough description of the methodology and experiments, it lacks specific implementation details or links to code repositories, which may hinder reproducibility. The absence of a demo or project URL further limits the ability for others to validate the findings independently.
The paper acknowledges that the robustness of voice anti-spoofing models significantly varies depending on the type of perturbation and the parameter space. The proposed upper bounds may be overly conservative, leading to less practical applicability in certain scenarios. Additionally, the complexity of verifying robustness against generative models presents challenges that are not fully addressed.
The implications of this research are substantial, particularly in enhancing the security of voice recognition systems against deepfake technologies. The ability to certify the robustness of voice anti-spoofing models has significant applications in various domains, including finance, security, and personal privacy.
Although the deep integration of Automatic Speech Recognition (ASR) systems with Large Language Models (LLMs) has significantly improved accuracy, deploying such systems in low-latency streaming scenarios remains challenging. In this paper, we propose Uni-ASR, a unified LLM-based framework that integrates both non-streaming and streaming speech recognition capabilities. We design a joint training paradigm that enables the system to seamlessly transition between the two recognition modes without any architectural modifications. Furthermore, we introduce a context-aware training paradigm and a co-designed fallback decoding strategy, which enhance streaming recognition accuracy without introducing additional latency. Experimental results demonstrate that Uni-ASR not only achieves competitive performance in non-streaming mode but also remains highly effective in streaming scenarios under diverse latency constraints.
Primary: Tongyi AI Lab
All Institutions: Tongyi AI Lab
The main contribution of this paper is the introduction of Uni-ASR, a unified architecture that effectively integrates non-streaming and streaming ASR capabilities, achieving competitive performance across both modes. This work represents a significant advancement in the field of automatic speech recognition, particularly in addressing the challenges of low-latency applications while maintaining high accuracy.
The methodology presented in Uni-ASR is robust, integrating both non-streaming and streaming capabilities within a single architecture. The joint training paradigm and context-aware training approach are innovative, allowing for seamless transitions between modes without architectural changes. The fallback decoding strategy is particularly noteworthy as it addresses the latency issues inherent in streaming ASR, enhancing performance while maintaining low latency. The use of established architectures like Conformer and LLMs adds credibility to the approach, although the paper could benefit from more detailed descriptions of the training dynamics and hyperparameter settings.
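For context on the latency claims, streaming ASR systems are commonly compared by their real-time factor (RTF), the ratio of processing time to audio duration; a decoder keeps pace with live audio only when RTF stays below 1. A minimal helper (standard practice, not taken from the paper):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded.
    RTF < 1 means the system can keep pace with a live stream."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds

# A decoder that spends 1.2 s on a 6 s utterance runs at RTF 0.2,
# leaving headroom for network and buffering latency.
rtf = real_time_factor(1.2, 6.0)
```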
The experimental evaluation is comprehensive, utilizing multiple widely recognized benchmarks such as AISHELL, LibriSpeech, and WeNetSpeech. The results demonstrate competitive performance in both non-streaming and streaming scenarios, outperforming several state-of-the-art models. The ablation studies effectively highlight the contributions of the proposed methodologies, particularly the impact of the latest-token fallback decoding strategy and the context-aware training paradigm on streaming performance. However, the paper could enhance its experimental rigor by including more diverse datasets and languages beyond the Chinese-English bilingual corpus.
The paper provides a reasonable level of detail regarding the architecture and training process, which aids reproducibility. However, the lack of specific URLs for code or datasets limits the ability for others to replicate the study fully. Including a link to a GitHub repository or supplementary materials would significantly improve reproducibility.
While the paper presents a strong framework, it is limited by its focus on a specific bilingual corpus, which may not generalize well to other languages or dialects. Additionally, the reliance on a single model (Qwen3-1.7B) for the decoder may restrict the exploration of how different LLMs could impact performance. The paper also does not sufficiently address potential computational costs associated with the unified architecture during deployment in real-world applications.
The implications of this research are significant, as it addresses a critical need for efficient ASR systems that can operate in real-time environments. The integration of LLMs into ASR frameworks could enhance applications in various fields, including accessibility technologies, real-time translation, and interactive voice response systems. By improving the accuracy and efficiency of ASR, this work could facilitate better communication in multilingual contexts and enhance user experiences in voice-activated technologies.
This paper introduces V2A-DPO, a novel Direct Preference Optimization (DPO) framework tailored for flow-based video-to-audio (V2A) generation models, with key adaptations to effectively align generated audio with human preferences. Our approach comprises three core innovations: (1) AudioScore, a comprehensive human preference-aligned scoring system for assessing the semantic consistency, temporal alignment, and perceptual quality of synthesized audio; (2) an automated AudioScore-driven pipeline for generating large-scale preference-pair data for DPO optimization; and (3) a curriculum learning-empowered DPO optimization strategy specifically tailored for flow-based generative models. Experiments on the VGGSound benchmark demonstrate that human-preference-aligned Frieren and MMAudio trained with V2A-DPO outperform their counterparts optimized with Denoising Diffusion Policy Optimization (DDPO) as well as pre-trained baselines. Furthermore, our DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics, surpassing published V2A models.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, National Research Council Canada, Shanghai Jiao Tong University, The University of Warwick
The paper presents V2A-DPO, a novel framework for optimizing video-to-audio generation that aligns audio outputs with human preferences through innovative scoring and optimization strategies. This research significantly contributes to the field by addressing critical limitations in existing models and demonstrating state-of-the-art performance through rigorous experimentation.
The proposed V2A-DPO framework innovatively combines Direct Preference Optimization with a comprehensive scoring system (AudioScore) and a curriculum learning strategy, effectively addressing the limitations of existing video-to-audio generation models. The integration of human preference alignment into the optimization process is particularly noteworthy, as it enhances the perceptual quality and aesthetic appeal of generated audio. The methodology is well-structured, with clear definitions of the scoring metrics and a robust pipeline for generating preference pairs, which is crucial for training the models effectively.
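The underlying DPO objective the framework adapts is standard: a pairwise preference loss computed against a frozen reference model. A minimal scalar form, without the paper's flow-specific and curriculum adaptations, looks like this:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * margin), where the margin is
    how much more the policy prefers the winner over the loser, measured
    relative to a frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With no preference margin the loss is log(2); as the policy favours the
# preferred sample more than the reference does, the loss decreases.
loss_weak = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # margin 0
loss_strong = dpo_loss(-8.0, -12.0, -10.0, -10.0)  # margin 4
```

In the V2A setting, the winner/loser pairs would come from the AudioScore-driven pipeline described above; the loss itself is agnostic to how the pairs were produced.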
The experiments conducted on the VGGSound dataset are thorough and demonstrate the effectiveness of the proposed method against state-of-the-art models. The results show significant improvements in multiple evaluation metrics, validating the authors' claims. The comparison against both DDPO and pre-trained baselines provides a solid foundation for the reported advancements. However, the paper could benefit from more extensive ablation studies to further dissect the contributions of each component in the proposed framework.
The paper provides sufficient implementation details, including the training setup, hyperparameters, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease of replication for other researchers. Providing a link to the code would greatly improve the reproducibility of the results.
One limitation is the reliance on a specific dataset (VGGSound), which may limit the generalizability of the findings to other domains or datasets. Additionally, while the AudioScore scoring system is comprehensive, its effectiveness in diverse contexts beyond the training data remains to be validated. The paper also does not address potential biases in the human preference annotations, which could affect the training process.
The advancements in video-to-audio generation have significant implications for multimedia content creation, accessibility, and entertainment industries. By aligning audio generation with human preferences, the proposed framework could enhance user experiences in applications such as film, gaming, and virtual reality. Furthermore, the methodologies developed could inspire future research in multimodal machine learning and preference learning, paving the way for more sophisticated generative models.
Recent advancements in Speech Large Language Models have significantly enhanced multi-dimensional speech understanding. However, the majority of high-performance frameworks are predominantly optimized for GPU-centric ecosystems and proprietary backbones, creating a significant gap for deployment on non-CUDA computing infrastructures. In this paper, we present OSUM-Pangu, a fully open-source speech understanding foundation model developed on a completely non-CUDA software and hardware stack. By integrating an audio encoder with the openPangu-7B LLM backbone, we successfully implement the entire training and inference pipeline on the Ascend NPU platform. To facilitate efficient task alignment under non-CUDA resource constraints, we adopt a practical training process that sequentially bridges speech perception and user intent recognition. Experimental results demonstrate that OSUM-Pangu achieves task accuracy comparable to mainstream GPU-based models while maintaining robust natural language interaction capabilities. Our work provides a reproducible, non-CUDA baseline for the open-source speech community, promoting the independent evolution of multimodal intelligence.
Primary: Speech and Language Processing Group
All Institutions: Speech and Language Processing Group
The main contribution of this paper is the development of OSUM-Pangu, an open-source multidimensional speech understanding framework optimized for non-CUDA environments. This work significantly advances the field by providing a viable alternative to GPU-centric models, facilitating the evolution of multimodal intelligence in diverse computing infrastructures.
The methodology presented in OSUM-Pangu is robust, integrating an audio encoder with the openPangu-7B LLM backbone to create a non-CUDA framework for speech understanding. The multi-stage training pipeline is well-structured, allowing for efficient task alignment and robust instruction following. The use of a modality adapter to bridge acoustic and linguistic processing is innovative, although the reliance on fixed task tags in the initial training stages may limit flexibility.
The experiments are comprehensive, utilizing a variety of datasets and demonstrating competitive performance against mainstream GPU-based models. The Instruction Following Rate (IFR) metric is a valuable addition, providing insights into the model's ability to interpret natural language instructions. However, the paper could benefit from more detailed comparisons with existing models, particularly in terms of real-world applicability.
The implementation details are adequately described, and the use of open-source components enhances reproducibility. However, the lack of a publicly available demo or clear instructions for reproducing the experiments may hinder broader adoption.
One limitation is the potential dependency on the Ascend NPU infrastructure, which may not be accessible to all researchers. Additionally, while the model shows strong performance, it may not generalize well to all speech understanding tasks, particularly those requiring extensive multimodal training.
OSUM-Pangu has significant implications for the development of speech understanding systems in non-CUDA environments, promoting accessibility and encouraging further research in open-source frameworks. Its approach may inspire future work in multimodal AI, particularly in settings where GPU resources are limited.
Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.
Primary: Central University
All Institutions: Central University, Applied AI Institute, City University of Hong Kong, Trusted AI Research Center
The paper presents a robust framework for verifying the resilience of voice anti-spoofing models against emerging generative threats. Its comprehensive methodology and experimental validation position it as a valuable contribution to the field, although improvements in reproducibility and clarity would enhance its impact.
The proposed PV-VASM framework introduces a novel probabilistic approach to verify the robustness of voice anti-spoofing models against various transformations, including text-to-speech and voice cloning. The methodology is model-agnostic and leverages probabilistic concentration inequalities to derive a theoretical upper bound on misclassification probabilities. The authors provide a detailed mathematical formulation, which is commendable, but the complexity may hinder understanding for practitioners. The approach is well-grounded in existing literature, addressing a significant gap in the certification of voice anti-spoofing systems.
The experimental validation is extensive, covering a wide range of transformations and generative models. The authors utilize a combination of datasets, including ASVspoof and other open-source collections, to evaluate the robustness of their framework. The results demonstrate meaningful robustness certificates and highlight the framework's applicability in real-world scenarios. However, the paper could benefit from clearer visualizations of results and comparisons with existing methods to better illustrate the advantages of PV-VASM.
While the paper provides a comprehensive description of the methodology and experimental setup, it lacks specific implementation details and code availability, which are crucial for reproducibility. The absence of a project URL further complicates efforts to replicate the findings. Providing a GitHub repository or supplementary materials would greatly enhance the reproducibility of the results.
The paper acknowledges certain limitations, such as the potential over-conservativeness of the upper bounds and the varying performance of the framework against different types of perturbations. Additionally, the complexity of the methodology may pose challenges for practical implementation in real-world applications. The authors also note that the robustness against generative models is less effective, indicating a need for further refinement.
The proposed framework has significant implications for the field of audio security, particularly in enhancing the robustness of voice anti-spoofing systems. As generative models become increasingly sophisticated, the ability to certify the robustness of these systems is crucial for preventing misuse. The research contributes to the ongoing discourse on security in machine learning and could influence future developments in anti-spoofing technologies.
Modern generative audio models can be used by an adversary in unlawful ways, specifically to impersonate other people and gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods have begun to evolve. Unfortunately, current SDD methods generally suffer from a lack of generalization to new audio domains and generators. Moreover, they lack interpretability, especially human-like reasoning that would naturally explain the attribution of a given audio clip to the bona fide or spoof class and provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with chain-of-thought reasoning derived from a newly proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for its predictions.
Primary: City University of Hong Kong
All Institutions: City University of Hong Kong, Applied AI Institute, Fusion Brain Lab, Trusted AI Research Center
The paper presents a novel approach to speech deepfake detection by integrating human-inspired reasoning with LALMs, addressing both detection performance and interpretability. This contribution is particularly relevant in the context of increasing concerns over audio deepfakes and their potential misuse, making the research timely and impactful in the field of machine learning and audio processing.
The proposed HIR-SDD framework innovatively integrates Large Audio Language Models (LALMs) with human-inspired reasoning through a novel dataset of human-annotated reasoning traces. This combination aims to enhance both the detection capabilities and interpretability of speech deepfake detection systems. The methodology is well-structured, utilizing hard-label and chain-of-thought (CoT) pipelines, and incorporates reinforcement learning to improve reasoning quality. However, while the approach is sound, the reliance on LALMs may introduce challenges related to generalization and robustness, particularly against high-fidelity deepfake audio that was not present in the training data.
The experimental evaluation is thorough, demonstrating the effectiveness of the HIR-SDD framework against conventional models like Wav2Vec2-AASIST. The paper provides detailed metrics such as accuracy, balanced accuracy, and F1 scores, and compares the performance of different training strategies. However, the results indicate that while the proposed model shows competitive performance, it still struggles with modern high-fidelity synthesis systems. The evaluation of reasoning quality through external models like GPT-5.1 adds credibility, but the overall improvement in reasoning quality remains modest.
The paper outlines the methodology and datasets used, but lacks sufficient detail on the implementation of the models and training procedures to ensure full reproducibility. While it mentions the use of specific models and training parameters, the absence of a publicly available code repository or demo limits the ability for others to replicate the results. The mention of future work to refine evaluation and stability suggests ongoing developments that may not yet be fully realized.
The primary limitations include the model's struggle with generalization to unseen high-fidelity deepfake audio, which is a critical aspect for practical applications. Additionally, the reasoning quality, while improved, does not show significant enhancements over traditional methods, indicating potential areas for further research. The dataset, while novel, may also be limited in scope, as it primarily focuses on English and Russian audio, which could affect the model's applicability to other languages or dialects.
The implications of this research are significant, particularly in areas where audio authenticity is critical, such as security and biometrics. By improving the interpretability of deepfake detection systems, the proposed framework could enhance trust in automated systems that rely on audio verification. The integration of human-inspired reasoning may also pave the way for more transparent AI systems in various domains, fostering greater public confidence in AI technologies.
Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encoder-only MT-ASR framework that adapts an LLM to multi-talker conditioning and distills its semantic guidance into the encoder during training, while retaining fast CTC-style decoding at inference. Our model employs a post-encoder separator with serialized CTC to produce talker-ordered transcripts, and leverages an adapted LLM-based SOT objective as a multi-talker-aware teacher signal to explicitly regularize mixed-speech representations. To further support variable numbers of talkers, we introduce a Talker-Count Head that predicts the talker count and dynamically selects the appropriate decoding branch. Experiments on LibriMix show that the proposed encoder-only model achieves performance comparable to LLM-based systems in the two-talker condition, while delivering significant improvements in the three-talker condition at a substantially smaller real-time factor (RTF).
Primary: SB Intuitions
All Institutions: SB Intuitions
This paper presents a novel encoder-only framework for multi-talker ASR that distills semantic guidance from LLMs during training, achieving competitive performance while maintaining inference efficiency. The integration of a Talker-Count Head and the focus on serialized CTC decoding represent meaningful advancements in the field, addressing key challenges in multi-talker speech recognition.
The proposed methodology effectively shifts the role of large language models (LLMs) from a computationally expensive decoder to a teacher for training an encoder-only multi-talker ASR system. The integration of a Talker-Count Head (TCH) to dynamically adapt to varying numbers of talkers is a notable innovation that addresses a common limitation in existing systems. The use of serialized CTC for efficient inference while maintaining semantic guidance through distillation is well-conceived, though the paper could benefit from clearer descriptions of the training process and hyperparameter choices.
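Both the efficiency claim and the serialized output format hinge on CTC-style decoding: repeated labels are merged and blanks removed, and in the serialized multi-talker setting a change token separates talker transcripts. A toy greedy collapse (the token names are illustrative; the paper's exact serialization is not specified here):

```python
BLANK = "<b>"
CHANGE = "<cc>"  # hypothetical speaker-change token in the serialized output

def ctc_collapse(frames):
    """Greedy CTC rule: merge consecutive repeated labels, then drop blanks."""
    out, prev = [], None
    for tok in frames:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out

def split_by_talker(tokens):
    """Split a serialized, talker-ordered transcript at each change token."""
    talkers, cur = [], []
    for tok in tokens:
        if tok == CHANGE:
            talkers.append(cur)
            cur = []
        else:
            cur.append(tok)
    talkers.append(cur)
    return talkers

# Frame-level outputs for a two-talker mixture, serialized by talker.
frames = ["h", "h", BLANK, "i", CHANGE, BLANK, "y", "o", "o", BLANK, "o"]
tokens = ctc_collapse(frames)
```

Because this collapse is a single non-autoregressive pass over the frames, it avoids the token-by-token LLM decoding cost, which is the source of the RTF advantage discussed above.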
The experiments conducted on the LibriMix dataset provide a solid foundation for evaluating the proposed model's performance. The results demonstrate that the encoder-only model achieves comparable performance to LLM-based systems in two-talker scenarios and outperforms them in three-talker conditions, showcasing the effectiveness of the proposed approach. However, the paper lacks detailed comparisons with more recent state-of-the-art methods, which could strengthen its claims.
While the paper outlines the model architecture and training phases, it does not provide sufficient implementation details or code availability, which may hinder reproducibility. Including a link to a code repository or supplementary materials would enhance the paper's reproducibility.
The reliance on a fixed number of encoder layers and the challenges associated with talker-count accuracy in three-talker conditions are notable limitations. Additionally, the paper does not address potential performance degradation in more complex or noisy environments beyond those tested.
The proposed framework has significant implications for real-world applications in automatic speech recognition, particularly in environments with overlapping speech. By improving the efficiency and accuracy of multi-talker ASR systems, this research could enhance communication technologies, accessibility tools, and voice-activated systems in various domains.
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.
Primary: unknown
All Institutions: unknown
MoXaRt presents a novel approach to audio-visual sound interaction in XR, significantly improving speech intelligibility and cognitive load management. The integration of visual cues with audio processing represents a meaningful advancement in the field, although further work is needed to enhance reproducibility and explore broader applications.
The methodology presented in MoXaRt is innovative, utilizing a cascaded architecture that combines audio-only separation with visual detection to enhance sound interaction in XR environments. The dual approach of coarse separation followed by refinement using visual cues is a novel integration that addresses the challenges of complex acoustic environments. However, the paper could benefit from a more detailed description of the algorithms used in the cascaded architecture and the specific techniques for visual detection and refinement.
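To make the coarse-then-refine structure discussed above concrete, here is a minimal, runnable sketch of a cascaded audio-visual separation pipeline. All names here (`coarse_separate`, `refine_with_anchor`, `VisualAnchor`) are illustrative assumptions, not MoXaRt's actual API, and the toy functions stand in for the paper's neural separation and refinement networks.

```python
# Toy sketch of a cascaded audio-visual separation pipeline, assuming:
# Stage 1 performs coarse, audio-only separation into stem estimates;
# Stage 2 uses a visual anchor (e.g. a detected face or instrument)
# to select and refine one stem. Names and logic are hypothetical.
from dataclasses import dataclass


@dataclass
class VisualAnchor:
    label: str       # e.g. "face" or "guitar", from the visual detector
    stem_index: int  # coarse stem the detector associates with this source


def coarse_separate(mixture, num_stems):
    """Stage 1 (toy): split mixture energy evenly across stems.
    A real system would apply learned masks per stem instead."""
    return [[sample / num_stems for sample in mixture]
            for _ in range(num_stems)]


def refine_with_anchor(stems, anchor, gain=2.0):
    """Stage 2 (toy): boost the anchored stem, standing in for a
    visually guided refinement network conditioned on the anchor."""
    return [sample * gain for sample in stems[anchor.stem_index]]


mixture = [0.5, -0.2, 0.8, 0.1]                 # tiny mono "waveform"
stems = coarse_separate(mixture, num_stems=2)    # coarse audio-only stage
voice = refine_with_anchor(stems, VisualAnchor("face", 0))
```

The key design point the review highlights survives even in this toy form: the coarse audio-only stage can run in parallel with visual detection, and only the cheap refinement step depends on both, which is one plausible way the reported ~2-second latency budget could be met.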
The evaluation is robust, featuring a new dataset specifically designed for the study, which includes 30 one-minute recordings of concurrent speech and music. The user study with 22 participants provides empirical evidence of the system's effectiveness, demonstrating significant improvements in speech intelligibility and cognitive load reduction. The statistical significance of the results (p < 0.01 and p < 0.001) adds credibility to the findings, although further details on participant demographics and experimental controls would strengthen the evaluation.
The paper lacks sufficient implementation details that would allow for full reproducibility of the results. While the architecture is described, specific parameters, training procedures, and the dataset's characteristics are not thoroughly documented. Providing a code repository or supplementary materials would enhance reproducibility.
One limitation is the relatively small dataset size, which may affect the generalizability of the findings. Additionally, the system's performance in more diverse acoustic environments or with different types of sound sources has not been explored. The processing latency of ~2 seconds may also be a concern for real-time applications in XR.
The implications of MoXaRt are significant for the fields of XR and audio processing, as it addresses a critical challenge in creating immersive and socially engaging experiences. By improving speech intelligibility in complex environments, this research could enhance communication in various applications, from virtual meetings to gaming, thereby influencing user experience and interaction design in XR.