We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Sony AI, Georgia Tech, KAIST, Peking University, QMUL
TuneJury presents a novel approach to music generation preference alignment through a pairwise reward model, demonstrating competitive performance with a lean architecture and practical applications in real-world scenarios. The comprehensive evaluation and innovative calibration method position this work as a meaningful contribution to the field of machine learning in audio.
The methodology introduces TuneJury as a pairwise reward model for text-to-music generation, leveraging a small MLP head over frozen audio and text encoders. The choice of a pairwise approach is well-justified, addressing the limitations of absolute scoring systems in subjective domains like music. The model is trained on a diverse set of human-rated pairs, which enhances its robustness and generalizability. The introduction of anchor calibration as a post-hoc adjustment method is a notable innovation that allows for adaptation to new systems without the need for retraining, showcasing a practical approach to real-world application.
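As a rough illustration of the pairwise idea, the sketch below turns two scalar scores into a preference probability and a per-vote loss, then uses the same scores for best-of-N selection. It assumes a Bradley-Terry (logistic) link on the score margin; the source names Bradley-Terry only for anchor calibration, so the training loss here is an assumption, and all function names are hypothetical.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Assumed Bradley-Terry link: P(A preferred over B) from the score margin."""
    margin = score_a - score_b
    # Numerically stable logistic function.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_m = math.exp(margin)
    return exp_m / (1.0 + exp_m)


def pairwise_loss(score_a: float, score_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of a single human vote under the assumed link."""
    p_a = preference_probability(score_a, score_b)
    return -math.log(p_a if a_preferred else 1.0 - p_a)


def best_of_n(scores: list[float]) -> int:
    """Index of the highest-scoring candidate (inference-time best-of-N selection)."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores for three candidate clips generated from one prompt.
candidate_scores = [0.1, 0.9, 0.3]
chosen = best_of_n(candidate_scores)
```

Because only the margin between two scores enters the probability, the model never has to commit to an absolute quality scale, which is the property the review credits to the pairwise formulation.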
The experimental evaluation is comprehensive, utilizing multiple datasets and benchmarks to assess the performance of TuneJury. The authors provide detailed comparisons against existing models, demonstrating that TuneJury achieves competitive accuracy with fewer parameters and without relying on pseudo-label augmentation. The results are statistically significant, with clear metrics reported for pairwise accuracy and calibration, as well as downstream applications. The experiments effectively illustrate the model's capabilities across different scenarios, including inference-time selection and latent optimization.
The paper includes sufficient details on the training procedure, architecture, and datasets used, which enhances reproducibility. The authors have made the code, checkpoints, and demo available, which is a strong point for enabling other researchers to replicate their findings. However, some hyperparameter settings and specific implementation details could be more explicitly stated to further aid reproducibility.
The paper acknowledges several limitations, including potential biases in the training data, particularly the lack of representation for vocal music and the calibration signal's dependence on the specific datasets used. The performance drop on post-cutoff splits indicates that the model may not generalize well to newer music generation systems, which could limit its applicability in rapidly evolving contexts.
TuneJury has the potential to significantly impact the field of music generation by providing a more aligned and efficient method for evaluating generated music against human preferences. Its open-source nature encourages community engagement and further research, potentially leading to advancements in multimodal systems that combine text and audio understanding. The implications for music generation tools and applications in creative industries are substantial, as this model could enhance user experience and satisfaction in automated music creation.
Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Meanwhile, Mamba, a recent state-space model, has emerged as a promising alternative to Transformers for sequence modeling. In this work, we present MambAdapter, a parameter-efficient transfer learning approach that integrates Mamba into low-rank bottleneck adapters. Our design combines parameter sharing across adapters with the injection of a lightweight Mamba module, enabling more effective modeling of audio features. We demonstrate that MambAdapter matches or outperforms strong PETL baselines on four audio classification tasks and five speech recognition languages, even when operating under reduced parameter budgets.
Primary: Université de Montréal
All Institutions: Université de Montréal, Imperial College London, Concordia University, Mila - Quebec AI Institute
The main contribution of this work is the introduction of MambAdapter, a novel parameter-efficient transfer learning method that combines Mamba's state-space modeling with low-rank bottleneck adapters, achieving competitive performance on audio and speech tasks while significantly reducing the number of trainable parameters. This paper represents a meaningful advancement in the quest for efficient model adaptation in the rapidly evolving field of audio processing.
The paper introduces MambAdapter, which innovatively integrates Mamba, a state-space model, into low-rank bottleneck adapters for parameter-efficient transfer learning in speech and audio tasks. The methodology is well-grounded in existing literature, leveraging the strengths of Mamba's linear-time modeling capabilities while addressing the inefficiencies of traditional Transformer fine-tuning. The use of shared projections and the lightweight Mamba module is a thoughtful design choice that enhances the model's ability to capture long-range dependencies in audio data.
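To make the adapter structure concrete, here is a minimal sketch of a residual low-rank bottleneck adapter whose single instance is reused at several layers. The sequence module is a fixed-decay linear recurrence used purely as a stand-in: it is not Mamba, which uses input-dependent (selective) state updates. The class name, the zero-initialized up-projection, and all sizes are assumptions for illustration, not details from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SharedBottleneckAdapter:
    """Down-project, mix over time, up-project, then add back to the input."""

    def __init__(self, d_model, rank, decay=0.9, seed=0):
        rng = random.Random(seed)
        self.rank = rank
        self.decay = decay
        # Down-projection: d_model -> rank (the low-rank bottleneck).
        self.down = [[rng.uniform(-0.1, 0.1) for _ in range(d_model)]
                     for _ in range(rank)]
        # Up-projection starts at zero so the adapter is initially an identity map
        # (a common adapter convention, assumed here).
        self.up = [[0.0] * rank for _ in range(d_model)]

    def __call__(self, frames):
        state = [0.0] * self.rank
        outputs = []
        for frame in frames:
            z = matvec(self.down, frame)
            # Stand-in for the Mamba module: a fixed-decay linear recurrence.
            state = [self.decay * s + (1.0 - self.decay) * v
                     for s, v in zip(state, z)]
            delta = matvec(self.up, state)
            outputs.append([x + d for x, d in zip(frame, delta)])
        return outputs


adapter = SharedBottleneckAdapter(d_model=4, rank=2)
frames = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.1, -0.4]]
# Parameter sharing: the same adapter object is applied at two "layers".
hidden = adapter(adapter(frames))
```

Reusing one instance means the trainable parameter count does not grow with the number of layers that use the adapter, which is one way to read the reduced parameter budgets the review mentions.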
The experimental setup is robust, with comprehensive evaluations across multiple audio classification tasks and multilingual speech recognition. The authors provide a clear comparison against established PETL baselines, demonstrating that MambAdapter achieves competitive or superior performance while maintaining a lower parameter budget. The results are statistically validated through averaging over multiple random seeds, which adds credibility to their findings.
The paper includes a link to the code repository, which is essential for reproducibility. However, the paper could benefit from more detailed hyperparameter settings and training configurations to facilitate easier replication of results by other researchers.
While the paper presents promising results, it does not extensively explore the limitations of MambAdapter, such as potential performance degradation in extremely low-resource settings or the impact of varying audio characteristics on model performance. Additionally, the focus on a limited number of datasets may restrict the generalizability of the findings.
The integration of Mamba into PETL frameworks has significant implications for the field of audio and speech processing, particularly in resource-constrained environments. The findings could influence future research directions in efficient model adaptation, potentially leading to advancements in real-time speech recognition and audio classification applications.
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held by the disentanglement-based approaches, causing leakage of private information and the loss of useful information for downstream tasks. To tackle this challenge, we propose a general framework, DDPO-VC, for speaker de-identification through reinforcement learning-based post-training with diffusion models. Learning from reward signals combining knowledge from privacy-focused and utility-focused teachers, our method outperforms various strong de-identification methods in both privacy preservation and cognitive utility on two commonly used dementia speech benchmarks. Code: https://github.com/cactuswiththoughts/DDPO-VC. Demo: https://cactuswiththoughts.github.io/SpeakerDeID-Demo/.
Primary: MIT CSAIL
All Institutions: MIT CSAIL, Boston University
The main contribution of this paper is the introduction of DDPO-VC, a novel framework for speaker de-identification that balances privacy and utility through reinforcement learning and diffusion models. This work represents a significant advancement in the field, addressing critical challenges in the intersection of privacy and cognitive utility in speech processing.
The proposed DDPO-VC framework effectively integrates reinforcement learning with diffusion models to address the dual challenge of privacy and utility in speaker de-identification. The methodology is well-structured, leveraging a conditional diffusion model and a novel reward mechanism that utilizes both privacy and utility teachers. This innovative approach allows for a more nuanced optimization of the privacy-utility tradeoff, which is critical in sensitive applications such as healthcare. The use of reinforcement learning to navigate complex correlations between variables is a significant advancement over traditional disentanglement methods.
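One simple way to picture a reward that combines a privacy teacher and a utility teacher is a weighted difference, followed by batch standardization before a policy-gradient update. This is a sketch under stated assumptions: the source does not give the reward formula, so the linear combination, the `privacy_weight` parameter, and both function names are hypothetical.

```python
import statistics


def combined_reward(speaker_similarity: float,
                    utility_score: float,
                    privacy_weight: float = 1.0) -> float:
    """Hypothetical reward: favor utility, penalize similarity to the source speaker.

    speaker_similarity: privacy teacher output (higher means identity leaks more).
    utility_score: utility teacher output (higher means the cue is preserved).
    """
    return utility_score - privacy_weight * speaker_similarity


def standardize(rewards: list[float]) -> list[float]:
    """Zero-mean, unit-variance rewards, a common step before policy-gradient updates."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-8) for r in rewards]


# Three hypothetical converted utterances: (speaker similarity, utility score).
samples = [(0.8, 0.9), (0.2, 0.9), (0.2, 0.4)]
advantages = standardize([combined_reward(s, u) for s, u in samples])
```

Unlike a disentanglement objective, a scalar reward of this kind does not need the privacy and utility variables to be independent; it only needs both teachers to score the same output.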
The experiments are robust, utilizing two dementia speech benchmarks that are relevant and challenging. The results demonstrate clear superiority over existing methods in both privacy preservation and cognitive utility, with well-defined metrics such as AUC and EER. The comprehensive evaluation across multiple settings (zero-shot and fine-tuned) adds credibility to the findings. However, further details on the datasets and the specific configurations used in experiments would enhance the clarity of the evaluation.
The paper provides a GitHub repository and demo link, which is a positive aspect for reproducibility. However, the implementation details could be more explicit, particularly regarding hyperparameters and training procedures, to ensure that other researchers can replicate the results accurately.
One limitation noted is the potential for reward hacking due to the fixed nature of the privacy teacher. Additionally, the reliance on pretrained models for the privacy and utility teachers may limit the generalizability of the approach to other domains. The paper also acknowledges the need for more diverse evaluation metrics beyond naturalness and speaker similarity, indicating room for improvement in the evaluation framework.
The implications of this research are significant, particularly in fields where privacy is paramount, such as healthcare. By improving speaker de-identification methods, the framework can help protect sensitive information while still allowing for the utility of speech data in applications like dementia diagnosis and monitoring. The potential for broader applications in other audio domains and utility variables further enhances its relevance.
Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and speaker similarity, but this has limited sensitivity and reliability. We propose an evaluation framework with subjective and objective components. Subjectively, we evaluate perceived intelligibility and speaker identity using Best Worst Scaling (BWS) with situational framing. Objectively, we demonstrate that standard measures fail to predict reconstruction success for highly unintelligible speakers, so we introduce a novel dual-reference distributional measure to assess the trade-off between intelligibility and speaker identity. By evaluating the output of 17 zero-shot TTS systems for 193 speakers, we show that our framework provides a reliable and task-aligned approach for assessing voice reconstruction.
Primary: The Centre for Speech Technology Research, University of Edinburgh
All Institutions: The Centre for Speech Technology Research, University of Edinburgh
This paper presents a rigorous and innovative evaluation framework for TTS voice reconstruction, introducing situational framing in subjective evaluation and a novel dual-reference distributional metric that effectively captures the trade-off between intelligibility and speaker identity, addressing critical gaps in current assessment methodologies for assistive speech technologies.
The paper proposes a comprehensive evaluation framework for Text-to-Speech (TTS) voice reconstruction, a task critical for assisting individuals with speech disorders. The methodology addresses two main gaps in current evaluation practices: the limitations of Mean Opinion Score (MOS) in sensitivity and reliability, and the failure of standard objective metrics to correlate with human perception in this specific domain. Subjectively, the authors employ Best Worst Scaling (BWS) with situational framing to isolate intelligibility from speaker identity reconstruction, a nuanced approach that acknowledges the distinct cognitive tasks involved. Objectively, they introduce a novel dual-reference distributional measure (TTSDS Mean) that combines distances to a high-intelligibility generic corpus and the original disordered speaker's reference. This approach attempts to quantify the trade-off between improving intelligibility and preserving speaker identity, addressing the lack of ground truth in this generative task. The methodological rigor is high, particularly in the experimental design of the subjective study and the innovative application of distributional metrics to a domain where they have not been standardly applied.
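For readers unfamiliar with Best Worst Scaling, the sketch below shows the simplest way to turn BWS trials into per-system scores: count how often each system was chosen best minus how often it was chosen worst, normalized by how often it was shown. This counting analysis is a swapped-in simplification for illustration; the trial format and system names are hypothetical, and it is not the statistical model used in the paper.

```python
from collections import Counter


def bws_scores(trials):
    """Best-minus-worst counting scores in [-1, 1].

    trials: iterable of (items_shown, best_item, worst_item) tuples.
    """
    best, worst, shown = Counter(), Counter(), Counter()
    for items, best_item, worst_item in trials:
        shown.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical listener responses over four systems.
trials = [
    (("sysA", "sysB", "sysC"), "sysA", "sysC"),
    (("sysA", "sysB", "sysD"), "sysA", "sysD"),
    (("sysB", "sysC", "sysD"), "sysB", "sysC"),
]
scores = bws_scores(trials)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Each trial forces a relative judgment among a small set, which is why BWS tends to discriminate between systems more sharply than independent absolute ratings such as MOS.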
The experimental setup is robust, involving 17 zero-shot TTS systems evaluated on 193 speakers from the Speech Accessibility Project (SAP) dataset. The dataset covers diverse speech disorders (Parkinson's, Cerebral Palsy, ALS, Down Syndrome), enhancing the generalizability of the findings. The subjective evaluation involved a significant number of listeners (46-47 per condition) and used rigorous statistical modeling (Plackett-Luce). The results clearly demonstrate that standard objective metrics (WER, PER, Speaker Similarity, UTMOS) fail to predict reconstruction success, particularly for highly unintelligible speakers. The proposed TTSDS Mean metric shows strong correlation with subjective reconstruction rankings (rho=0.81 overall), outperforming speaker similarity (rho=0.75). The analysis of system performance reveals that while some systems (IndexTTS2, Qwen3-TTS) perform well on average, they struggle with severely disordered speech, highlighting the complexity of the task. The large-scale evaluation provides a solid empirical basis for the proposed framework.
The paper provides detailed information on the datasets, systems, and evaluation protocols. The use of the public SAP dataset and well-known TTS systems enhances reproducibility. The authors provide a project page with audio examples and listening test instructions, which is crucial for verifying the subjective findings. However, the code for the specific implementation of the TTSDS Mean metric and the exact configuration of the 17 TTS systems (especially if they are open-source but require specific versions) might need clarification. The description of the BWS experimental design is sufficiently detailed for replication. The lack of a public code repository is a minor drawback, but the availability of stimuli and detailed methodology mitigates this.
The study focuses primarily on intelligibility and speaker identity, leaving out other important dimensions such as prosody, accent similarity, and naturalness, which are acknowledged as future work. The subjective evaluation relies on listeners who are not the end-users (people with speech disorders), which may introduce bias or lack of ecological validity. The distributional measure relies on the quality of the reference datasets (LibriTTS and the original disordered speech), and its performance may vary with different TTS architectures or training data distributions. The generalization of the TTSDS Mean metric to other languages or speech disorders not covered in the study is not tested. Additionally, the use of zero-shot TTS systems limits the scope to cloning-based approaches, excluding fine-tuned or specialized reconstruction models.
This work has significant potential impact on the development of assistive communication technologies. By providing a more reliable and task-aligned evaluation framework, it can guide researchers and developers in creating better voice reconstruction systems for people with speech disorders. The findings challenge the reliance on standard TTS metrics and advocate for more nuanced, use-case-specific evaluation methods. The framework can be adopted by the broader TTS and speech accessibility communities to standardize evaluations and facilitate fair comparisons between systems. Ultimately, this contributes to improving the quality of life for individuals with speech impairments by enabling more effective and personalized communication aids.
Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We reformulate audio editing as a rich-caption rewriting task by treating a rich caption as the semantic representation of an audio clip. The user request is translated into an edited caption, which then guides Bagpiper-Edit to generate the target edited audio with the original audio as a contextual acoustic anchor. This unlocks the potential of free-form editing and circumvents the need for paired audio-editing training data, enabling powerful zero-shot editing capabilities. Evaluations across speech, audio, and free-form editing show that Bagpiper-Edit maintains good consistency with the original audio and achieves performance similar to that of other expert models in most cases. Demo: https://bagpiper-edit.github.io, Code: https://github.com/espnet/espnet/pull/6417 & https://github.com/HsunGong/espnet
Primary: Shanghai Jiao Tong University
All Institutions: Auditory Cognition and Computational Acoustics Lab, Shanghai Jiao Tong University, Carnegie Mellon University, Language Technologies Institute
Bagpiper-Edit introduces a novel self-supervised framework for zero-shot open-ended audio editing by reformulating the task as rich-caption rewriting, effectively eliminating the need for paired editing datasets while maintaining high acoustic consistency.
The paper proposes Bagpiper-Edit, a framework that reformulates open-ended audio editing as a rich-caption rewriting task. The core innovation lies in bypassing the need for paired audio-editing training data by introducing a self-supervised training paradigm. This involves two strategies: audio repetition (to teach timbre/background preservation) and audio segmentation (to teach continuity across adjacent clips). The method leverages a pre-trained audio foundation model (Bagpiper-Base) and uses an external LLM to rewrite the rich caption based on user instructions, which then guides the audio generation conditioned on the original audio as an "acoustic anchor." The approach is theoretically sound and addresses a significant bottleneck in the field: the scarcity of high-quality paired editing datasets. The use of contiguous audio segments for self-supervision is a clever and resource-efficient technique for learning acoustic consistency.
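The two self-supervised strategies can be pictured as ways of building (context, target) pairs from unpaired audio. The sketch below is one reading of the description, with clips represented as plain sample lists; the function names and the exact pairing (identical copy for repetition, adjacent segments for segmentation) are assumptions rather than the paper's confirmed recipe.

```python
def repetition_pair(clip):
    """Audio repetition: the target is the context itself.

    Intended to teach preservation of timbre and background.
    """
    return list(clip), list(clip)


def segmentation_pair(clip, split_index):
    """Audio segmentation: the context is one segment, the target the adjacent one.

    Intended to teach continuity across contiguous clips.
    """
    if not 0 < split_index < len(clip):
        raise ValueError("split_index must fall strictly inside the clip")
    return list(clip[:split_index]), list(clip[split_index:])


# A toy "waveform" of eight samples.
waveform = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
rep_context, rep_target = repetition_pair(waveform)
seg_context, seg_target = segmentation_pair(waveform, split_index=5)
```

Neither pair requires a human-written edit instruction or an edited recording, which is how the approach avoids paired editing data.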
The evaluation covers three domains: speech editing, audio-event editing, and free-form editing. The authors compare Bagpiper-Edit against specialized expert models (CosyVoice-3, Ming-UniAudio-Edit, Step-Audio-EditX) and generative models (AudioLDM2). Results show that Bagpiper-Edit (specifically the Multi-Turn variant) achieves competitive performance, often surpassing expert models in speaker similarity preservation and matching them in naturalness and editing accuracy. The metrics used (WER, SpkSIM, DNSMOS, FAD, CLAP scores) are standard and appropriate. The inclusion of LLM-based human preference scoring adds a layer of subjective evaluation, though reliance on LLMs for judgment can be noisy. The ablation between Single-Turn and Multi-Turn patterns provides useful insight into the training dynamics.
The paper provides links to a demo and a GitHub PR for the code. The training data sources (YODAS, LAION-Audio, etc.) are publicly available. The self-supervised training strategy is clearly described, allowing for potential reproduction. However, the reliance on specific large LLMs (Qwen3-235B, Qwen3-8B) for caption extraction and rewriting introduces external dependencies that might vary in performance depending on the version or API access. The exact hyperparameters for the self-supervised phase are partially detailed, but the full pipeline complexity might make exact replication challenging without the complete codebase.
The paper acknowledges that as a zero-shot model, it may exhibit less stability than expert models trained on massive paired data. It also notes limitations in handling highly complex acoustic environments like multi-speaker separation. The reliance on an external LLM for caption rewriting means that errors in the LLM's understanding or rewriting can propagate to the audio generation. Additionally, the "rich caption" representation, while powerful, may lose some fine-grained acoustic details present in the original audio if the captioning process is not perfectly lossless.
This work significantly lowers the barrier for open-ended audio editing by removing the dependency on expensive paired datasets. It enables a more natural, free-form interface for audio manipulation, which has broad applications in content creation, accessibility, and audio post-production. The self-supervised approach could inspire similar techniques in other modalities where paired data is scarce. However, the ease of editing audio also raises concerns about deepfakes and misinformation, necessitating robust watermarking or detection mechanisms, which are not discussed in the paper.
Speaker-decoupled speech codecs can reduce bitrate by separating global speaker attributes from local content and prosody, while supporting voice conversion. Existing speaker-decoupled codecs face a trade-off: methods that explicitly suppress speaker leakage often rely on multi-stage or auxiliary training, whereas simpler designs can leave residual speaker information in local tokens. We propose SDP-Codec, a speaker-decoupled, pitch-injected codec trained with a single-stage optimization pipeline. SDP-Codec derives local tokens from continuous pre-quantization features of a pretrained self-supervised encoder and injects normalized F0 via a pitch encoder-decoder with global-conditioned denormalization and soft-label pitch reconstruction objective. Across 16 kHz and 24 kHz settings, SDP-Codec achieves competitive reconstruction and strong zero-shot voice conversion at comparable bitrates, with the lowest speaker-probing accuracy among compared systems, suggesting reduced speaker leakage.
Primary: Graduate School of Culture Technology, KAIST
All Institutions: Graduate School of Culture Technology, KAIST
SDP-Codec presents a compelling single-stage approach to speaker-decoupled speech coding, effectively balancing reconstruction quality and voice conversion capability through innovative pitch injection and soft-label loss mechanisms. The rigorous evaluation, including speaker probing and comprehensive ablation, strengthens the claim that the method successfully reduces speaker leakage in local tokens while preserving content and prosody, offering a practical solution for low-bitrate neural speech codecs.
The paper proposes SDP-Codec, a speaker-decoupled speech codec designed to minimize bitrate while supporting zero-shot voice conversion. The core methodological contribution lies in the decoupling strategy: it uses a single-stage optimization pipeline where local tokens are derived from continuous pre-quantization features of a pretrained vq-wav2vec encoder, rather than the quantized units themselves. To address the loss of prosodic information inherent in this compression, the authors inject normalized F0 via a dedicated pitch encoder-decoder. A key innovation is the use of a soft-label pitch reconstruction objective (Gaussian-blurred one-hot bins) and global-conditioned denormalization, which allows the global branch to handle speaker-dependent pitch range while the local branch handles content and normalized prosody. The global branch utilizes WavLM features compressed via a perceiver resampler to provide time-invariant speaker embeddings. This design attempts to resolve the trade-off between explicit speaker suppression (complex training) and residual leakage (poor VC).
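The soft-label idea can be sketched as follows: instead of a one-hot target over pitch bins, the target is a normalized Gaussian centered on the true bin, and the loss is a cross-entropy against that soft distribution. The bin count, the `sigma` value, and the choice of cross-entropy are assumptions for illustration; the source states only that the targets are Gaussian-blurred one-hot bins.

```python
import math


def soft_pitch_target(true_bin: float, num_bins: int, sigma: float = 1.0) -> list[float]:
    """Gaussian-blurred one-hot target over pitch bins, normalized to sum to 1."""
    weights = [math.exp(-0.5 * ((k - true_bin) / sigma) ** 2) for k in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(logits: list[float], target: list[float]) -> float:
    """Cross-entropy between a soft target and softmax(logits), via log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(target, logits))


target = soft_pitch_target(true_bin=5, num_bins=12, sigma=1.0)

# Predictions peaked on the true bin, one bin away, and far away.
exact = [5.0 if k == 5 else 0.0 for k in range(12)]
near = [5.0 if k == 6 else 0.0 for k in range(12)]
far = [5.0 if k == 0 else 0.0 for k in range(12)]
losses = [soft_cross_entropy(p, target) for p in (exact, near, far)]
```

With a soft target, a prediction one bin off is penalized less than a prediction many bins off, whereas a strict one-hot target would treat both errors identically.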
The evaluation covers reconstruction quality (UTMOS, WER, STOI) and zero-shot voice conversion (SECS, F0 correlation, NMOS, SMOS) at 16 kHz and 24 kHz. The paper compares SDP-Codec against strong baselines including LSCodec, BiCodec, MSRCodec, and EZ-VC. Results indicate that SDP-Codec achieves competitive reconstruction quality, particularly in STOI and F0 correlation, while demonstrating superior speaker similarity in VC tasks compared to other codec-based systems. The speaker-probing accuracy experiment is a strong addition, quantitatively demonstrating reduced speaker leakage in the local tokens compared to competitors. The ablation studies effectively validate the components, showing that the soft-label pitch loss and the use of continuous pre-quantization features are critical for performance.
The paper provides source code and a demo link. Training details are provided, including dataset splits (LibriSpeech, LibriTTS, MLS), training steps, and hardware. However, the use of "internal" data for some baselines (Vevo) and the specific configuration of baselines (e.g., merging thresholds) to match bitrates introduces some variability. The description of the architecture is sufficiently detailed for replication, and the frozen pretrained components (vq-wav2vec, WavLM, FCPE) are standard.
The paper acknowledges that content fidelity (WER) remains a limitation, trading some reconstruction quality for VC performance. The reliance on a pretrained pitch extractor (FCPE) introduces a dependency on external tools for feature extraction, which might not be perfectly aligned with the codec's internal representations. The evaluation is primarily on English datasets (LibriSpeech, LibriTTS, MLS), which limits the assessment of multilingual robustness: although the authors mention Indic languages in a table header, the main text focuses on English. The single-stage training, while simpler, might not achieve the same level of disentanglement as multi-stage methods that explicitly optimize for speaker suppression.
SDP-Codec contributes to the field of efficient speech representation learning, enabling low-bitrate transmission and flexible voice conversion. This has implications for telecommunication, speech language models (SLMs) by providing compact tokens, and creative applications in voice cloning and conversion. The reduced speaker leakage is a positive step towards more ethical and controllable voice conversion systems. However, the potential for misuse in deepfake generation remains a concern, though the low bitrate and specific use case mitigate this slightly compared to high-fidelity generative models.
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework combining frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss on speech inputs via a pretrained teacher retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Our method reduces speech-NVV EER from 38.93% to 22.66% over a pretrained baseline, and improves speech EER from 13.17% to 9.24% via distillation.
Primary: National Taiwan University
All Institutions: University of Michigan, National Taiwan University, Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, National Taiwan University Artificial Intelligence Center of Research Excellence
This paper presents a novel MoE-based speaker verification framework that effectively bridges the domain gap between verbal and non-verbal vocalizations, significantly improving cross-domain verification performance while mitigating catastrophic forgetting through conditional distillation.
The paper proposes a Mixture of Experts (MoE) framework integrated with a frozen Data2Vec self-supervised learning (SSL) front-end and an ECAPA-TDNN backend to address speaker verification (SV) across verbal and non-verbal vocalizations (NVVs). The core methodological innovation lies in the combination of three components: (1) an Inter-Layer Residual MoE (IR-MoE) architecture that routes features to specialized experts for speech vs. NVV, governed by learned domain-aware constraints (entropy maximization, KL divergence for intra-event consistency, and cosine-margin for inter-event separation); (2) a conditional knowledge distillation loss using a frozen WavLM teacher to prevent catastrophic forgetting of speech capabilities during NVV adaptation; and (3) a supervised contrastive loss that bridges the domain gap by enforcing shared speaker manifolds across speech and NVV pairs. The approach is technically sound, leveraging established SSL representations while introducing architectural and loss-function modifications to handle the heterogeneity of non-phonemic acoustic signals. The use of MoE for domain separation in SV is a novel application, moving beyond standard fine-tuning paradigms.
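To make the "conditional" part of the distillation concrete, here is a minimal sketch in plain Python. It assumes the distillation term is a cosine distance between student and teacher speaker embeddings of equal dimension, applied only to items flagged as speech; the paper's exact loss form is not given in this summary, and the names `conditional_distill_loss` and `is_speech` are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two non-zero vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def conditional_distill_loss(student, teacher, is_speech):
    # Assumed form: mean (1 - cosine) over speech items only.
    # NVV items are masked out, so the teacher never constrains them.
    terms = [1.0 - cosine(s, t)
             for s, t, m in zip(student, teacher, is_speech) if m]
    return sum(terms) / len(terms) if terms else 0.0

# Toy batch: two speech items and one NVV item (illustrative values only).
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[1.0, 0.0], [1.0, 0.0], [-1.0, -1.0]]
is_speech = [True, True, False]
loss = conditional_distill_loss(student, teacher, is_speech)
```

The third item disagrees strongly with the teacher but contributes nothing, which is the intended effect: speech behavior is anchored to the teacher while NVV embeddings remain free to adapt.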
The experimental evaluation is comprehensive, utilizing the NonverbalTTS dataset with 10 distinct NVV categories. The authors provide a systematic analysis of domain mismatch, demonstrating the significant performance drop of zero-shot SV models on NVVs (EER ~39%). They compare against multiple baselines, including standard SSL models (WavLM, Data2Vec, Voc2Vec) and fusion strategies. The proposed IR-MoE model reduces cross-domain EER (NvS) from 38.93% to 22.66% relative to the pretrained baseline and, with distillation, improves speech EER (SvS) from 13.17% to 9.24%. The ablation studies effectively isolate the contributions of the MoE architecture and the conditional distillation loss. The results indicate that the proposed method mitigates catastrophic forgetting while improving NVV verification. The inclusion of mDCF metrics adds robustness to the evaluation.
The paper provides detailed implementation specifications, including the optimizer (Adam), learning rate schedule (cosine annealing), batch composition strategy (speaker-balanced with cross-domain prioritization), and specific loss weights. The use of standard, open-source components (Data2Vec, ECAPA-TDNN, WavLM) enhances reproducibility. The description of the progressive training schedule and the specific constraints for the MoE routing (entropy, KL, cosine-margin) is sufficiently detailed for replication. However, the exact codebase for the MoE integration and the specific preprocessing steps for the NonverbalTTS dataset (beyond MFA alignment) might require careful interpretation, though the overall pipeline is clear.
A primary limitation is the reliance on the NonverbalTTS dataset, which, while diverse, may not capture the full spectrum of NVVs found in real-world expressive TTS/VC systems (e.g., complex emotional laughter, gasps, or synthesized artifacts). The MoE architecture introduces additional parameters and complexity compared to standard fine-tuning, which could impact inference latency, although this is not explicitly measured. The "conditional" distillation relies on the model's ability to correctly route inputs during inference; if the router misclassifies an input (e.g., a speech-like NVV), the distillation benefit might not be fully realized or could even be detrimental if the wrong expert is activated. Furthermore, the EER on NVV-NVV (NvN) verification remains relatively high (~27-28%), indicating that while cross-domain generalization is improved, intra-NVV verification is still challenging.
This work has significant implications for the evaluation and development of expressive TTS and Voice Conversion systems, where NVVs are crucial for naturalness. By providing a reliable SV metric for NVVs, it enables better objective evaluation of these systems, potentially accelerating their adoption in virtual assistants, gaming, and accessibility tools. It also contributes to the broader field of multi-modal and multi-domain speaker recognition, offering a template for handling heterogeneous acoustic inputs. The mitigation of catastrophic forgetting is a generalizable technique for continual learning in speech processing.
Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. We audit this gap with a frozen state-of-the-art SSL-AASIST detector trained on ASVspoof 2019 LA. While its in-domain EER is 0.21%, transferring its LA-calibrated threshold to the In-the-Wild corpus yields a half total error rate (HTER) of 39.5%, with 78.7% of bona fide speech rejected, even though the In-the-Wild EER (11.2%) appears moderate. We then test whether popular unlabeled test-time corrections close this gap, and first prove a simple proposition: any strictly increasing score transform, including z-norm, temperature/shift calibration, and embedding mean alignment under a frozen linear head, cannot change EER. An audit of seven corrections on In-the-Wild and ASVspoof 2021 DF confirms the proposition empirically and exposes two further failure modes: AS-norm with an unlabeled target cohort collapses (EER 11.2% to 60.2%), and pseudo-label calibration that reduces HTER by 38% relative on In-the-Wild degenerates to 50% HTER on DF21, whose spoof prior is 96%. No audited correction reduces EER by more than 1% relative. We recommend reporting HTER at a transferred threshold alongside EER.
Primary: Xidian University
All Institutions: Xidian University
This paper provides a crucial theoretical and empirical audit of speech deepfake detection evaluation, proving that common unlabeled score calibrations cannot improve EER and demonstrating the severe deployment failures hidden by standard metrics, thereby advocating for more realistic evaluation protocols in the field.
The paper presents a critical methodological audit of speech deepfake countermeasures (CMs), specifically focusing on the discrepancy between evaluation metrics (EER) and deployment reality (HTER at fixed thresholds). The core theoretical contribution is a rigorous proof of monotone invariance: any strictly increasing transformation of scores (including z-norm, temperature scaling, and embedding mean alignment) cannot alter the Equal Error Rate (EER). This is a significant theoretical insight that debunks the efficacy of many popular "unlabeled test-time corrections" often cited in adjacent fields like speaker verification. The methodology involves freezing a state-of-the-art SSL-AASIST model and systematically applying seven different correction strategies (C1-C7) to evaluate their impact on both in-domain and out-of-domain performance. The approach is clean, logically sound, and effectively isolates the operating-point shift from the score distribution shift.
The experimental evaluation is robust and well-designed. The authors use a frozen SSL-AASIST model trained on ASVspoof 2019 LA and test it on two distinct out-of-domain datasets: In-the-Wild (ITW) and ASVspoof 2021 DF. The results starkly illustrate the paper's thesis: while the in-domain EER is near-zero (0.21%), the transferred threshold leads to catastrophic failure on ITW (HTER 39.5%, 78.7% FRR). The audit of seven corrections confirms the theoretical proposition: monotone methods (C1-C3) fail to improve EER and often worsen HTER, while non-monotone methods (C4-C7) show limited improvement or collapse (e.g., AS-norm with unlabeled cohort collapses to 60.2% EER due to cohort contamination). The analysis of Failure Mode III (prior sensitivity) on DF21, where pseudo-label calibration fails due to a 96% spoof prior, provides deep practical insight into the fragility of current calibration techniques.
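The monotone-invariance proposition is easy to check numerically. The sketch below uses only the Python standard library and synthetic Gaussian scores (not the paper's data); the helper names `error_rates`, `eer`, and `hter`, the fixed threshold of 0.0, and the temperature/shift constants are all illustrative assumptions. EER is approximated by sweeping thresholds over the observed scores and taking the point where false-acceptance and false-rejection rates are closest.

```python
import math
import random

def error_rates(scores, labels, thr):
    # labels: 1 = bona fide, 0 = spoof; a trial is accepted when score >= thr.
    bona = [s for s, y in zip(scores, labels) if y == 1]
    spoof = [s for s, y in zip(scores, labels) if y == 0]
    frr = sum(s < thr for s in bona) / len(bona)      # bona fide rejected
    far = sum(s >= thr for s in spoof) / len(spoof)   # spoof accepted
    return far, frr

def eer(scores, labels):
    # Oracle operating point: threshold where |FAR - FRR| is smallest.
    best = None
    for thr in sorted(set(scores)):
        far, frr = error_rates(scores, labels, thr)
        cand = (abs(far - frr), (far + frr) / 2)
        if best is None or cand < best:
            best = cand
    return best[1]

def hter(scores, labels, thr):
    # Half total error rate at a threshold fixed in advance.
    far, frr = error_rates(scores, labels, thr)
    return (far + frr) / 2

random.seed(0)
labels = [1] * 200 + [0] * 200
scores = ([random.gauss(1.0, 1.0) for _ in range(200)]
          + [random.gauss(-1.0, 1.0) for _ in range(200)])

# Two strictly increasing transforms: z-norm and a temperature/shift map.
mean = sum(scores) / len(scores)
std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
znorm = [(s - mean) / std for s in scores]
scaled = [s / 2.5 - 1.5 for s in scores]

fixed_thr = 0.0  # stands in for a threshold calibrated on the source domain
```

Because a strictly increasing map preserves the ordering of scores, every threshold on the original scale has a counterpart on the transformed scale with identical error counts, so `eer` is unchanged. `hter` at `fixed_thr`, by contrast, moves as soon as the score distribution slides past a threshold that does not move with it, which is the operating-point gap the audit measures.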
The paper provides sufficient detail for reproduction. It specifies the model architecture (SSL-AASIST with wav2vec 2.0 front-end), the specific checkpoint used, the data subsets (verified against official protocols), and the exact procedures for each correction method (e.g., L-BFGS settings for temperature scaling, leave-one-out for AS-norm). No code for the audit protocol is explicitly linked in the text provided, so its availability cannot be confirmed here. The verification of labels against official protocol files adds a layer of trust to the experimental setup.
The primary limitation is the scope of the audit: it focuses on a single model architecture (SSL-AASIST). While the monotone invariance proposition is model-agnostic, the magnitude of errors for non-monotone methods may vary across different architectures (e.g., CNN-based vs. Transformer-based detectors). Additionally, the audit uses specific subsets of the ASVspoof datasets (4/9 shards for LA, 4/80 for DF21), which may not fully represent the entire dataset's variability. The paper also notes that HTER is a single operating point; applications requiring different cost functions (e.g., t-DCF) might see different absolute numbers, although the qualitative conclusions likely hold.
This paper has significant implications for the speech anti-spoofing community. By highlighting that EER is a misleading metric for deployment readiness, it challenges the status quo of benchmarking in the field. The recommendation to report HTER at transferred thresholds alongside EER could lead to more realistic evaluation standards. The proof that many common calibration techniques are theoretically incapable of improving EER saves researchers from pursuing dead ends and directs attention to non-monotone corrections or fundamental model improvements. This work promotes more rigorous and deployment-aware evaluation practices in audio security and deepfake detection.
Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most needed. Moreover, we find that text-based reference conditioning can propagate atypical acoustic patterns from atypical speech into synthesis, even when ground-truth transcripts are available. To address this, we propose RTFree-F5, which replaces the reference transcript with continuous self-supervised speech representations mapped into F5-TTS's text-conditioning space via a lightweight adapter, while reusing the pretrained checkpoint. On dysarthric speech, RTFree-F5 reduces WER from 24.6% to 10.4%, surpassing even the ground-truth reference transcript baselines, while improving naturalness and remaining competitive on standard benchmarks without requiring any reference transcript.
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology, University of Illinois Urbana-Champaign
RTFree-F5 significantly advances zero-shot TTS by introducing a transcript-free conditioning mechanism using SSL features, demonstrating superior intelligibility for atypical speakers and revealing critical limitations in text-based reference conditioning. The paper presents a robust, well-evaluated solution to a persistent problem in speech synthesis, with clear implications for both technical research and real-world accessibility applications.
The paper proposes RTFree-F5, a novel conditioning mechanism for flow-matching TTS models (specifically F5-TTS) that eliminates the need for reference transcripts. The core innovation is replacing the text-encoder output for the reference audio with a projection of continuous self-supervised learning (SSL) features (from WavLM) into the text-conditioning space. This is a clever architectural modification that leverages the robustness of SSL representations to handle atypical speech (dysarthria, accents) where ASR fails. The two-stage training strategy (aligning the projector, then joint fine-tuning) is methodologically sound and addresses the distribution shift between the original within-utterance infilling objective and the new cross-utterance conditioning setup. The approach is technically elegant as it reuses the heavy pretrained backbone, requiring only a lightweight adapter.
The evaluation is comprehensive, covering both standard zero-shot TTS benchmarks (LibriSpeech-PC, SeedTTS) and challenging atypical speech datasets (SAP for dysarthria, L2-ARCTIC for accents). The results demonstrate significant improvements in intelligibility (WER) for atypical speakers, outperforming even oracle transcript baselines. This is a critical finding that highlights a fundamental flaw in text-based reference conditioning for non-standard speech. The trade-off between intelligibility and speaker similarity (SIM) is honestly reported and analyzed. The inclusion of standard metrics (UTMOS, SIM) alongside WER provides a well-rounded view of performance. The comparison against ASR-based baselines further strengthens the claim that the method is robust to transcription errors.
The paper provides sufficient implementation details for reproduction, including the specific SSL model (WavLM-Large), the architecture of the projector (2-layer MLP), training hyperparameters (learning rates, epochs, optimizer), and the data construction process (cross-utterance pairs from LibriTTS). The use of publicly available models (F5-TTS, WavLM, Vocos) and datasets ensures that the work is reproducible. The code for the projector and training script would likely need to be released, but the architectural description is clear enough to implement.
The primary limitation is the trade-off between intelligibility and speaker similarity. While intelligibility improves, speaker similarity scores drop, suggesting that some unique voice characteristics are lost or altered in the process of normalizing the speech. The authors acknowledge this but suggest it may be acceptable for accessibility applications. Additionally, the method relies on the quality of the SSL encoder; if the SSL features do not capture the necessary speaker identity information as effectively as the original text+acoustic context, performance might suffer for typical speakers (though results show it remains competitive). The method also introduces a slight inference overhead due to the SSL feature extraction and projection, although this is likely negligible compared to the TTS generation itself.
This work has significant positive broader impact, particularly for accessibility. By enabling high-quality, intelligible speech synthesis for individuals with dysarthria or strong accents without requiring perfect transcripts, it removes a major barrier to entry for assistive communication technologies. It democratizes access to zero-shot TTS for populations that are currently underserved by standard systems. The technical contribution also advances the field of multimodal conditioning, demonstrating the utility of SSL features as a robust alternative to text for reference conditioning in generative models.
Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter. First, asymmetric temporal padding redistributes past and future context in convolutions, enabling systematic latency configuration. Second, dual-buffer streaming combines state buffers for past context with lookahead buffers that supply future context at both the input and feature levels. Selective state updates also prevent future-frame leakage into the streaming state, ensuring training-inference consistency. On VoiceBank+DEMAND, a fixed-budget (1.37M parameters) backbone yields a family of models spanning 12.5-75.0 ms, with PESQ rising from 3.35 to 3.43. At just 12.5 ms (fully causal), a PESQ of 3.35 matches or exceeds the prior causal state-of-the-art (3.27 at 46.5 ms).
Primary: Pohang University of Science and Technology (POSTECH)
All Institutions: Pohang University of Science and Technology (POSTECH), Intus Co. Ltd.
LaCo-SENet introduces a latency-configurable streaming speech enhancement framework using asymmetric temporal padding and selective state updates to prevent future-frame leakage, achieving state-of-the-art causal performance at ultra-low latency. The paper presents a robust solution to the latency-quality trade-off in streaming speech enhancement. By decoupling these factors through asymmetric padding and a novel dual-buffer mechanism with selective state updates, the authors enable a single fixed-parameter model to span a wide range of algorithmic latencies. The technical contribution is significant because it addresses a fundamental implementation challenge (state corruption) that has previously hindered flexible streaming designs. The results demonstrate that careful architectural management of context buffers can yield superior causal performance compared to models specifically designed for higher latencies. This approach offers a practical path for deploying high-quality speech enhancement on resource-constrained devices where latency is a hard constraint.
The paper proposes LaCo-SENet, a streaming speech enhancement architecture that decouples algorithmic latency from model quality by introducing asymmetric temporal padding. The core innovation lies in the "dual-buffer" streaming framework, which manages state buffers for past context and lookahead buffers for future context. Crucially, the authors identify and solve the "state corruption" problem inherent in naive asymmetric padding implementations, where future frames recorded in state buffers would leak into subsequent chunks. They resolve this via "selective state updates," ensuring that only current-chunk frames update the recurrent state, thereby maintaining training-inference consistency. The approach is technically sound, leveraging standard convolutional operations (Dense Dilated Depthwise Blocks) but applying them in a novel streaming configuration. The method is elegant in its simplicity: a single hyperparameter (padding ratio) controls the latency-quality trade-off without altering the parameter count or receptive field size.
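The interaction between asymmetric padding, lookahead buffers, and selective state updates can be illustrated with a single 1-D convolution in plain Python. This is a sketch under simplifying assumptions, not the LaCo-SENet implementation: the functions `conv_offline` and `conv_streaming`, the `right` argument (number of future samples, standing in for the padding ratio), and the `selective` flag are hypothetical names introduced for illustration.

```python
def conv_offline(x, kernel, right):
    # Asymmetric padding: `left` past samples and `right` future samples,
    # with left + right = K - 1, so the output length equals the input length.
    K = len(kernel)
    left = K - 1 - right
    xp = [0.0] * left + list(x) + [0.0] * right
    return [sum(kernel[j] * xp[t + j] for j in range(K)) for t in range(len(x))]

def conv_streaming(x, kernel, right, chunk, selective=True):
    # Dual buffers: `state` carries past context, `look` supplies future context.
    K = len(kernel)
    left = K - 1 - right
    state = [0.0] * left
    out = []
    for s in range(0, len(x), chunk):
        cur = list(x[s:s + chunk])
        look = list(x[s + chunk:s + chunk + right])
        look += [0.0] * (right - len(look))          # zero-pad at end of stream
        window = state + cur + look
        out += [sum(kernel[j] * window[t + j] for j in range(K))
                for t in range(len(cur))]
        # Selective update: only current-chunk samples may enter the state.
        # The naive variant also stores lookahead samples and corrupts it.
        history = state + cur if selective else window
        state = history[len(history) - left:]
    return out

x = [float(v) for v in range(10)]
kernel = [1.0, 2.0, 3.0, 4.0]
offline = conv_offline(x, kernel, right=1)
streamed = conv_streaming(x, kernel, right=1, chunk=3)
leaky = conv_streaming(x, kernel, right=1, chunk=3, selective=False)
```

With the selective update, the chunked output matches the offline output sample for sample; with the naive update, lookahead samples are stored as if they were past context and the next chunk sees a shifted history. Setting `right=0` recovers a fully causal convolution, and increasing `right` trades algorithmic latency for future context without changing the kernel size.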
The evaluation is conducted on the standard VoiceBank+DEMAND dataset. The results are compelling: at 12.5 ms latency (fully causal), LaCo-SENet achieves a PESQ of 3.35, outperforming the previous causal state-of-the-art (aTENNuate at 3.27 PESQ, but with 46.5 ms latency). The model scales gracefully up to 75.0 ms (PESQ 3.43) and 200.0 ms (PESQ 3.47). The ablation study effectively demonstrates the necessity of the selective state update mechanism, showing catastrophic performance drops (below noisy baseline) when it is disabled. The throughput analysis (RTF) is also included, showing real-time capability. However, the evaluation is limited to a single dataset (VoiceBank+DEMAND) and a single metric suite (PESQ, STOI, CSIG, etc.). While the results are strong for the benchmark, the lack of evaluation on more recent, larger-scale datasets (e.g., DNS Challenge, CHiME) limits the generalizability claims.
The paper provides sufficient detail for reproduction. The architecture (PrimeK-Net backbone), hyperparameters (channel dimensions, kernel sizes, STFT parameters), and training loss components are explicitly listed. The mathematical formulation of the asymmetric padding and selective state updates is clear. The code is not provided, but the method relies on standard PyTorch/TensorFlow operations (convolutions, buffering, masking), making implementation feasible for researchers in the field.
The primary limitation is the scope of evaluation. VoiceBank+DEMAND is a relatively small, older dataset with specific noise types. Performance on real-world, non-stationary noise or diverse acoustic environments is not reported. Additionally, while the latency is configurable, the model does not adapt latency dynamically based on input SNR or content; it is fixed per model instance. The PESQ gains at higher latencies (75-200ms) are marginal compared to the latency cost, suggesting diminishing returns for very high lookahead in this specific architecture.
This work has significant implications for real-time audio applications such as teleconferencing, hearing aids, and voice assistants, where low latency is critical but quality cannot be sacrificed. By enabling a single model to serve a range of latency requirements, it simplifies deployment pipelines for edge devices with varying computational constraints. The technique of selective state updates for preventing future-frame leakage is a generalizable contribution to streaming deep learning architectures beyond just speech enhancement.
Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.
Primary: Nagoya Institute of Technology
All Institutions: Nagoya Institute of Technology, LY Corporation
This paper provides a critical diagnostic analysis of automatic MOS prediction models, demonstrating their insensitivity to prosodic errors and misalignment with human perception regarding speaker characteristics, thereby highlighting the need for more perceptually aligned evaluation metrics in speech synthesis research.
The paper employs a rigorous diagnostic methodology to probe the internal representations and output sensitivities of existing automatic MOS prediction models (UTMOS, DNSMOS, NISQA, SHEET-MB, SHEET-BV). By applying controlled, independent perturbations to speech signals—specifically acoustic degradation, prosodic (accent) errors, and speaker characteristics (pitch/rate)—the authors isolate specific quality dimensions. This "ablation-style" evaluation of the *evaluator* rather than the *generator* is a sophisticated approach to understanding model failure modes. The use of Japanese pitch-accent language provides a critical test case for prosodic sensitivity that is often overlooked in English-centric benchmarks. The methodology is sound, well-controlled, and directly addresses the gap between scalar model outputs and multidimensional human perception.
The experimental design is robust. Group A confirms that models generally track acoustic fidelity well (with some exceptions like SHEET-MB on MP3 compression), validating the baseline utility of these models. Group B provides the most striking result: a complete dissociation between human sensitivity to prosodic errors (large MOS drops) and model insensitivity (negligible score changes). This is a significant finding. Group C reveals a "double dissociation" in speaker characteristics: models are biased by mean F0 (likely due to training data distributions) while ignoring F0 variability and speaking rate, which humans notice. The statistical reporting (SRCC, Pearson r, confidence intervals) is adequate. The comparison across multiple distinct models allows for a comprehensive view of the field's current state.
The paper provides sufficient detail for reproduction. It specifies the datasets (JVS, NANSY-TTS internal data), the perturbation methods (SiFi-GAN for pitch, WORLD for rate), and the models evaluated (using the VERSA toolkit). The specific conditions (e.g., clipping levels, SNR values, accent swap probabilities) are clearly defined. However, the use of an "internal dataset" for the TTS generation in Group B limits full independent reproduction of that specific subset, though the methodology is clear. The VERSA toolkit ensures standard inference, enhancing reproducibility for the model evaluation part.
The primary limitation is the reliance on a single language (Japanese) for the prosodic analysis. While Japanese is a pitch-accent language, it is unclear if these findings generalize to tonal languages (e.g., Mandarin) or non-tonal languages with different prosodic structures (e.g., stress-accent languages like English). Additionally, the study focuses on *naturalness* and *quality* as perceived by humans, but does not explore whether these model biases correlate with other metrics like intelligibility or speaker similarity. The sample size for the subjective evaluation (15 listeners) is on the lower end for robust statistical power, though acceptable for a diagnostic study.
This work has significant implications for the TTS and speech processing communities. It challenges the uncritical use of automatic MOS predictors as proxies for human evaluation, particularly in contexts where prosody and speaker identity are crucial (e.g., expressive TTS, voice conversion). It highlights that current SSL-based models may encode spurious correlations (like mean F0) rather than perceptually relevant features. This could guide future model training, suggesting the need for explicit prosodic supervision or multi-objective loss functions that account for speaker variability. It serves as a cautionary tale for relying solely on scalar metrics in AI research.
Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.
Primary: LY Corporation
All Institutions: LY Corporation
PASQA introduces a specialized speech quality assessment model for Japanese pitch accents, leveraging synthetic data and multi-task learning to achieve superior sensitivity to prosodic errors compared to general MOS predictors. This work addresses a critical niche in TTS evaluation, providing a robust, reproducible, and human-aligned metric for accent correctness that can drive improvements in synthetic speech naturalness.
The paper proposes PASQA, a model specifically designed to assess pitch-accent correctness in Japanese speech, addressing a gap where standard MOS predictors are insensitive to localized prosodic errors. The methodology is sound and well-structured. Key innovations include the use of mora-conditioned fusion (incorporating linguistic structure into SSL features), a ranking loss (Bradley-Terry) to enforce ordinal relationships between error severities, an auxiliary frame-level error detection head for localized supervision, and a Gradient Reversal Layer (GRL) for speaker-invariant training. The use of self-supervised representations (wav2vec 2.0) as a backbone is appropriate for this task. The integration of linguistic tokens via cross-attention is a strong design choice for a language-specific prosodic task.
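For readers unfamiliar with Bradley-Terry ranking losses, a minimal sketch follows. It assumes pairs are formed from utterances with different accent-error rates and that the utterance with the lower error rate should receive the higher score; how PASQA actually samples and weights pairs is not stated in this summary, and `bt_pair_loss` and `bt_ranking_loss` are hypothetical names.

```python
import math

def bt_pair_loss(score_better, score_worse):
    # Bradley-Terry negative log-likelihood: -log(sigmoid(margin)),
    # written as a numerically stable softplus of the negated margin.
    margin = score_better - score_worse
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def bt_ranking_loss(scores, error_rates):
    # Average pairwise loss over all pairs with distinct severities.
    # Assumption: a lower accent-error rate should map to a higher score.
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if error_rates[i] < error_rates[j]:
                losses.append(bt_pair_loss(scores[i], scores[j]))
    return sum(losses) / len(losses) if losses else 0.0

# Toy example: three versions of an utterance with increasing error rates.
error_rates = [0.0, 0.1, 0.5]
well_ordered = bt_ranking_loss([2.0, 1.0, 0.0], error_rates)
mis_ordered = bt_ranking_loss([0.0, 1.0, 2.0], error_rates)
```

The loss depends only on score differences, so it constrains the ordering and margins between severities rather than the absolute scale of the scores.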
The experimental setup is rigorous. The authors construct a large-scale synthetic dataset with controlled accent errors, which is a significant contribution in itself, allowing for precise evaluation of sensitivity to error severity. The evaluation metrics (Order Accuracy, SRCC, and Kendall's tau) are well-chosen to test the specific hypothesis that PASQA preserves severity ordering better than conventional models. The results clearly demonstrate that PASQA outperforms both traditional acoustic feature-based models and general SSL-based MOS predictors (like UTMOS, DNSMOS, NISQA) on this specific task. The ablation study effectively isolates the contribution of each component (mora fusion, ranking loss, frame head, GRL). The inclusion of an out-of-domain (OOD) evaluation on GPT-4o-mini-TTS speech adds robustness to the claims.
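One plausible reading of the order-accuracy metric is the fraction of comparable pairs whose predicted scores are ordered consistently with accent-error severity. The sketch below implements that reading in plain Python; the paper's exact definition (for example, how ties or pair sampling are handled) is not given here, so `order_accuracy` and its tie handling are assumptions.

```python
def order_accuracy(scores, error_rates):
    # Assumed definition: over all pairs with different error rates, count a
    # pair as correct when the lower-error utterance gets the strictly higher
    # score. Pairs with equal error rates are skipped; score ties count as wrong.
    correct = 0
    total = 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if error_rates[i] == error_rates[j]:
                continue
            total += 1
            lo, hi = (i, j) if error_rates[i] < error_rates[j] else (j, i)
            if scores[lo] > scores[hi]:
                correct += 1
    return correct / total if total else 0.0

error_rates = [0.0, 0.1, 0.5]
acc_good = order_accuracy([0.9, 0.7, 0.2], error_rates)   # all three pairs ordered correctly
acc_mixed = order_accuracy([0.9, 0.2, 0.7], error_rates)  # one of three pairs inverted
```

A predictor that ignores accent errors entirely would assign near-identical scores across severities and fall toward chance or below under this definition, which is the failure the review attributes to conventional MOS predictors.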
The paper provides sufficient detail for reproduction. It specifies the backbone (wav2vec 2.0), the TTS system used for data generation (NANSY-TTS), the loss functions, hyperparameters (learning rate, batch size, loss weights), and the dataset construction pipeline. The code is made available on GitHub, which significantly enhances reproducibility. The synthetic nature of the training data allows other researchers to replicate the data generation process exactly.
The primary limitation is the reliance on synthetic data for training. While the controlled errors are well-defined, there may be a domain gap between synthetic accent errors and natural speech variations or errors produced by other TTS systems. The model is specifically tailored for Japanese pitch accent; while the methodology could be extended, the paper does not demonstrate cross-lingual applicability. Additionally, the "pseudo" accent-quality scores are derived from error rates, an approach that assumes a linear or monotonic relationship between error count and perceived quality and might not perfectly capture human perceptual nuances (e.g., context-dependent tolerance).
This work has significant implications for the development and evaluation of Text-to-Speech systems, particularly for languages with complex prosodic systems like Japanese. By providing a tool that aligns better with human judgments of accent correctness, it enables more efficient iteration cycles for TTS developers. It also contributes to the broader field of speech quality assessment by highlighting the need for task-specific metrics beyond general naturalness. The open-source release promotes further research in prosodic quality evaluation.
Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing models to explicitly inject features like fundamental frequency. This often requires buffering future frames, creating algorithmic lookahead latency. On the other hand, existing perturbation methods largely overlook the crucial trade-off between timbre leakage and utility preservation. Recognizing this neglected trade-off, we find that the inherent objective of Speaker Anonymization (SA) aligns well with balancing these factors. Thus, we introduce SA as a novel perturbation mechanism to explicitly mitigate timbre leakage while retaining prosodic utility. Crucially, SA's robust representations significantly alleviate the generator's reliance on future context, enabling our strictly causal, zero-lookahead network. Audio samples are available at https://amphionteam.github.io/Zero-VC-demo/.
Primary: The Chinese University of Hong Kong, Shenzhen
All Institutions: The Chinese University of Hong Kong, Shenzhen, Shenzhen Loop Area Institute, Shenzhen Transsion Holdings Co., Ltd., Amphion Technology Co., Ltd.
This paper presents a practical and effective solution for low-latency streaming voice conversion by repurposing Speaker Anonymization techniques to optimize the timbre-utility trade-off, enabling a strictly causal architecture that outperforms existing methods in both quality and latency.
The paper proposes Zero-VC, a streaming zero-shot voice conversion system that leverages Speaker Anonymization (SA) as a perturbation mechanism to disentangle timbre from linguistic content. The core methodological insight is that SA, which aims to conceal identity while preserving prosody, provides a robust feature representation that reduces the generator's reliance on future context. This allows for a strictly causal, zero-lookahead architecture (20ms latency) using causal convolutions in a HiFi-GAN-based decoder. The approach combines an off-the-shelf SA module with a streaming encoder (distilled w2v-bert-2.0) and a WavLM-based timbre extractor. While the application of SA to VC is a logical and clever adaptation of existing privacy techniques, the architectural novelty is incremental, relying heavily on standard components (HiFi-GAN, WavLM) and a specific preprocessing step. The claim of "novelty" rests more on the systematic analysis of the leakage-utility trade-off in perturbation methods than on a fundamentally new neural architecture.
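The zero-lookahead property comes down to every output frame depending only on current and past inputs. The sketch below shows this for a single 1-D causal convolution in plain Python; it is an illustration of the general idea, not the Zero-VC decoder, and the function name is hypothetical.

```python
def causal_conv1d(x, kernel):
    """y[t] = sum_k kernel[k] * x[t - k], treating x[t] as 0 for t < 0.

    No term uses x[t + 1] or later, so the layer adds no algorithmic lookahead.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        out.append(acc)
    return out

# Changing future samples leaves earlier outputs untouched.
y_a = causal_conv1d([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
y_b = causal_conv1d([1.0, 2.0, 99.0, 99.0], [0.5, 0.5])
```

Because the first two outputs are identical in both calls, a streaming system can emit them before the later samples arrive.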
The evaluation is comprehensive, comparing Zero-VC against non-streaming SOTA models (LSCodec, CosyVoice, Seed-VC) and discussing latency relative to streaming baselines. The authors use a robust set of metrics including Speaker Similarity (SS-S, SS-R), WER, F0 Pearson Coefficients, OVRL (DNSMOS), and subjective NMOS/SMOS. The results show that Zero-VC achieves superior timbre conversion (lowest SS-S, highest SS-R) and competitive quality scores while maintaining ultra-low algorithmic latency (20ms). The ablation studies effectively demonstrate that SA outperforms other perturbation methods in balancing leakage and utility, and that the SA-perturbed features saturate in performance with minimal lookahead, validating the zero-lookahead design. The use of an open-source evaluation dataset (seed-tts-eval) and clear reporting of RTF adds credibility. However, the comparison against closed-source streaming models (StreamVC, RT-VC) is limited to latency claims, a common and largely unavoidable limitation in this field.
The paper provides sufficient implementation details, including the dataset (LibriTTS), model architectures (w2v-bert-2.0, WavLM, HiFi-GAN), training hyperparameters (optimizer, learning rate, loss weights), and evaluation metrics. The use of off-the-shelf models for SA and timbre extraction aids reproducibility, although the specific version of the SA module is linked to a GitHub repository. The code for the Zero-VC model itself is not explicitly linked in the text provided (only the demo page), which slightly hinders immediate reproduction, but the architectural description is clear.
The primary limitation is the reliance on an external, off-the-shelf SA module for preprocessing during training. This introduces an additional dependency and potential error source, and the authors acknowledge that this pre-processing step may introduce training overhead. The paper suggests future work to integrate SA end-to-end, which implies the current system is not fully unified. Additionally, the evaluation is limited to English, and the cross-lingual capability is noted as future work. The reliance on WavLM for timbre extraction, while effective, adds computational cost compared to lighter-weight speaker embedding extractors.
Zero-VC has significant implications for real-time voice conversion applications, such as privacy-preserving communication, virtual avatars, and assistive technologies. By achieving hard-real-time latency without sacrificing conversion quality, it addresses a critical bottleneck in deploying VC systems. The use of SA also aligns with growing concerns about voice privacy and deepfake mitigation, offering a technical pathway to anonymize voices while maintaining communicative utility. The open-source demo and clear methodology contribute to the broader ML community's understanding of streaming audio generation constraints.
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the \textit{Reference Shortcut}. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
Primary: Lightricks
All Institutions: Lightricks, Tel Aviv University
The paper introduces ScenA, a flow-matching framework for multi-speaker audio scene generation that overcomes the "Reference Shortcut" via high-noise-biased training, enabling robust speaker binding and rich ambient audio generation from minimal natural language and reference inputs.
The paper proposes ScenA, a novel framework for multi-speaker audio scene generation that conditions a pre-trained text-to-audio flow-matching model on multiple reference voices and free-form natural language prompts. The core methodological innovation lies in the "Reference Shortcut" diagnosis and its mitigation. The authors identify that standard flow-matching training schedules allow the model to bypass text-based speaker binding by relying on acoustic similarity between the noisy target and the clean reference latents. To counter this, they introduce a high-noise-biased timestep distribution (Beta+Uniform mixture) that forces the model to rely on the text prompt for identity assignment during the critical early denoising steps. The architecture is notably minimalist, using concatenated reference latents with lightweight identity-aware positional encodings, avoiding complex identity encoders or structured supervision tags. This approach leverages the inherent capabilities of large-scale in-the-wild audio foundation models to generate ambient textures, overlapping speech, and paralinguistic events jointly with dialogue, a significant departure from traditional speech-only TTS pipelines.
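A high-noise-biased timestep sampler of the Beta+Uniform kind can be sketched as below. The mixture weight, the Beta parameters, and the convention that t close to 1 means high noise are all assumptions made for illustration; the paper's actual values and time convention are not given in this summary.

```python
import random

def sample_timestep(rng, uniform_prob=0.2, alpha=4.0, beta=1.0):
    """Draw t in [0, 1] from a Beta+Uniform mixture skewed toward high noise.

    All default values are hypothetical. With alpha > beta the Beta component
    concentrates mass near t = 1 (assumed here to be the noisy end).
    """
    if rng.random() < uniform_prob:
        return rng.random()              # uniform component keeps all t covered
    return rng.betavariate(alpha, beta)  # Beta component favors noisy targets

rng = random.Random(0)
samples = [sample_timestep(rng) for _ in range(20000)]
mean_t = sum(samples) / len(samples)
high_noise_fraction = sum(t > 0.5 for t in samples) / len(samples)
```

Under these assumed settings most training steps see a heavily noised target, where acoustic matching against the clean references is unreliable and the text prompt is the remaining cue for speaker assignment.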
The evaluation is rigorous and well-designed, focusing on both speaker binding fidelity and audio quality. The authors utilize the CoVoMix2-Dialogue benchmark, creating specific subsets (CoVoMix2-Dialogue-20s and CoVoMix2-Dialogue-WildRef) to test performance on studio-clean versus in-the-wild references. They compare ScenA against state-of-the-art multi-speaker dialogue TTS systems (MOSS-TTS, VibeVoice, ZipVoice, Dia). Results show ScenA outperforms baselines on binding-aware metrics (cpWER, cpSIM, ACC), particularly demonstrating robustness when references are noisy or from the wild, where baselines fail significantly. The inclusion of a "Reference Shortcut Probe" provides strong empirical evidence for their hypothesis, showing that the model can identify speakers from noisy targets at low noise levels, validating the need for the high-noise training bias. Human preference tests further support the qualitative superiority of the generated scenes.
The paper provides substantial implementation details, including the backbone architecture (LTX-2 audio stream), training hyperparameters (AdamW, batch size, learning rate schedule), and the specific mathematical formulation of the timestep distribution. The dataset construction pipeline is described in detail, including the use of diarization and captioning models to create the training data. The code is not explicitly linked in the text provided (only a project page URL), but the methodological description is sufficient for reproduction by researchers familiar with flow-matching and diffusion transformers. The ablation studies on positional encodings and augmentation strategies add to the reproducibility and robustness of the claims.
The authors acknowledge several limitations. The generation duration is capped at 20 seconds due to the backbone's constraints, although they note this can be extended with modest fine-tuning. The number of supported speakers is limited to $K_{max}=3$ in the current configuration, constrained by the linear growth of the self-attention sequence with the number of references. The reliance on a pre-trained foundation model means the quality is bound by the underlying model's capabilities and potential biases in the in-the-wild training data. Additionally, the "Reference Shortcut" phenomenon, while solved for this specific setup, highlights a general fragility in reference-conditioned generation that may require similar careful schedule design in other modalities.
This work significantly advances the field of generative audio by demonstrating that complex, structured multi-speaker interactions can be generated using minimal, natural language conditioning on top of general-purpose audio models. This reduces the need for complex, brittle pipeline architectures in dialogue TTS. The ability to generate realistic, ambient-rich conversational audio has applications in virtual reality, gaming, and accessible media creation. However, the ease of cloning voices and generating realistic dialogue raises concerns about deepfakes and misinformation, necessitating responsible use guidelines and watermarking techniques, which are not discussed in the paper.
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.
Primary: Indian Institute of Technology Madras
All Institutions: AI4Bharat, Indian Institute of Technology Madras, Sarvam AI
IndicContextEval introduces a rigorous, multilingual benchmark that reveals critical gaps in how AudioLLMs utilize contextual information, demonstrating that while some models effectively leverage native-script entity biasing, others suffer from blind reliance or contextual blindness, thereby establishing a new standard for evaluating contextual grounding in speech recognition systems.
The paper proposes a novel evaluation framework, IndicContextEval, designed to probe the contextual grounding capabilities of Audio Large Language Models (AudioLLMs). The core methodological contribution is the design of a 7-level prompting taxonomy (L0-L6) that systematically varies the type and quality of textual context provided to the model (from no context to adversarial incorrect entities). This controlled experimental design allows for the isolation of specific contextual signals (metadata, natural language descriptions, entity lists) and the measurement of their impact on transcription accuracy (WER) and entity recognition (NEER). The approach is rigorous in its control of variables, aiming to distinguish between parametric memorization and genuine contextual utilization.
The authors evaluate five models (the AudioLLMs GPT-4o Transcribe, Gemini 3 Flash, Sarvam Audio, and Gemma-3N, plus a standalone IndicConformer baseline) on a newly collected dataset of 56 hours of natural speech across 8 Indian languages and 23 professional domains. The results are insightful, revealing significant disparities in how models handle context. For instance, GPT-4o Transcribe shows robust contextual reasoning, while Gemma-3N exhibits "blind reliance" on entity prompts, even when they are adversarial. The finding that natural-language descriptions often outperform structured metadata, and that native-script entity biasing yields the largest gains, provides concrete empirical evidence for the field. The use of NEER as a primary metric for entity biasing is well-chosen and adds depth to the standard WER evaluation.
The paper provides a clear description of the dataset creation process, including speaker demographics, recording styles (read vs. extempore), and quality control measures. The prompt taxonomy is explicitly defined, allowing other researchers to replicate the evaluation protocol. The code and benchmark data are made publicly available via GitHub, which significantly enhances reproducibility. The inclusion of specific model versions and the detailed breakdown of results by language and context level further support reproducibility efforts.
The dataset, while diverse, is limited to 8 Indian languages and 23 domains, which may not capture the full spectrum of global linguistic diversity or domain-specific challenges. The reliance on commercial models (GPT-4o, Gemini) limits the ability to fully inspect internal mechanisms, although the black-box evaluation is appropriate for the benchmark's goals. The "adversarial" prompts in L6 are limited to incorrect domain entities; more sophisticated adversarial attacks (e.g., semantically similar but incorrect entities) could provide deeper insights into model robustness. Additionally, the dataset size (56 hours) is relatively small compared to large-scale ASR benchmarks, which may limit the statistical power of some analyses, particularly for lower-resource languages within the set.
This work has significant implications for the development and deployment of multilingual AudioLLMs, particularly in low-resource and high-context domains like healthcare, legal, and technical support in India. By highlighting the risks of blind reliance on context or failure to utilize it, the benchmark encourages the development of more robust and interpretable models. It also underscores the importance of native-script support and the challenges of cross-lingual entity biasing. The public release of the benchmark will facilitate fairer comparisons and drive progress in contextual ASR for Indic languages.
Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.
Primary: Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, Inner Mongolia University, Tencent
FineCombo-TTS introduces a novel CFM-based framework for precise, text-guided, reference-grounded speech synthesis, addressing key limitations in existing joint-control methods through a unified acoustic representation and a new paired dataset for relative attribute control.
The paper proposes FineCombo-TTS, a unified framework for controllable Text-to-Speech (TTS) that integrates reference speech and text descriptions. The core methodological contribution is the introduction of a Conditional Flow Matching (CFM)-based Speech Variance Predictor. Unlike previous methods that often treat timbre and style as separate, loosely coupled modules, this approach learns a unified acoustic attribute representation. It uses a pre-trained FACodec timbre extractor and a residual style encoder to form a source embedding, which is then transformed into a target embedding guided by text descriptions via CFM. This allows for fine-grained, relative attribute control (e.g., "make it faster" relative to the reference) rather than absolute control. The use of CFM for this transformation task is a novel application in the TTS domain, leveraging the efficiency and stability of flow matching over diffusion or autoregressive methods for this specific conditioning task. The integration of Classifier-Free Guidance (CFG) for both text and description conditions further enhances control precision.
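For readers less familiar with flow matching, the sketch below shows the common linear-path training pair and the usual single-condition classifier-free guidance combination. It assumes a straight-line path between a start point x0 and a target embedding x1; whether FineCombo-TTS starts from noise or from the source embedding, and how it combines its two guidance conditions, is not specified here, so both functions are hypothetical simplifications.

```python
def cfm_training_pair(x0, x1, t):
    """Linear-path flow matching: point on the path and its regression target.

    x_t = (1 - t) * x0 + t * x1, target velocity = x1 - x0 (constant along the path).
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, target_velocity

def cfg_velocity(v_uncond, v_cond, guidance_scale):
    """Single-condition classifier-free guidance on predicted velocities."""
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]

x_t, v_target = cfm_training_pair([0.0, 0.0], [2.0, 4.0], t=0.25)
v_guided = cfg_velocity([1.0, 1.0], [2.0, 3.0], guidance_scale=2.0)
```

A network trained with an MSE loss against `v_target` learns the transformation direction; at inference, a guidance scale above 1 pushes the predicted velocity further toward the conditioned prediction.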
The experimental evaluation is comprehensive, covering prosody, emotion, and timbre control. The authors construct a new dataset, FineEdit, consisting of paired source-target triplets with explicit control descriptions, addressing a gap in existing datasets that lack relative control annotations. They compare FineCombo-TTS against a re-implemented baseline, VoxInstruct-Joint, which adapts an LLM-based TTS model for joint control. The results show significant improvements in Instruction Following (MOS-I) and Controlled Accuracy, particularly in prosody and emotion tasks. The model also maintains high speaker similarity (MOS-S, SECS), indicating effective timbre preservation. The ablation studies on CFG strategies and the residual style encoder provide additional validation of the design choices. The use of both subjective (MOS) and objective (WER, SECS, FPC, Emotion-A) metrics strengthens the evaluation.
The paper provides detailed descriptions of the architecture, including the FACodec timbre extractor, T5 text encoder, and the CFM UNet backbone. The training strategy is clearly outlined in two stages. The dataset construction process for FineEdit is described in detail, including the sources (LibriTTS-R, ESD) and the methods for generating paired data. The availability of the demo and dataset at the provided URL enhances reproducibility. However, the specific hyperparameters for the CFM training (e.g., number of steps, specific loss weights beyond MSE) could be more explicitly stated, though the general framework is clear.
The paper acknowledges that existing description-based datasets are limited, which motivated the creation of FineEdit. However, the reliance on synthetic or manually annotated relative pairs may introduce biases or limitations in the diversity of control expressions compared to large-scale unsupervised data. The model's performance on out-of-distribution text descriptions or highly complex, multi-attribute simultaneous controls is not extensively explored. Additionally, the computational cost of training the CFM predictor, while potentially lower than diffusion, is still a factor compared to simpler autoregressive models. The evaluation is primarily on English data, limiting generalizability to other languages without further testing.
FineCombo-TTS contributes to the field of controllable speech synthesis by providing a more flexible and precise method for generating speech with specific attributes guided by natural language. This has implications for accessible technology, creative content generation, and personalized voice assistants. The construction of the FineEdit dataset also provides a valuable resource for future research in relative attribute control. The work aligns with the broader trend of integrating multiple conditioning modalities in generative models to achieve finer control over outputs.
Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University
NeuralMUSIC presents a robust hybrid framework for robot sound source localization by combining neural covariance estimation with classical subspace methods, achieving superior accuracy and generalization across diverse acoustic environments. The integration of self-supervised learning for spatial correlation and adaptive frequency fusion addresses critical challenges in low-SNR and broadband scenarios, offering a significant advancement in reliable robotic audition systems.
The paper proposes NeuralMUSIC, a hybrid framework that integrates a neural network for spatial covariance matrix estimation into the classical Multiple Signal Classification (MUSIC) algorithm. The approach addresses key limitations of classical MUSIC (noise sensitivity, broadband processing) and pure deep learning methods (black-box nature, poor generalization). Key innovations include: 1) A neural encoder to predict the spatial covariance matrix, which is then used in the standard MUSIC eigen-decomposition pipeline. 2) A Frequency Attention Fusion (FAF) module to adaptively weight frequency bins for broadband DOA estimation. 3) A self-supervised Spatial Correlation Learning (SSCL) strategy using masked channel reconstruction to leverage unlabeled data. 4) An adaptive source-number prediction module. The methodology is well-grounded in signal processing theory and effectively bridges the gap between model-based and data-driven approaches. The integration of SSCL is a particularly strong design choice for robotic applications where labeled data is scarce.
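The classical stage that the predicted covariance feeds into can be sketched as follows. This is a narrowband MUSIC example for a uniform linear array with half-wavelength spacing, using an analytically constructed covariance in place of a neural estimate; the array geometry, the function names, and the use of NumPy are assumptions for illustration, and the FAF and SSCL components are not modeled.

```python
import numpy as np

def steering_vector(theta_rad, num_mics):
    """Far-field steering vector for a half-wavelength-spaced linear array (assumed geometry)."""
    m = np.arange(num_mics)
    return np.exp(-1j * np.pi * m * np.sin(theta_rad))

def music_spectrum(cov, num_sources, grid_rad):
    """MUSIC pseudo-spectrum 1 / ||E_n^H a(theta)||^2 over a grid of angles."""
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    noise_subspace = eigvecs[:, : cov.shape[0] - num_sources]
    spectrum = []
    for theta in grid_rad:
        a = steering_vector(theta, cov.shape[0])
        proj = noise_subspace.conj().T @ a
        spectrum.append(1.0 / max(np.real(np.vdot(proj, proj)), 1e-12))
    return np.array(spectrum)

# Synthetic single-source covariance: signal term plus white noise.
true_theta = np.deg2rad(20.0)
a_true = steering_vector(true_theta, 4)
cov = np.outer(a_true, a_true.conj()) + 0.01 * np.eye(4)

grid = np.deg2rad(np.arange(-90, 91))
estimated_deg = np.rad2deg(grid[np.argmax(music_spectrum(cov, 1, grid))])  # peaks near 20 degrees
```

In the hybrid setting described above, `cov` would be replaced by the network's estimate, which is where robustness to low SNR is expected to come from, while the eigen-decomposition and pseudo-spectrum steps stay interpretable.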
The authors conduct extensive experiments on four datasets: GSC (simulated), AV16.3 (real-world speaker), SLoClas (acoustic events), and AFPILD (pedestrian footsteps). They compare against classical methods (MUSIC, NormMUSIC, Beamforming, TOPS, FRIDA) and deep learning baselines (CRNN, Transformer, DOANet, DeepDAE, DeepMusic, DA-Music). Results show consistent improvements in Mean Absolute Angular Error (MAAE) across all datasets and configurations (single-source, multi-source, unknown source number). Ablation studies validate the contributions of FAF and SSCL. Additional experiments on data efficiency, SNR robustness, and cross-domain generalization (cross-room, cross-array) further demonstrate the method's robustness. The evaluation is comprehensive and convincing.
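Angular error needs to respect wraparound, since 350 and 10 degrees are 20 degrees apart rather than 340. The sketch below is one plausible way to compute a mean absolute angular error over azimuths in degrees; the paper's exact definition (angle range, multi-source matching) is not given in this summary, so treat the function as a hypothetical illustration.

```python
def mean_absolute_angular_error(pred_deg, true_deg):
    """Mean of the shortest angular distances between predictions and targets."""
    errors = []
    for p, t in zip(pred_deg, true_deg):
        diff = abs(p - t) % 360.0
        errors.append(min(diff, 360.0 - diff))  # take the shorter way around the circle
    return sum(errors) / len(errors)

maae = mean_absolute_angular_error([350.0, 10.0], [10.0, 350.0])
```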
The paper provides detailed descriptions of the network architecture, loss functions, and experimental settings (STFT parameters, optimizer, hyperparameters). The code is made available on GitHub. The use of standard datasets (GSC, AV16.3, SLoClas, AFPILD) facilitates reproduction. The description of the SSCL masking strategies and the hybrid pipeline is clear.
The method relies on a neural network to estimate the covariance matrix, which may still suffer from distribution shifts if the acoustic environment or array geometry differs significantly from the training data, although the paper shows some resilience. On AFPILD, the error remains relatively high (10.24 degrees) even though it is the lowest among the compared methods, indicating challenges with footstep sounds. The cross-array generalization, while better than pure DL methods, is not perfect and degrades with large geometry mismatches. The paper acknowledges these in the limitations section.
This work contributes to the field of robot audition and spatial audio processing. By providing a robust, data-efficient, and interpretable solution for sound source localization, it enables more reliable autonomous robots in dynamic environments. The hybrid approach offers a template for integrating physical priors with deep learning in other signal processing tasks. The self-supervised learning strategy is broadly applicable to other domains with limited labeled data.
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization masks to extract target-speaker representations, keeping the decoder frozen. We instantiate this as Dixtral, integrating a Diarization Conditioned Whisper (DiCoW) encoder into the Voxtral SLM. On AMI, NOTSOFAR-1, LibriSpeechMix, and Mixer6, Dixtral outperforms Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2 on speaker-attributed transcription by 29.0%, 19.8%, and 16.0% absolute cpWER respectively. On a novel long-form multi-speaker QA benchmark, zero-shot Dixtral matches Gemini on far-field content understanding, and when fine-tuned surpasses both Gemini and Voxtral operating on close-talk across all tasks.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Carnegie Mellon University
The paper presents a compelling and effective method for grounding spoken LLMs in multi-speaker audio through encoder-side diarization conditioning, achieving state-of-the-art performance on transcription and novel capabilities in multi-speaker reasoning and QA.
The paper proposes a novel architectural strategy for extending Spoken Large Language Models (SLMs) to multi-speaker scenarios by conditioning the acoustic encoder on diarization masks, rather than adapting the decoder. This approach, instantiated as Dixtral, integrates a Diarization Conditioned Whisper (DiCoW) encoder with a frozen Voxtral decoder. The core innovation lies in the "Diarization Conditioning" mechanism, which uses frame-level speaker activity probabilities (STNO masks) to modulate internal representations via learnable affine transformations (FDDT). This allows the model to extract target-speaker representations while keeping the LLM decoder frozen, thereby avoiding catastrophic forgetting of reasoning capabilities associated with Serialized Output Training (SOT) and vocabulary expansion. The methodology is theoretically sound, offering a computationally efficient alternative ($O(S N^2)$ vs $O((SN)^2)$) for multi-speaker decoding.
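To make the complexity comparison concrete, the sketch below counts pairwise attention interactions for the two decoding strategies. It assumes that S is the number of target speakers and N is the per-speaker sequence length, which is our reading of the notation. The function names and the example sizes are illustrative and do not come from the paper.

```python
def per_speaker_cost(num_speakers: int, seq_len: int) -> int:
    """Cost when each speaker is decoded in its own pass: S * N^2."""
    return num_speakers * seq_len ** 2


def joint_cost(num_speakers: int, seq_len: int) -> int:
    """Cost when all speakers share one serialized sequence: (S * N)^2."""
    return (num_speakers * seq_len) ** 2


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show how the gap grows with S.
    for s in (2, 4, 8):
        ratio = joint_cost(s, 1000) / per_speaker_cost(s, 1000)
        print(f"S={s}: joint / per-speaker = {ratio:.0f}x")
```

Under these assumptions the ratio between the two costs is exactly S, so the saving grows linearly with the number of speakers.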
The evaluation is comprehensive, covering four standard multi-speaker ASR datasets (AMI, NOTSOFAR-1, LibriSpeechMix, Mixer6) and a novel long-form QA/Summarization benchmark (NSF-QA). Dixtral demonstrates significant improvements over strong baselines, including Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2, with absolute cpWER reductions of 16-29%. The inclusion of a paralinguistic QA task (emotion/gender) is particularly strong, as it tests the model's ability to utilize audio features beyond text, which cascaded systems cannot do. The results are robust, showing that zero-shot Dixtral matches Gemini on content QA and surpasses it when fine-tuned. The out-of-domain performance on Mixer6 further validates generalization.
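For readers unfamiliar with cpWER (concatenated minimum-permutation word error rate), the following is a minimal sketch of the metric, not the evaluation code used in the paper. It assumes each speaker's utterances have already been concatenated into one string, and it searches all speaker permutations by brute force, which is only practical for a handful of speakers. The speaker labels and sentences in the usage note are made up.

```python
from itertools import permutations


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cpwer(ref_by_speaker: dict[str, str], hyp_by_speaker: dict[str, str]) -> float:
    """Lowest total word error over all reference-to-hypothesis speaker mappings."""
    refs = [text.split() for text in ref_by_speaker.values()]
    hyps = [text.split() for text in hyp_by_speaker.values()]
    # Pad the shorter side with empty streams so every speaker gets a partner.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")
    best_errors = min(
        sum(edit_distance(r, hyps[k]) for r, k in zip(refs, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_words
```

With the reference `{"A": "hello there", "B": "good morning all"}` and the hypothesis `{"spk1": "good morning", "spk2": "hello there"}`, the best mapping pairs A with spk2 and B with spk1, leaving one deleted word out of five reference words. A practical implementation would replace the factorial search with an assignment solver.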
The authors provide open-source code and a new dataset (NSF-QA). Training details are well-specified, including hardware constraints (8x A5000), optimization settings, and data chunking strategies. The use of established backbones (Whisper, Ministral, DiariZen) and clear integration points (FDDT, MLP adapter) ensures high reproducibility. The release of the benchmark dataset is a significant contribution to reproducibility in this niche.
The performance is inherently dependent on the quality of the external diarization system (DiariZen). Errors in diarization will propagate directly to the transcription and reasoning tasks. The paper acknowledges this but does not extensively analyze the sensitivity to diarization errors. Additionally, the current implementation requires separate inference passes for each target speaker, which, while more efficient than joint decoding, still scales linearly with the number of speakers. The fine-tuning for QA/Summarization slightly degrades pure ASR performance, indicating a trade-off that requires careful multi-task optimization in future work.
This work significantly advances the field of spoken language understanding by enabling end-to-end, multi-speaker reasoning in far-field audio. It bridges the gap between modular ASR pipelines and unified SLMs, offering a path towards more robust and capable voice assistants and meeting transcription tools. The ability to handle paralinguistic information (emotion, gender) in a multi-speaker context opens new avenues for affective computing and human-computer interaction.
Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional multi-step flow-matching decoders offer superior quality but suffer from high inference latency due to iterative sampling, creating a severe quality-speed trade-off. In this paper, we propose a novel Token2Wav architecture that overcomes this limitation by applying MeanFlow in a highly compressed latent space. By modeling the average velocity rather than the instantaneous velocity field, MeanFlow enables true one-step generation. Operating in the latent domain mitigates the memory and stability issues of waveform-level flows, yielding up to a 17$\times$ improvement in Real-Time Factor (RTF) compared to multi-step baselines with negligible quality degradation. Furthermore, we introduce refinement strategies that mitigate latent mismatch, including decoder-only fine-tuning with the MeanFlow generator frozen and end-to-end joint fine-tuning, improving fidelity without increasing inference-time cost. Code and demo are publicly available.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, LIGHTSPEED, Tencent, The Hong Kong University of Science and Technology
This paper presents a significant advancement in low-latency neural audio synthesis by successfully adapting MeanFlow to a latent space for one-step Token-to-Waveform generation, achieving a substantial reduction in inference latency while maintaining high perceptual quality through innovative refinement strategies.
The paper proposes a novel architecture for Token-to-Waveform (Token2Wav) generation by combining MeanFlow (a one-step flow matching variant) with a latent space representation. The core innovation lies in applying MeanFlow in a compressed latent space rather than waveform space, which addresses the stability and memory issues typically associated with one-step generation at high resolutions. The methodology includes a lightweight VAE for latent encoding/decoding and a DiT-1D model for the one-step latent generation. Crucially, the authors address the "latent mismatch" problem—where generated latents deviate from the VAE's training distribution—through two refinement strategies: decoder-only fine-tuning and end-to-end joint fine-tuning. This is a technically sound approach that effectively bridges the gap between the efficiency of autoregressive/LLM-based token generation and the quality of continuous flow-based vocoders. The use of MeanFlow to eliminate iterative sampling is a significant methodological contribution to low-latency audio synthesis.
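The difference between integrating an instantaneous velocity over many steps and applying an average velocity once can be illustrated with a toy one-dimensional flow. The sketch below is not the paper's model: the velocity field `A * z`, the constant `A`, and all function names are hypothetical, and the average velocity is written in closed form where the actual system would use a learned network operating on VAE latents.

```python
import math

A = 0.7  # hypothetical rate constant of the toy velocity field


def instantaneous_velocity(z: float, t: float) -> float:
    """Toy instantaneous velocity v(z, t) = A * z (independent of t here)."""
    return A * z


def average_velocity(z_t: float, r: float, t: float) -> float:
    """Average velocity over [r, t], defined as (z_t - z_r) / (t - r).

    For the toy field, z_r = z_t * exp(-A * (t - r)), which gives a closed form.
    """
    return z_t * (1.0 - math.exp(-A * (t - r))) / (t - r)


def multi_step_sample(z1: float, num_steps: int) -> tuple[float, int]:
    """Euler integration from t=1 down to t=0; returns (sample, evaluations)."""
    z, dt = z1, 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        z -= dt * instantaneous_velocity(z, t)
    return z, num_steps


def one_step_sample(z1: float) -> tuple[float, int]:
    """Single update z_0 = z_1 - (1 - 0) * u(z_1, 0, 1); one evaluation."""
    return z1 - average_velocity(z1, 0.0, 1.0), 1
```

In this toy setting the one-step sample matches the exact solution `z1 * exp(-A)` with a single function evaluation, while ten Euler steps still carry a discretization error. The reported real-time-factor gain comes from the same reduction in the number of network evaluations.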
The experimental setup is rigorous, utilizing LibriTTS for training and LibriSpeech test-clean for evaluation. The authors compare their method against CosyVoice2's Token2Wav module, a strong multi-step flow baseline. They report Real-Time Factor (RTF), Word Error Rate (WER), Speaker Similarity (SpkSim), and perceptual metrics (UTMOS, MOS). The results demonstrate a substantial speedup (17x RTF improvement) with negligible degradation in quality metrics. The ablation studies are particularly valuable, analyzing the impact of latent dimensionality, model capacity, and the specific refinement strategies. The finding that larger models do not necessarily perform better in one-step generation is an insightful empirical observation. The inclusion of both objective and subjective metrics strengthens the evaluation.
The paper provides clear descriptions of the model architectures, loss functions, and training procedures. The code and demo are publicly available on GitHub, which significantly enhances reproducibility. The authors specify the tokenization scheme (CosyVoice2 tokenizer), speaker embedding extractor (CAM++), and evaluation protocols. The detailed breakdown of RTF and the specific hyperparameters for the VAE and DiT models allow for faithful reproduction of the results.
The primary limitation is the reliance on a pre-trained VAE, which introduces a fixed bottleneck in terms of information loss during encoding. While the refinement strategies mitigate this, they do not eliminate the fundamental trade-off between compression ratio and fidelity. Additionally, the performance is evaluated primarily on LibriSpeech, which consists of read speech; performance on conversational or noisy speech is not reported. The "one-step" nature also implies that the model cannot easily refine outputs iteratively, which might be a limitation for applications requiring high-fidelity correction of errors. The paper notes that larger models did not improve performance, suggesting that the one-step constraint is difficult to scale without further architectural or training innovations.
This work has significant implications for the deployment of LLM-based TTS systems, particularly in latency-sensitive applications like real-time voice assistants, gaming, and on-device AI. By enabling high-quality, one-step audio generation, it reduces computational costs and energy consumption, making advanced speech synthesis more accessible. The approach could also be extended to other modalities where low-latency generation is critical. However, as with all audio generation technologies, there are potential misuse cases regarding voice cloning and deepfakes, necessitating responsible deployment practices.
As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points to the paradigm shift needed for next-generation AI-native social platforms.
Primary: Catnip AI Team
All Institutions: Catnip AI Team
MaineCoon presents a comprehensive engineering solution for real-time audio-visual autoregressive generation, effectively combining self-resampling, representation alignment, and reinforced policy distillation to achieve stable, long-horizon social video synthesis at high frame rates, marking a significant step toward practical AI-native social platforms.
The paper proposes MaineCoon, a 22B-parameter autoregressive audio-visual model designed for real-time social interaction. The core methodological contributions lie in the training pipeline and inference framework rather than a novel architectural backbone. The authors introduce "self-resampling" to mitigate the train-test discrepancy in autoregressive generation by exposing the model to its own degraded histories during training. They also employ representation alignment (REPA) using a frozen V-JEPA teacher to accelerate convergence and improve semantic coherence. A significant portion of the technical depth is dedicated to post-training strategies: Domain-Aware Preference Optimization (DPO) to specialize the model in difficult social domains (e.g., wide shots, multi-person dialogue) and Reinforced Online-Policy Distillation (ROPD) to consolidate these specialized experts into a single streaming policy. The inference framework is agentic, utilizing a planner/observer LLM to manage prompt continuity and a cache manager to mitigate long-horizon drift. While the integration of these components is sophisticated, the individual techniques (self-resampling, REPA, DPO, ROPD) are adaptations of existing methods to the specific constraints of real-time audio-visual autoregressive generation. The novelty is moderate, lying primarily in the specific application and engineering orchestration rather than fundamental algorithmic breakthroughs.
The experimental evaluation focuses on demonstrating the feasibility of real-time generation (47.5 FPS on a single H100) and long-horizon stability. The authors introduce "SocialVideo Bench," a new benchmark tailored for social video content, evaluating visual quality, motion, audio quality, and audio-visual alignment. They compare MaineCoon against 7 representative open audio-visual generation models, claiming SOTA performance in terms of speed and quality metrics. The evaluation includes ablation studies on the training components (self-resampling, alignment, ROPD). However, the paper lacks extensive human preference studies (e.g., MOS comparisons) which are critical for social video quality assessment. The reliance on automated metrics (CLIP, SyncNet, etc.) is standard but less convincing for subjective qualities like "social harmony" or "emotional resonance." The claim of "record-breaking" performance is supported by the FPS metric, but the quality comparisons are somewhat qualitative or based on limited automated scores. The long-horizon evaluation (thousand-second scale) is a strong point, demonstrating the efficacy of the agentic cache management.
The paper provides detailed descriptions of the data pipeline, including filtering criteria, domain balancing, and the synthetic data generation process. The training infrastructure details (FSDP2, sequence parallelism, mixed precision) are well-specified. The mathematical formulations for self-resampling, REPA, and ROPD are clear. However, the code and model weights are not mentioned as being publicly available (URLs are "none"), which hinders immediate reproducibility. The reliance on proprietary teacher models (LTX-2.3, Gemini-3.1-flash) for data generation and evaluation adds a layer of dependency that might make exact replication difficult for independent researchers.
The paper acknowledges that the model is a "first step" and focuses on the generative core. Limitations include the potential for residual drift in very long generations despite the agentic repair mechanisms, the high computational cost of the pre-training and post-training phases (though claimed to be efficient relative to scale), and the dependency on high-quality social video data which may contain biases. The model is specialized for "social" videos (talking heads, single speakers), limiting its generalizability to cinematic or complex physical interaction scenarios. The "agentic" nature of the inference introduces latency variability depending on the LLM planner's speed, although the paper claims sub-second interaction.
This work has significant implications for the development of AI-native social platforms, virtual influencers, and interactive entertainment. By enabling real-time, long-horizon audio-visual generation, it lowers the barrier for creating interactive AI agents. However, it also raises concerns about deepfakes, misinformation, and the ethical use of AI-generated social interactions. The ability to generate realistic human speech and video in real-time could be misused for impersonation or harassment. The authors' focus on "social world models" points to a future where AI is an active participant in social dynamics, which has profound philosophical and societal implications.
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level BERT into a phoneme-level BERT. This design provides prior word- and sentence-level information, reducing the amount of pre-training required by the phoneme encoder. Subjective listening evaluations show that CraBERT achieves MOS values comparable to existing PPEncs after approximately one epoch of pre-training, whereas the baselines in our comparison are pre-trained for approximately ten epochs. These results demonstrate that CraBERT can efficiently learn representations suitable for improving the perceived naturalness and prosody of synthesized speech.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
This paper introduces CraBERT, an efficient phoneme encoder that significantly reduces pre-training time while maintaining high-quality speech synthesis. The innovative integration of subword representations and the development of a new alignment algorithm mark a notable advancement in the field of text-to-speech technologies.
The methodology presented in this paper is innovative in its use of a cascade-fusion architecture that integrates subword representations from a pre-trained BERT model into a phoneme-level BERT. This approach addresses the inefficiencies of traditional phoneme encoders by leveraging existing word- and sentence-level information, significantly reducing the pre-training time required for effective phoneme representation. The introduction of a data-driven subword-phoneme alignment algorithm based on dynamic time warping (DTW) further enhances the methodology, providing a systematic way to fuse these representations.
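As background on the alignment step, the sketch below shows generic dynamic time warping over a cost matrix. It is not the paper's algorithm: the cost values, the example word, and the function name `dtw_align` are all hypothetical, and the review does not say how the paper derives its costs from data.

```python
def dtw_align(cost: list[list[float]]) -> list[tuple[int, int]]:
    """Return the minimum-cost monotonic path through a cost matrix.

    cost[i][j] is a hypothetical mismatch score between subword i and phoneme j.
    The path is a list of (subword_index, phoneme_index) pairs.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] holds the best accumulated cost of aligning the first i
    # subwords with the first j phonemes.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = {
            (i - 1, j - 1): acc[i - 1][j - 1],
            (i - 1, j): acc[i - 1][j],
            (i, j - 1): acc[i][j - 1],
        }
        i, j = min(steps, key=steps.get)
    return path[::-1]


if __name__ == "__main__":
    # Hypothetical costs for subwords ["play", "##ing"] against phonemes
    # ["P", "L", "EY", "IH", "NG"]: 0 means compatible, 1 means incompatible.
    toy_cost = [[0, 0, 0, 1, 1],
                [1, 1, 1, 0, 0]]
    print(dtw_align(toy_cost))
```

Each phoneme ends up paired with one subword, which is the kind of mapping needed to pass subword-level representations down to phoneme positions.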
The experimental evaluation is robust, employing subjective listening tests to assess the quality of synthesized speech using mean opinion scores (MOS). The results indicate that CraBERT achieves comparable performance to existing phoneme encoders after a fraction of the pre-training time, demonstrating its efficiency. The use of a multi-speaker dataset from the LibriTTS corpus adds credibility to the findings, although more diverse datasets could strengthen the generalizability of the results.
The paper provides detailed descriptions of the architecture, pre-training processes, and experimental setups, which are essential for reproducibility. However, the lack of publicly available code or a project repository limits the ease with which other researchers can replicate the results. The authors should consider releasing their implementation to enhance reproducibility.
One limitation is the reliance on a single pre-trained model (DistilBERT) for subword representations, which may not generalize across all languages or phonetic systems. Additionally, while the subjective evaluations show promising results, they are limited to a specific dataset and may not reflect performance across different languages or dialects. The paper also does not explore the potential for further optimization of the alignment algorithm.
The implications of this research are significant for the field of text-to-speech synthesis, particularly in improving the efficiency of phoneme encoders. The advancements in pre-training methodologies could lead to more accessible and faster TTS systems, which can be beneficial in various applications, including virtual assistants, audiobooks, and language learning tools. The approach could also inspire further research into efficient representation learning in other domains.
Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.
Primary: KTH Royal Institute of Technology
All Institutions: KTH Royal Institute of Technology
The main contribution of this paper is the introduction of MuVAP, a causal multimodal framework that effectively predicts turn-taking in multiparty conversations using a single audio stream and a single camera view. This innovative approach, combined with the creation of the AVCC dataset, addresses critical limitations in existing methods and has the potential to advance the field of conversational AI significantly.
The paper presents MuVAP, a novel multimodal framework that integrates audio and visual data to predict turn-taking in multiparty conversations. The methodology is well-structured, introducing Role-Relative Projection to simplify the complexity of multiparty interactions by focusing on the current and next speaker. The use of a single audio channel and a single camera view is a significant departure from traditional methods, which often require complex setups. The introduction of the Audio-Visual Conversation Corpus (AVCC) is a crucial contribution, as it provides a dataset specifically designed for this type of analysis, addressing the limitations of existing datasets.
The experiments are comprehensive, comparing MuVAP against strong baselines across various tasks, including Shift-Hold and Next Speaker Prediction. The results demonstrate that MuVAP outperforms these baselines, showcasing its effectiveness in real-world scenarios. The evaluation metrics used, such as Macro-F1 for Shift-Hold Prediction and accuracy for Next Speaker Prediction, are appropriate for the tasks at hand. The paper provides detailed results and analysis, indicating a thorough evaluation process.
The paper includes sufficient implementation details, including the architecture of the model, training procedures, and the datasets used. However, the lack of a public demo or clear access to the trained models may hinder full reproducibility. The GitHub repository provides some resources, but additional documentation would enhance reproducibility.
The paper acknowledges limitations, such as the class imbalance introduced by the Role-Relative Projection and the reliance on visual tracking that may miss subtle facial cues. Additionally, the model's performance is evaluated primarily on two- and three-speaker settings, which may not fully represent larger group dynamics. The authors also note the potential for improved performance with more advanced visual encoders.
The implications of this research are significant for human-robot interaction and conversational AI, as it enables more natural and responsive interactions in multiparty settings. The ability to predict turn-taking dynamics using standard hardware makes this approach accessible for various applications, including social robotics, virtual assistants, and interactive media.
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
Primary: Carnegie Mellon University Africa
All Institutions: Carnegie Mellon University Africa
The main contribution of this paper is the identification of training configuration as the primary cause of quality degradation in neural audio codecs at low frame rates, challenging previous assumptions about the inherent limitations of such systems. This work offers a novel perspective on codec design, emphasizing the importance of training methodologies in achieving efficient and intelligible audio synthesis.
The authors employ a controlled ablation study to investigate the effects of low frame rates on neural audio codecs, specifically focusing on the training configurations that lead to performance degradation. They systematically analyze potential causes for the observed quality cliff, such as phonemic collisions and codebook saturation, and identify a training misconfiguration as the primary issue. This approach is methodologically sound, as it combines theoretical analysis with empirical validation, allowing for clear conclusions about the limitations of current training practices.
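The training misconfiguration comes down to simple arithmetic: with a fixed clip duration, the number of tokens per training clip shrinks in proportion to the frame rate. The sketch below illustrates this with a hypothetical 2-second clip. The review does not state the clip length used in the paper, and the two lowest rates are written as exact halvings (3.125 Hz and 1.5625 Hz) of the rounded values quoted above.

```python
def tokens_per_clip(clip_seconds: float, frame_rate_hz: float) -> float:
    """Number of codec frames (tokens per codebook) in one training clip."""
    return clip_seconds * frame_rate_hz


def clip_seconds_for(target_tokens: float, frame_rate_hz: float) -> float:
    """Clip duration needed to keep the token count at a target value."""
    return target_tokens / frame_rate_hz


if __name__ == "__main__":
    clip_seconds = 2.0  # hypothetical fixed training clip length
    baseline_tokens = tokens_per_clip(clip_seconds, 12.5)
    for rate in (12.5, 6.25, 3.125, 1.5625):
        tokens = tokens_per_clip(clip_seconds, rate)
        needed = clip_seconds_for(baseline_tokens, rate)
        print(f"{rate:>7} Hz: {tokens:5.2f} tokens per clip; "
              f"{needed:4.1f} s needed for {baseline_tokens:.0f} tokens")
```

Under this assumption, halving the frame rate halves the context the decoder sees per clip, so the clip duration has to double to keep the same number of tokens.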
The experiments are well-structured, utilizing a range of frame rates and comparing the performance of various codecs on established benchmarks such as WER, STOI, and SPK-SIM. The use of a comprehensive dataset (LibriSpeech) and the evaluation of multiple metrics provide a robust assessment of codec performance across different configurations. The results clearly illustrate the impact of training configuration on codec intelligibility at low frame rates, thus contributing valuable insights into the design of future codecs.
The paper provides sufficient details regarding the training process, model architectures, and evaluation metrics, which would allow other researchers to replicate the experiments. However, the lack of publicly available code or models limits the ease of reproducibility. Including a project URL with the code would enhance this aspect significantly.
One limitation is the focus on a specific dataset (LibriSpeech), which may not generalize to all audio synthesis tasks or languages. Additionally, while the authors identify training configuration as a key factor, they do not explore other potential architectural modifications that could further improve performance at low frame rates. The paper also lacks a discussion on the computational costs associated with training and inference at these low frame rates.
The findings have significant implications for the design of neural audio codecs, particularly in applications where inference efficiency is critical, such as real-time speech synthesis and low-latency communication systems. By demonstrating that low frame rates can be utilized effectively with appropriate training strategies, this work paves the way for more efficient audio processing technologies in various domains.
Codecfakes (CFs) are a type of speech deepfake generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, causing detectors trained on vocoder data to generalize poorly to CF detection. Although this has led to the development of CF detection benchmarks, existing resources are largely confined to English -- and to a limited extent Chinese -- leaving South-East Asian (SEA) languages unexplored. To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages, diverse speaker profiles, and a wide range of NAC architectures. SEA-CF is constructed by synthesizing publicly available real speech corpora. Our experiments show that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to SEA speech due to language-specific phonetic structures, tonal variations, and rich prosodic diversity. We further conduct a comprehensive zero-shot and fine-tuned evaluation of recent SOTA ALMs on SEA-CF. Fine-tuning the ALMs improves performance; however, their scale makes them impractical for real-world application, particularly in low-resource and latency-constrained settings. To address this limitation, we propose a novel Small-ALM, GARUDA, tailored for CF detection, which delivers strong performance while remaining lightweight. Extensive evaluations demonstrate that the proposed Small-ALM outperforms strong end-to-end and ALM-based baselines, establishing a new, practical direction for robust CF detection in SEA languages and beyond.
Primary: IIIT-Delhi
All Institutions: IIIT-Delhi, UPES, VBSPU
This paper introduces SEA-CF, the first large-scale benchmark for CF detection in SEA languages, and proposes GARUDA, a lightweight Small-ALM that outperforms existing models while addressing practical deployment challenges. The technical contributions are significant, with a strong focus on methodology and experimental validation, positioning this work as a valuable asset in the field of audio deepfake detection.
The methodology presented in the paper is robust, introducing the SEA-CF benchmark for CF detection in SEA languages, which is a significant advancement given the lack of resources in this area. The authors propose GARUDA, a lightweight Small-ALM that effectively combines dual-encoder architectures to capture semantic and prosodic features, which is innovative. The use of JS divergence as a loss function for aligning representations is a novel approach that enhances the model's performance. Overall, the methodology is well-structured and addresses practical deployment challenges.
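For reference, Jensen-Shannon (JS) divergence between two discrete distributions can be computed as in the sketch below. This is a generic illustration only: the review does not say which representations GARUDA compares or how they are normalized, so the softmax step and the example vectors are assumptions.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: list[float], q: list[float]) -> float:
    """Symmetric JS divergence: the mean KL of p and q to their mixture."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


if __name__ == "__main__":
    # Hypothetical outputs of two encoders for the same utterance.
    semantic = softmax([2.0, 0.5, -1.0])
    prosodic = softmax([1.5, 0.7, -0.5])
    print(f"JS divergence: {js_divergence(semantic, prosodic):.4f}")
```

Unlike KL divergence, JS divergence is symmetric and bounded above by ln 2, which makes it a convenient alignment penalty when neither encoder should be treated as the reference.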
The experimental evaluation is comprehensive, utilizing both zero-shot and fine-tuned settings to assess the performance of GARUDA against existing SOTA models. The results demonstrate significant improvements over baselines, with rigorous statistical testing (McNemar’s test) validating the findings. The paper effectively highlights the necessity of in-domain training and the limitations of existing models when applied to SEA languages, underscoring the importance of the proposed SEA-CF benchmark.
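For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) two-sided variant applied to two detectors evaluated on the same utterances. The review does not say which variant the authors used, and the helper names `discordant_counts` and `mcnemar_exact` are hypothetical.

```python
from math import comb


def discordant_counts(labels, pred_a, pred_b):
    """Count utterances where exactly one of the two systems is correct."""
    only_a = sum(1 for y, a, b in zip(labels, pred_a, pred_b) if a == y and b != y)
    only_b = sum(1 for y, a, b in zip(labels, pred_a, pred_b) if a != y and b == y)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree, so there is no evidence of a difference
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant pair favours either system with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example: 1 = spoofed, 0 = bona fide.
labels = [1, 1, 0, 0, 1, 0]
system_a = [1, 1, 0, 0, 1, 1]
system_b = [1, 0, 0, 1, 0, 1]
p_value = mcnemar_exact(*discordant_counts(labels, system_a, system_b))
```

Only the utterances on which the two systems disagree enter the statistic, which is why the test suits paired comparisons on a shared evaluation set.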
The paper provides sufficient details regarding the dataset construction, model architecture, and training procedures, which enhances reproducibility. The authors mention the use of publicly available datasets and provide a project URL for accessing the SEA-CF benchmark, which is crucial for other researchers looking to replicate or build upon this work.
While the paper makes significant contributions, it acknowledges limitations such as the incomplete coverage of all SEA languages and the current restriction of evaluations to available benchmarks. Future work is needed to expand the dataset and improve generalization across diverse generators.
The work has substantial implications for enhancing security against audio deepfakes in low-resource language contexts, addressing a critical gap in the current landscape of speech technology. By focusing on SEA languages, the research promotes inclusivity and provides tools that can be vital for protecting vulnerable communities against audio fraud.
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), rendering the watermark vulnerable to generative attacks (e.g., neural codecs and vocoders). To address this, we propose DuraMark, a robust information-level watermarking framework that embeds the watermark by editing syllable durations. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model to edit syllable durations during synthesis, coupled with a duration extractor that recovers these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at https://muzw.github.io/duramark_demo/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Institute of Forensic Science, Ministry of Public Security, The Hong Kong Polytechnic University
The main contribution of this paper is the introduction of DuraMark, a novel generative watermarking framework that embeds watermarks into synthesized speech by editing syllable durations, significantly improving robustness against generative attacks while preserving speech quality. This work represents a meaningful advancement in the field of audio processing and watermarking, addressing critical concerns related to deepfake technologies and the integrity of synthesized speech.
The proposed DuraMark framework introduces a novel approach to watermarking in LLM-based TTS systems by embedding watermarks at the information level through syllable duration editing. This method is innovative as it leverages a duration-controllable TTS model and a duration extractor, which allows for precise control over the watermarking process while maintaining the naturalness of the synthesized speech. The integration of these components is well-structured, and the methodology is clearly articulated, allowing for a thorough understanding of the process.
The experiments conducted are robust, utilizing a substantial dataset and comparing DuraMark against established signal-level watermarking methods. The evaluation metrics include True Positive Rate (TPR) under various attack scenarios, which is a relevant measure of robustness. The results demonstrate DuraMark's superior performance, particularly against generative attacks, which is a critical aspect of the paper's claims. The use of both objective and subjective metrics to assess speech naturalness further strengthens the experimental evaluation.
The paper provides sufficient detail regarding the experimental setup, including the datasets used and the training parameters. However, the absence of a public code repository limits reproducibility. While the methodology is clearly described, access to the code would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a specific language (Mandarin Chinese) for the experiments, which may affect the generalizability of the findings to other languages or dialects. Additionally, while the paper demonstrates robustness against various attacks, it does not explore the performance of DuraMark under more extreme or novel attack scenarios that may arise in real-world applications.
The implications of this research are significant, particularly in the context of combating deepfake technologies and ensuring the integrity of synthesized speech. The DuraMark framework could be applied in various fields, including media, security, and digital forensics, where the authenticity of audio content is crucial. The potential for this technology to enhance trust in AI-generated content is noteworthy.
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient style learning and thus limiting speaker similarity in synthesized speech. To address this, we investigate prosody learning conditioned on the synthesized speech and propose predicting the prosody of the current syllable from previously predicted speech. Experimental results on three datasets demonstrate the efficacy of the proposed dynamic prosody prediction method in enhancing prosody learning capability, thereby improving the speaker similarity of the generated speech. Audio samples are available at https://muzw.github.io/dynapros/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFLYTEK
The main contribution of this paper is the introduction of a dynamic prosody prediction method that enhances speaker similarity in personalized TTS systems. This innovative approach, supported by comprehensive experimental validation, addresses key limitations in existing TTS technologies and has the potential to significantly impact the field of speech synthesis.
The proposed dynamic prosody prediction method represents a significant advancement in TTS technology by allowing for syllable-level prosody prediction based on previously generated speech. This approach addresses the limitations of existing methods that typically rely on static prosody modeling. The integration of prosody prediction into the speech generation process is well-justified and demonstrates a clear understanding of the challenges in personalized TTS systems. The methodology is sound, with a clear architecture and loss function defined, although the paper could benefit from more detailed explanations of the equations presented.
The experiments are comprehensive, utilizing three diverse datasets that cover a range of emotional and stylistic variations. The results are presented clearly, showing improvements in speaker similarity and prosody modeling capabilities. The use of both objective metrics (e.g., CER, emotion similarity) and subjective evaluations (e.g., MOS, preference tests) adds robustness to the findings. However, the paper could enhance its credibility by providing more detailed statistical analyses of the results, such as confidence intervals or significance testing.
The paper provides sufficient details regarding the experimental setup, including the datasets used, model architectures, and training procedures. The availability of the CosyVoice implementation and audio samples supports reproducibility. However, the lack of specific hyperparameter settings and training configurations for the proposed model could hinder complete reproducibility.
One limitation of the study is its focus on Mandarin Chinese, which may restrict the applicability of the findings to other languages or dialects. Additionally, while the proposed method shows promise in improving speaker similarity, the paper does not address potential challenges in real-world applications, such as the computational efficiency of the model during inference.
The proposed method has significant implications for the development of personalized TTS systems, particularly in applications such as virtual assistants, audiobooks, and entertainment. By improving speaker similarity, the approach could enhance user experience and engagement in various audio-related applications. Furthermore, the findings may inspire further research into dynamic prosody modeling in other languages and contexts.
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme-guided cross-attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior $P(\text{spoofed}\mid X, W)$, conditioned on the acoustic representation $X$ and the phonetic posteriorgram $W$. The resulting factorization can be written as $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, where $M$ denotes the number of phonetic classes, $P(\text{spoofed} \mid X, Z = z_i)$ is the spoofing probability for the $i$-th phonetic class $z_i$ conditioned on $X$, and each $w_i$ is the prevalence of phonetic class $z_i$ in the utterance. Our transformer-based architecture instantiates this through a cross-attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax-normalized pooling supplying explicit phone-presence weights. Unlike prior approaches that rely heavily on post-hoc explainability methods, our framework offers phonetic-explainability-by-design. We evaluate the framework on an LJSpeech-derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per-phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully. Stops, fricatives, affricates, nasals, and silence-boundary closures rank most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per-articulatory category breakdown of the final verdict.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland
This paper presents a novel phoneme-guided cross-attention framework for speech deepfake detection, significantly enhancing interpretability and performance. The methodology effectively integrates phonetic structures into the detection process, providing a clear basis for understanding model decisions and contributing valuable insights to the field of audio processing and explainable AI.
The proposed methodology introduces a phoneme-guided cross-attention framework that significantly enhances the interpretability of speech deepfake detection systems. By leveraging phonetic posteriorgrams (PPGs) as a structural interface, the framework allows for a detailed analysis of the contribution of each phonetic class to the detection decision. This contrasts with traditional models that produce a single score without insight into the phonetic structure. The probabilistic factorization of the spoofing posterior into per-phone contributions is a novel approach that provides a clear, interpretable mechanism for understanding model behavior, which is a significant advancement in the field of explainable AI in speech processing.
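The factorization stated in the abstract, $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, can be illustrated with a few lines of arithmetic. The sketch below is not the authors' implementation: the function name `spoof_posterior`, the three phonetic classes, and all numeric values are invented for illustration only.

```python
def spoof_posterior(weights, per_phone_probs):
    """Combine per-phone spoofing probabilities into an utterance-level score.

    weights         -- w_i, the prevalence of each phonetic class (must sum to 1)
    per_phone_probs -- P(spoofed | X, Z = z_i) for each phonetic class
    Returns the utterance-level posterior and the per-class contributions.
    """
    if len(weights) != len(per_phone_probs):
        raise ValueError("one probability is needed per phonetic class")
    if abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("phone-presence weights must sum to 1")
    contributions = [w * p for w, p in zip(weights, per_phone_probs)]
    return sum(contributions), contributions


# Hypothetical utterance with M = 3 phonetic classes.
classes = ["stop", "vowel", "fricative"]
weights = [0.5, 0.3, 0.2]   # assumed phone-presence weights
probs = [0.9, 0.2, 0.6]     # assumed per-class spoofing probabilities
score, parts = spoof_posterior(weights, probs)
breakdown = dict(zip(classes, parts))
```

Because the utterance-level score is a plain weighted sum, each entry of `breakdown` can be read directly as that class's share of the final verdict, which is the kind of per-phone inspection the review describes.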
The experimental evaluation is robust, utilizing three datasets of varying complexity: a controlled LJSpeech-derived corpus and the standard benchmarks ASVspoof 2019 LA and ASVspoof 5 Track 1. The results demonstrate competitive performance while also providing insights into the discriminative power of different phonetic categories. The targeted phoneme-group ablation study further validates the importance of articulatory categories, confirming the model's ability to isolate and rank the contributions of different phonetic classes effectively.
The paper lacks explicit details regarding the implementation and the availability of code or models, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a publicly accessible repository or demo limits the ability of other researchers to validate and build upon the findings.
One limitation is the reliance on the quality of the phonetic posteriorgrams, which may introduce noise or inaccuracies if the phoneme extraction process is not robust. Additionally, while the model shows promise in structured interpretability, it may still struggle with complex, real-world scenarios where the phonetic structure is less clear. The paper does not address potential biases in the datasets used for training and evaluation.
The implications of this work are significant, particularly in the context of forensic voice analysis and anti-spoofing measures in security systems. By enhancing the interpretability of deepfake detection, the framework could facilitate more reliable applications in legal and security settings, where understanding the basis of decisions is crucial. Furthermore, the integration of phonetic structures into detection systems may inspire new research avenues in both speech synthesis and recognition.