Large reasoning models typically follow a read-then-think paradigm: they observe the complete input, reason over a static context, and then produce the answer. Yet many real-world scenarios are inherently dynamic, such as audio and video streams, where information arrives as a continuous stream and models must reason, update, and respond under partial observations. Recent streaming reasoning methods allow models to think while reading, but they largely rely on supervised imitation of pre-constructed trajectories, which limits their flexibility. In this paper, we propose AdaSR, an adaptive streaming reasoning framework that enables models to reason during input streaming and perform final deliberation once the stream is complete, learning when to think and how much computation to allocate across different stages. To optimize this hierarchical reasoning process, we introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, providing more fine-grained advantage assignment instead of uniformly distributing a single sequence-level advantage over all tokens. HRPO integrates format, accuracy, and adaptive thinking rewards to enforce valid reasoning protocols, preserve final task performance, and encourage latency-aware computation allocation. Experiments show that AdaSR achieves a better balance among reasoning accuracy, computational efficiency, and streaming latency compared with supervised fine-tuning baselines. We release our code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
Primary: Eastern Institute of Technology
All Institutions: Eastern Institute of Technology, Shanghai Jiao Tong University, The Hong Kong Polytechnic University, Southeast University, Xi'an Jiaotong-Liverpool University
The main contribution of this paper is the introduction of AdaSR, an innovative framework for adaptive streaming reasoning that optimizes the reasoning process in dynamic environments through hierarchical policy optimization and adaptive rewards. This work represents a significant advancement in the field of machine learning, particularly in the context of real-time reasoning and decision-making under uncertainty.
The paper introduces AdaSR, a novel adaptive streaming reasoning framework that leverages reinforcement learning to optimize reasoning during dynamic input streams. The methodology is robust, incorporating Hierarchical Relative Policy Optimization (HRPO) to address the temporal credit assignment problem inherent in streaming reasoning. By decomposing the policy optimization into distinct phases, AdaSR allows for more nuanced advantage assignment, which is a significant improvement over traditional methods that apply uniform advantages across all tokens. The integration of adaptive rewards further enhances the model's ability to balance reasoning accuracy and computational efficiency.
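The phase-wise advantage assignment described above can be illustrated with a minimal sketch. It assumes a GRPO-style group normalization applied separately to a streaming-phase reward and a deep-reasoning-phase reward; the dictionary keys, the way rewards are split per phase, and the function names are hypothetical and are not taken from the AdaSR code.

```python
from statistics import mean, pstdev

def group_relative(rewards, eps=1e-6):
    """Normalize rewards within a rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def phase_wise_advantages(rollouts):
    """Assign one advantage per phase instead of one per sequence.

    Each rollout is a dict with hypothetical keys: 'stream_reward',
    'deep_reward', 'n_stream_tokens', 'n_deep_tokens'.
    Returns a list of per-token advantage lists, one per rollout.
    """
    stream_adv = group_relative([r["stream_reward"] for r in rollouts])
    deep_adv = group_relative([r["deep_reward"] for r in rollouts])
    per_token = []
    for r, a_s, a_d in zip(rollouts, stream_adv, deep_adv):
        # Streaming tokens share the streaming-phase advantage,
        # deliberation tokens share the deep-phase advantage.
        per_token.append([a_s] * r["n_stream_tokens"] + [a_d] * r["n_deep_tokens"])
    return per_token

# Toy group of two rollouts with opposite strengths in the two phases.
rollouts = [
    {"stream_reward": 1.0, "deep_reward": 0.0, "n_stream_tokens": 2, "n_deep_tokens": 3},
    {"stream_reward": 0.0, "deep_reward": 1.0, "n_stream_tokens": 1, "n_deep_tokens": 2},
]
advantages = phase_wise_advantages(rollouts)
```

In this toy group the first rollout gets a positive advantage on its streaming tokens and a negative one on its deliberation tokens, which a single sequence-level advantage could not express.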
The experiments are comprehensive, evaluating AdaSR against multiple benchmarks in reasoning tasks, including mathematical reasoning and context-based question answering. The results demonstrate significant improvements in accuracy and efficiency compared to baseline models, indicating the effectiveness of the proposed approach. The paper provides detailed metrics on accuracy, token lengths, and latency, which are critical for assessing the performance of streaming reasoning models.
The authors have released their code, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics that would facilitate easier replication of the experiments, such as hyperparameter settings and training configurations.
The paper acknowledges that AdaSR is primarily focused on text streams with verifiable answers, which may limit its applicability to more complex scenarios involving continuous audio or video streams. Additionally, the reliance on reinforcement learning may introduce challenges in training stability and convergence, which are not thoroughly addressed.
The proposed framework has the potential to significantly enhance real-time reasoning capabilities in various applications, including interactive AI systems, real-time translation, and autonomous agents. By enabling models to adaptively allocate computation based on input dynamics, AdaSR could lead to more responsive and efficient AI systems in real-world scenarios.
We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at https://github.com/yonghyunk1m/TuneJury.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Sony AI, Georgia Tech, KAIST, Peking University, QMUL
TuneJury presents a novel approach to music generation preference alignment through a pairwise reward model, demonstrating competitive performance with a lean architecture and practical applications in real-world scenarios. The comprehensive evaluation and innovative calibration method position this work as a meaningful contribution to the field of machine learning in audio.
The methodology introduces TuneJury as a pairwise reward model for text-to-music generation, leveraging a small MLP head over frozen audio and text encoders. The choice of a pairwise approach is well-justified, addressing the limitations of absolute scoring systems in subjective domains like music. The model is trained on a diverse set of human-rated pairs, which enhances its robustness and generalizability. The introduction of anchor calibration as a post-hoc adjustment method is a notable innovation that allows for adaptation to new systems without the need for retraining, showcasing a practical approach to real-world application.
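As a rough illustration of how a pairwise reward of this kind is typically used, the sketch below applies the standard Bradley-Terry model to a score margin and then performs best-of-N selection. The scores are hypothetical placeholders for reward-model outputs; the function names are illustrative and do not reflect the TuneJury API, and the per-system anchor calibration is not modeled here.

```python
import math

def preference_probability(score_a, score_b):
    """Bradley-Terry probability that clip A is preferred over clip B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

def pairwise_loss(score_preferred, score_rejected):
    """Negative log-likelihood of an observed human preference."""
    return -math.log(preference_probability(score_preferred, score_rejected))

def best_of_n(candidates, score_fn):
    """Inference-time selection: keep the candidate with the highest score."""
    return max(candidates, key=score_fn)

# Hypothetical scores standing in for a frozen reward model's outputs.
toy_scores = {"clip_a": 0.2, "clip_b": 1.5, "clip_c": -0.4}
chosen = best_of_n(list(toy_scores), toy_scores.get)
```

Because only the margin between two scores enters the probability, a simple threshold on that margin is enough to filter low-confidence pairs.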
The experimental evaluation is comprehensive, utilizing multiple datasets and benchmarks to assess the performance of TuneJury. The authors provide detailed comparisons against existing models, demonstrating that TuneJury achieves competitive accuracy with fewer parameters and without relying on pseudo-label augmentation. The results are statistically significant, with clear metrics reported for pairwise accuracy and calibration, as well as downstream applications. The experiments effectively illustrate the model's capabilities across different scenarios, including inference-time selection and latent optimization.
The paper includes sufficient details on the training procedure, architecture, and datasets used, which enhances reproducibility. The authors have made the code, checkpoints, and demo available, which is a strong point for enabling other researchers to replicate their findings. However, some hyperparameter settings and specific implementation details could be more explicitly stated to further aid reproducibility.
The paper acknowledges several limitations, including potential biases in the training data, particularly the lack of representation for vocal music and the calibration signal's dependence on the specific datasets used. The performance drop on post-cutoff splits indicates that the model may not generalize well to newer music generation systems, which could limit its applicability in rapidly evolving contexts.
TuneJury has the potential to significantly impact the field of music generation by providing a more aligned and efficient method for evaluating generated music against human preferences. Its open-source nature encourages community engagement and further research, potentially leading to advancements in multimodal systems that combine text and audio understanding. The implications for music generation tools and applications in creative industries are substantial, as this model could enhance user experience and satisfaction in automated music creation.
Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Meanwhile, Mamba, a recent state-space model, has emerged as a promising alternative to Transformers for sequence modeling. In this work, we present MambAdapter, a parameter-efficient transfer learning approach that integrates Mamba into low-rank bottleneck adapters. Our design combines parameter sharing across adapters with the injection of a lightweight Mamba module, enabling more effective modeling of audio features. We demonstrate that MambAdapter matches or outperforms strong PETL baselines on four audio classification tasks and five speech recognition languages, even when operating under reduced parameter budgets.
Primary: Université de Montréal
All Institutions: Université de Montréal, Imperial College London, Concordia University, Mila - Quebec AI Institute
The main contribution of this work is the introduction of MambAdapter, a novel parameter-efficient transfer learning method that combines Mamba's state-space modeling with low-rank bottleneck adapters, achieving competitive performance on audio and speech tasks while significantly reducing the number of trainable parameters. This paper represents a meaningful advancement in the quest for efficient model adaptation in the rapidly evolving field of audio processing.
The paper introduces MambAdapter, which innovatively integrates Mamba, a state-space model, into low-rank bottleneck adapters for parameter-efficient transfer learning in speech and audio tasks. The methodology is well-grounded in existing literature, leveraging the strengths of Mamba's linear-time modeling capabilities while addressing the inefficiencies of traditional Transformer fine-tuning. The use of shared projections and the lightweight Mamba module is a thoughtful design choice that enhances the model's ability to capture long-range dependencies in audio data.
The experimental setup is robust, with comprehensive evaluations across multiple audio classification tasks and multilingual speech recognition. The authors provide a clear comparison against established PETL baselines, demonstrating that MambAdapter achieves competitive or superior performance while maintaining a lower parameter budget. The results are statistically validated through averaging over multiple random seeds, which adds credibility to their findings.
The paper includes a link to the code repository, which is essential for reproducibility. However, the paper could benefit from more detailed hyperparameter settings and training configurations to facilitate easier replication of results by other researchers.
While the paper presents promising results, it does not extensively explore the limitations of MambAdapter, such as potential performance degradation in extremely low-resource settings or the impact of varying audio characteristics on model performance. Additionally, the focus on a limited number of datasets may restrict the generalizability of the findings.
The integration of Mamba into PETL frameworks has significant implications for the field of audio and speech processing, particularly in resource-constrained environments. The findings could influence future research directions in efficient model adaptation, potentially leading to advancements in real-time speech recognition and audio classification applications.
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.
Primary: Indian Institute of Technology Madras
All Institutions: AI4Bharat, Indian Institute of Technology Madras, Sarvam AI
IndicContextEval introduces a rigorous, multilingual benchmark that reveals critical gaps in how AudioLLMs utilize contextual information, demonstrating that while some models effectively leverage native-script entity biasing, others suffer from blind reliance or contextual blindness, thereby establishing a new standard for evaluating contextual grounding in speech recognition systems.
The paper proposes a novel evaluation framework, IndicContextEval, designed to probe the contextual grounding capabilities of Audio Large Language Models (AudioLLMs). The core methodological contribution is the design of a 7-level prompting taxonomy (L0-L6) that systematically varies the type and quality of textual context provided to the model (from no context to adversarial incorrect entities). This controlled experimental design allows for the isolation of specific contextual signals (metadata, natural language descriptions, entity lists) and the measurement of their impact on transcription accuracy (WER) and entity recognition (NEER). The approach is rigorous in its control of variables, aiming to distinguish between parametric memorization and genuine contextual utilization.
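For reference, the transcription metric named above (WER) is conventionally computed as a word-level edit distance divided by the reference length. The sketch below is a generic implementation under that standard definition, not the benchmark's own scoring script; text normalization, which matters a great deal for Indic scripts, is deliberately left out, and NEER is not reproduced because its exact definition is not given here.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Comparing this value for the same utterance across prompt levels L0 to L6 gives the per-level effect of each contextual signal.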
The authors evaluate five systems (GPT-4o Transcribe, Gemini 3 Flash, Sarvam Audio, Gemma-3N, and a standalone IndicConformer baseline) on a newly collected dataset of 56 hours of natural speech across 8 Indian languages and 23 professional domains. The results are insightful, revealing significant disparities in how models handle context. For instance, GPT-4o Transcribe shows robust contextual reasoning, while Gemma-3N exhibits "blind reliance" on entity prompts, even when they are adversarial. The finding that natural-language descriptions often outperform structured metadata, and that native-script entity biasing yields the largest gains, provides concrete empirical evidence for the field. The use of NEER as a primary metric for entity biasing is well-chosen and adds depth to the standard WER evaluation.
The paper provides a clear description of the dataset creation process, including speaker demographics, recording styles (read vs. extempore), and quality control measures. The prompt taxonomy is explicitly defined, allowing other researchers to replicate the evaluation protocol. The code and benchmark data are made publicly available via GitHub, which significantly enhances reproducibility. The inclusion of specific model versions and the detailed breakdown of results by language and context level further support reproducibility efforts.
The dataset, while diverse, is limited to 8 Indian languages and 23 domains, which may not capture the full spectrum of global linguistic diversity or domain-specific challenges. The reliance on commercial models (GPT-4o, Gemini) limits the ability to fully inspect internal mechanisms, although the black-box evaluation is appropriate for the benchmark's goals. The "adversarial" prompts in L6 are limited to incorrect domain entities; more sophisticated adversarial attacks (e.g., semantically similar but incorrect entities) could provide deeper insights into model robustness. Additionally, the dataset size (56 hours) is relatively small compared to large-scale ASR benchmarks, which may limit the statistical power of some analyses, particularly for lower-resource languages within the set.
This work has significant implications for the development and deployment of multilingual AudioLLMs, particularly in low-resource and high-context domains like healthcare, legal, and technical support in India. By highlighting the risks of blind reliance on context or failure to utilize it, the benchmark encourages the development of more robust and interpretable models. It also underscores the importance of native-script support and the challenges of cross-lingual entity biasing. The public release of the benchmark will facilitate fairer comparisons and drive progress in contextual ASR for Indic languages.
Few-shot adaptation of pretrained Audio-Language Models (ALMs) often improves seen-class performance at the cost of unseen-class generalization, leading to the base-to-new trade-off. We attribute this failure to zero-shot drift in the text embedding space: few-shot tuning can distort inter-class structure and move adapted embeddings far from their pretrained anchors. We therefore propose Subspace Tuning (SubT), a geometry-constrained adaptation framework with two complementary controls on drift. Structured Subspace Parameterization limits structural deformation, and Residual Anchoring stabilizes adaptation around the zero-shot prior. At inference time, Subspace-aware Gating further suppresses negative transfer for weakly aligned unseen classes. Across 11 audio benchmarks, SubT delivers strong few-shot generalization while remaining efficient, operating directly on precomputed text embeddings without text-encoder backpropagation.
Primary: KAIST
All Institutions: KAIST
This paper presents a novel geometry-constrained adaptation framework, Subspace Tuning, which effectively mitigates the base-to-new generalization trade-off in Audio-Language Models by preserving the structural integrity of the text embedding space, leading to significant improvements in few-shot generalization across diverse audio benchmarks.
The paper proposes Subspace Tuning (SubT), a parameter-efficient adaptation method for Audio-Language Models (ALMs) that addresses the base-to-new generalization trade-off. The core innovation lies in constraining the geometry of the text embedding space during few-shot adaptation. Specifically, it employs Structured Subspace Parameterization (using SVD to fix class-dependent coordinates and learn only a shared basis) and Residual Anchoring (to limit magnitude drift from zero-shot prototypes). At inference, a Subspace-aware Gating mechanism modulates the transfer of this basis shift to unseen classes based on their alignment with the base subspace. The approach is theoretically grounded in the observation that unconstrained adaptation distorts inter-class relational structures (Gram matrix drift) and displaces prototypes from their semantic anchors. The method is mathematically sound, offering a clear geometric interpretation of adaptation dynamics.
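The two kinds of drift mentioned above can be made concrete with a small diagnostic sketch. It assumes that structural drift is measured as the Frobenius norm of the Gram-matrix difference and that anchor drift is the mean Euclidean distance to the zero-shot prototypes; these specific formulas and function names are illustrative assumptions, not the paper's exact metrics.

```python
import math

def gram(embeddings):
    """Gram matrix of inner products between class embeddings."""
    return [[sum(a * b for a, b in zip(u, v)) for v in embeddings]
            for u in embeddings]

def gram_drift(zero_shot, adapted):
    """Frobenius norm of the Gram-matrix difference (inter-class structure drift)."""
    g0, g1 = gram(zero_shot), gram(adapted)
    return math.sqrt(sum((x - y) ** 2
                         for row0, row1 in zip(g0, g1)
                         for x, y in zip(row0, row1)))

def anchor_drift(zero_shot, adapted):
    """Mean Euclidean distance of adapted embeddings from their zero-shot anchors."""
    distances = [math.dist(u, v) for u, v in zip(zero_shot, adapted)]
    return sum(distances) / len(distances)

# Toy prototypes: a rotation keeps inter-class structure but leaves the anchors;
# a rescaling changes the structure as well.
zero_shot = [[1.0, 0.0], [0.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0]]
scaled = [[2.0, 0.0], [0.0, 2.0]]
```

The rotated case shows why two controls are described as complementary: its Gram drift is zero while its anchor drift is not, so a constraint on structure alone would not keep embeddings near the zero-shot prior.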
The authors evaluate SubT on 11 diverse audio benchmarks, including sound event classification, emotion recognition, and acoustic scene classification. The results demonstrate that SubT significantly outperforms strong baselines like CoOp, CoCoOp, SEPT, and CLIP-Adapter in terms of the harmonic mean of base and new class accuracy. The paper provides extensive ablation studies confirming the contribution of each component (SSP, RA, Gating) and analyzes the correlation between geometric drift metrics and generalization performance. Cross-dataset transfer experiments further validate the method's robustness, although performance is shown to depend on the semantic compatibility between source and target label spaces. The evaluation is comprehensive and convincing.
The paper provides implementation details, including dataset splits, prompt templates, hyperparameters (learning rate, batch size, epochs), and backbone specifications (Pengi). The training objective and inference procedures are clearly defined. The authors also provide a parameter-matched comparison with CLIP-Adapter, ensuring fair evaluation. The code is not explicitly linked in the text provided, but the methodological description is sufficient for reproduction by experts in the field.
The authors acknowledge that the effectiveness of SubT is bounded by the quality of the underlying zero-shot representation. In highly specialized domains where the pretrained semantic prior is misaligned with the task, the transferred subspace may be less informative. Additionally, the number of learnable parameters scales linearly with the number of base classes, which may limit efficiency in very large-label settings compared to methods with fixed parameter counts. The gating mechanism, while beneficial on average, can be unstable in very low-dimensional regimes (e.g., Beijing-Opera with only 2 base classes).
This work contributes to the broader field of few-shot learning and multimodal representation learning by highlighting the importance of geometric constraints in preserving transferable knowledge. By demonstrating that SubT is modality-agnostic (tested on CLIP/ImageNet), the paper suggests potential applications beyond audio, such as vision-language models and other multimodal systems where few-shot adaptation is critical. The focus on generalization to unseen classes aligns with the goal of building more robust and adaptable AI systems.
Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University
NeuralMUSIC presents a robust hybrid framework for robot sound source localization by combining neural covariance estimation with classical subspace methods, achieving superior accuracy and generalization across diverse acoustic environments. The integration of self-supervised learning for spatial correlation and adaptive frequency fusion addresses critical challenges in low-SNR and broadband scenarios, offering a significant advancement in reliable robotic audition systems.
The paper proposes NeuralMUSIC, a hybrid framework that integrates a neural network for spatial covariance matrix estimation into the classical Multiple Signal Classification (MUSIC) algorithm. The approach addresses key limitations of classical MUSIC (noise sensitivity, broadband processing) and pure deep learning methods (black-box nature, poor generalization). Key innovations include: 1) A neural encoder to predict the spatial covariance matrix, which is then used in the standard MUSIC eigen-decomposition pipeline. 2) A Frequency Attention Fusion (FAF) module to adaptively weight frequency bins for broadband DOA estimation. 3) A self-supervised Spatial Correlation Learning (SSCL) strategy using masked channel reconstruction to leverage unlabeled data. 4) An adaptive source-number prediction module. The methodology is well-grounded in signal processing theory and effectively bridges the gap between model-based and data-driven approaches. The integration of SSCL is a particularly strong design choice for robotic applications where labeled data is scarce.
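The classical half of this pipeline (covariance, eigen-decomposition, pseudo-spectrum) can be sketched in a few lines. The example below is a simplified narrowband, single-source version for a uniform linear array with half-wavelength spacing; the covariance is synthesized from the ideal model rather than predicted by a network, and power iteration stands in for a full eigenvalue decomposition. All of these choices, and the function names, are assumptions made for illustration.

```python
import cmath
import math

def steering_vector(theta_deg, n_mics, spacing=0.5):
    """Uniform linear array response; spacing is in wavelengths (assumed 0.5)."""
    phase = -2j * math.pi * spacing * math.sin(math.radians(theta_deg))
    return [cmath.exp(phase * m) for m in range(n_mics)]

def dominant_eigenvector(cov, iters=200):
    """Power iteration for the top eigenvector of a Hermitian matrix."""
    n = len(cov)
    v = [complex(1.0, 0.0)] + [complex(0.0, 0.0)] * (n - 1)
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        v = [x / norm for x in w]
    return v

def music_spectrum(cov, angles_deg, spacing=0.5):
    """Single-source MUSIC pseudo-spectrum: 1 / (a^H (I - u u^H) a)."""
    n = len(cov)
    u = dominant_eigenvector(cov)  # spans the signal subspace
    spectrum = []
    for theta in angles_deg:
        a = steering_vector(theta, n, spacing)
        signal_part = abs(sum(ui.conjugate() * ai for ui, ai in zip(u, a))) ** 2
        noise_part = sum(abs(ai) ** 2 for ai in a) - signal_part
        spectrum.append(1.0 / max(noise_part, 1e-12))
    return spectrum

# Ideal covariance for one source at 20 degrees plus white noise (assumed model).
true_angle, n_mics, noise_power = 20, 4, 0.1
a_true = steering_vector(true_angle, n_mics)
cov = [[a_true[i] * a_true[j].conjugate() + (noise_power if i == j else 0.0)
        for j in range(n_mics)] for i in range(n_mics)]
angles = list(range(-90, 91))
spectrum = music_spectrum(cov, angles)
estimated_angle = angles[spectrum.index(max(spectrum))]
```

In a hybrid setup of the kind described above, only the covariance input would change; the subspace projection and peak search stay as in the classical algorithm, which is where the interpretability comes from.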
The authors conduct extensive experiments on four datasets: GSC (simulated), AV16.3 (real-world speaker), SLoClas (acoustic events), and AFPILD (pedestrian footsteps). They compare against classical methods (MUSIC, NormMUSIC, Beamforming, TOPS, FRIDA) and deep learning baselines (CRNN, Transformer, DOANet, DeepDAE, DeepMusic, DA-Music). Results show consistent improvements in Mean Absolute Angular Error (MAAE) across all datasets and configurations (single-source, multi-source, unknown source number). Ablation studies validate the contributions of FAF and SSCL. Additional experiments on data efficiency, SNR robustness, and cross-domain generalization (cross-room, cross-array) further demonstrate the method's robustness. The evaluation is comprehensive and convincing.
The paper provides detailed descriptions of the network architecture, loss functions, and experimental settings (STFT parameters, optimizer, hyperparameters). The code is made available on GitHub. The use of standard datasets (GSC, AV16.3, SLoClas, AFPILD) facilitates reproduction. The description of the SSCL masking strategies and the hybrid pipeline is clear.
The method relies on a neural network to estimate the covariance matrix, so it may still suffer from distribution shift when the acoustic environment or array geometry differs significantly from the training data, although the paper shows some resilience. On AFPILD, NeuralMUSIC achieves the lowest error among the compared methods, but that error is still relatively high (10.24 degrees), indicating that footstep sounds remain challenging. Cross-array generalization, while better than that of pure DL methods, is not perfect and degrades with large geometry mismatches. The paper acknowledges these issues in its limitations section.
This work contributes to the field of robot audition and spatial audio processing. By providing a robust, data-efficient, and interpretable solution for sound source localization, it enables more reliable autonomous robots in dynamic environments. The hybrid approach offers a template for integrating physical priors with deep learning in other signal processing tasks. The self-supervised learning strategy is broadly applicable to other domains with limited labeled data.
Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
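The two operations named in this abstract, Difference-in-Means direction extraction and Gram-Schmidt decoupling of two steering directions, can be sketched on toy vectors as follows. The activation values, the steering strengths, and the function names are hypothetical; a real setup would read activations from the model's residual stream rather than from hand-written lists.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equally sized activation vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

def diff_mean(positive, negative):
    """Difference-in-Means direction: mean(positive) - mean(negative)."""
    return [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(v, against):
    """One Gram-Schmidt step: remove from v its component along `against`."""
    scale = dot(v, against) / dot(against, against)
    return [a - scale * b for a, b in zip(v, against)]

def steer(hidden, directions, strengths):
    """Add each steering direction, scaled by its strength, to a hidden state."""
    out = list(hidden)
    for direction, strength in zip(directions, strengths):
        out = [h + strength * d for h, d in zip(out, direction)]
    return out

# Toy activations (hypothetical) for contrasting attribute values.
high_pitch = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
low_pitch = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
long_duration = [[1.0, 3.0, 0.0]]
short_duration = [[0.0, 1.0, 0.0]]

pitch_dir = diff_mean(high_pitch, low_pitch)
duration_dir = diff_mean(long_duration, short_duration)
duration_orth = orthogonalize(duration_dir, pitch_dir)
steered = steer([0.0, 0.0, 0.0], [pitch_dir, duration_orth], [1.0, 0.5])
```

After the orthogonalization step the duration direction has no component along the pitch direction, so changing its strength no longer shifts the hidden state along the pitch axis, which is the intuition behind reduced interference compared with naive vector addition.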
Primary: Athens University of Economics and Business
All Institutions: Athens University of Economics and Business, Orfium Research, Hellenic Mediterranean University, NCSR "Demokritos", Archimedes / Athena Research Center
This paper presents a theoretically sound and practically effective method for overcoming sparsity-induced discontinuities in activation steering by applying PID control theory, significantly enhancing the smoothness and precision of inference-time control in symbolic music generation.
The paper proposes a novel application of Proportional-Integral-Derivative (PID) control theory to the problem of activation steering in symbolic music generation. Specifically, it addresses a critical failure mode in Sparse Activation Steering (SAS) where the Top-K sparsity constraint creates a binary threshold barrier, preventing smooth, gradual steering (e.g., via cosine ramps) because small steering magnitudes are zeroed out. The authors introduce "Temporal PID," a closed-loop controller that operates at each autoregressive generation step. The integral term accumulates error (the difference between the desired feature magnitude and the actual surviving feature magnitude) to dynamically increase the steering magnitude until it breaches the Top-K threshold, after which the derivative term dampens overshoot. They also validate "Spatial PID" across layers, confirming previous findings in shallower architectures. The methodology is theoretically grounded in control theory and mechanistic interpretability (Linear Representation Hypothesis), offering a rigorous mathematical framework for inference-time control that is distinct from standard static vector addition or dense steering methods.
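The threshold-breaching role of the integral term can be illustrated with a toy closed loop. The sketch below assumes a textbook discrete PID update and replaces the Top-K gate with a simple hard threshold on the applied magnitude; the gains, the threshold, and the function names are hypothetical. The derivative gain is left at zero in the demo because this stand-in gate has no dynamics of its own.

```python
class PIDController:
    """Textbook discrete PID controller with unit time step."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error                 # accumulates unmet demand
        derivative = error - self.prev_error   # reacts to sudden changes
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def surviving_magnitude(applied, threshold):
    """Toy stand-in for a Top-K gate: magnitudes below the threshold are zeroed."""
    return applied if applied >= threshold else 0.0

def run(controller, target, threshold, n_steps):
    """Simulate one control update per generation step and log the surviving magnitude."""
    measured, history = 0.0, []
    for _ in range(n_steps):
        applied = controller.step(target, measured)
        measured = surviving_magnitude(applied, threshold)
        history.append(measured)
    return history

# Proportional-only control never crosses the gate; adding the integral term does.
p_only_history = run(PIDController(kp=0.5, ki=0.0, kd=0.0), 1.0, 0.75, 50)
pi_history = run(PIDController(kp=0.2, ki=0.3, kd=0.0), 1.0, 0.75, 50)
```

In this toy loop the proportional-only controller keeps requesting the same sub-threshold magnitude, while the accumulated error of the integral term raises the request until it survives the gate and then settles near the target.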
The experiments are conducted on the Multitrack Music Transformer (MMT) using the Symbolic Orchestral Database (SOD). The evaluation covers single-concept steering (pitch, duration) and dual-concept steering (simultaneous pitch and duration). The results demonstrate that Temporal PID significantly reduces quality degradation (measured by entropy, scale consistency, and groove consistency) compared to static SAS baselines while achieving comparable or superior steering success rates. The paper provides detailed ablation studies showing the necessity of the Integral term (P-only fails to overcome the threshold) and the benefit of the Derivative term. It also includes a "Round-Trip" steering experiment, demonstrating the reversibility and smoothness of the control, which is impossible with static methods. The use of Fréchet Music Distance (FMD) and specific musical metrics adds robustness to the evaluation, although the sample sizes are noted as modest.
The paper provides sufficient detail for reproduction, including SAE architecture details (512x4096, Top-K=128), training data sources (SOD), and hyperparameters for the PID controller (gains, ramp lengths). The codebase for MMT and SOD is publicly available, and the specific steering vector construction (DiffMean) and injection strategies are described. The main limitation for immediate reproduction is the lack of released code, but the methodological description is precise enough for an experienced researcher to implement.
The primary limitations are the evaluation on a single model architecture (MMT) and dataset (SOD), which limits the generalizability of the findings to other symbolic music models or audio domains. The sample sizes for some experiments (e.g., n=20 for dual steering) are small, potentially affecting statistical significance. The paper acknowledges the absence of perceptual validation (e.g., MUSHRA tests), relying solely on objective metrics. Additionally, the "Dual-Concept" steering requires expanding the Top-K budget, which is a specific constraint of the SAE implementation used and may not be optimal for all sparse autoencoder architectures.
This work advances the field of controllable generative AI, particularly in symbolic music, by providing a method for fine-grained, interpretable, and smooth control without retraining. This has implications for creative tools, allowing musicians to interactively shape generated music. The application of control theory to mechanistic interpretability is a novel cross-disciplinary contribution that could inspire similar approaches in other discrete sequence generation tasks. The dual-use risk of unauthorized style imitation is acknowledged, and the authors mitigate this by releasing only the method, not artist-specific vectors.
While Audio Large Language Models (Audio LLMs) excel at multimodal understanding, they suffer from text dominance, a bias where models blindly favor text over acoustic evidence, causing hallucinations. However, the internal mechanisms underlying how these models behave when audio and textual inputs contradict each other remain unexplored. In this work, we present the first mechanistic analysis of this phenomenon by tracing the propagation of internal representations across layers. Our investigation reveals three key findings: (i) text dominance is systematic and empirically observed across models; (ii) while text and audio rely on functionally distinct pathways, they ultimately converge into a shared semantic space in late layers; and (iii) the text pathway does not erase audio information, but rather actively suppresses intact audio representations. Building on these insights, we leverage back-patching, a training-free intervention that routes late-layer audio activations back into earlier layers. This amplifies the audio representations, enabling them to overcome textual suppression. Our evaluation shows that back-patching consistently reduces text dominance, paving the way for mechanistic multimodal alignment under conflict.
Primary: Unknown
All Institutions: Unknown
This paper provides the first mechanistic analysis of text dominance in Audio LLMs, revealing that text actively suppresses audio representations, and proposes a training-free back-patching intervention to mitigate this bias, offering valuable insights for building more balanced multimodal systems.
The paper proposes a mechanistic interpretability study of Audio Large Language Models (Audio LLMs), specifically targeting the phenomenon of "text dominance." The core methodological contribution is the use of activation tracing to map how textual and acoustic representations propagate through the model layers. The authors identify that while modalities start in distinct pathways, they converge in late layers, with text actively suppressing audio representations rather than merely ignoring them. To mitigate this, they introduce "back-patching," a training-free intervention that routes late-layer audio activations back into earlier layers to amplify audio signals. This approach is technically sound and leverages established techniques from mechanistic interpretability (like activation patching) applied to a novel multimodal context. The methodology is rigorous in its diagnostic capability, offering clear insights into the internal dynamics of multimodal fusion.
The experimental evaluation focuses on diagnosing the text dominance bias and validating the efficacy of the back-patching intervention. The results indicate that back-patching consistently reduces text dominance across tested models. However, the abstract and limited context suggest the evaluation is primarily diagnostic and corrective rather than performance-maximizing. The paper likely demonstrates that the intervention works to balance modalities, but it may not show state-of-the-art performance on standard multimodal benchmarks compared to fully trained models. The strength lies in the empirical verification of the mechanistic hypothesis (suppression vs. erasure) and the proof-of-concept for the intervention. The lack of extensive benchmark comparisons limits the assessment of practical utility versus architectural changes.
The paper describes a training-free intervention, which generally enhances reproducibility as it does not require retraining large models. The use of standard mechanistic interpretability tools (activation tracing, patching) suggests that the codebase is likely modular and accessible if the authors provide it. However, the "Unknown" institution and lack of explicit code links in the provided text mean reproducibility relies on the authors' willingness to open-source. The methodology is clear enough to be implemented by researchers familiar with transformer internals.
A primary limitation is the scope of the evaluation. The paper focuses on a specific bias (text dominance) and a specific mitigation (back-patching). It does not necessarily improve overall accuracy on complex multimodal reasoning tasks where audio-text alignment is crucial but not contradictory. Furthermore, back-patching is a post-hoc intervention; it may not be optimal for all types of conflicts or for tasks requiring deep semantic integration of audio and text. The generalizability of the "suppression" finding to all Audio LLM architectures (e.g., those with different fusion mechanisms like cross-attention vs. late fusion) is not explicitly detailed in the abstract, though likely explored in the full text.
This work has significant implications for the development of reliable multimodal AI systems. By revealing the internal mechanisms of text dominance, it provides a roadmap for building more balanced and trustworthy Audio LLMs. The findings challenge the assumption that multimodal models simply average or concatenate features, highlighting the need for explicit architectural or training interventions to ensure audio evidence is not systematically overridden by text. This contributes to the broader field of AI safety and reliability, particularly in high-stakes applications where audio evidence (e.g., in medical or legal contexts) must be weighed equally with textual descriptions.
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization masks to extract target-speaker representations, keeping the decoder frozen. We instantiate this as Dixtral, integrating a Diarization Conditioned Whisper (DiCoW) encoder into the Voxtral SLM. On AMI, NOTSOFAR-1, LibriSpeechMix, and Mixer6, Dixtral outperforms Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2 on speaker-attributed transcription by 29.0%, 19.8%, and 16.0% absolute cpWER respectively. On a novel long-form multi-speaker QA benchmark, zero-shot Dixtral matches Gemini on far-field content understanding, and when fine-tuned surpasses both Gemini and Voxtral operating on close-talk across all tasks.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Carnegie Mellon University
The paper presents a compelling and effective method for grounding spoken LLMs in multi-speaker audio through encoder-side diarization conditioning, achieving state-of-the-art performance on transcription and novel capabilities in multi-speaker reasoning and QA.
The paper proposes a novel architectural strategy for extending Spoken Large Language Models (SLMs) to multi-speaker scenarios by conditioning the acoustic encoder on diarization masks, rather than adapting the decoder. This approach, instantiated as Dixtral, integrates a Diarization Conditioned Whisper (DiCoW) encoder with a frozen Voxtral decoder. The core innovation lies in the "Diarization Conditioning" mechanism, which uses frame-level speaker activity probabilities (STNO masks) to modulate internal representations via learnable affine transformations (FDDT). This allows the model to extract target-speaker representations while keeping the LLM decoder frozen, thereby avoiding catastrophic forgetting of reasoning capabilities associated with Serialized Output Training (SOT) and vocabulary expansion. The methodology is theoretically sound, offering a computationally efficient alternative ($O(S N^2)$ vs $O((SN)^2)$) for multi-speaker decoding.
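The complexity comparison quoted above can be checked with a few lines of arithmetic. This sketch assumes that $S$ is the number of speakers and $N$ the number of tokens decoded per speaker, and that the quadratic term comes from self-attention over the decoded sequence; neither reading is spelled out in the review, and the function names are hypothetical.

```python
def joint_decoding_cost(num_speakers, tokens_per_speaker):
    """Attention cost when all speakers share one serialized sequence: (S*N)^2."""
    return (num_speakers * tokens_per_speaker) ** 2

def per_speaker_decoding_cost(num_speakers, tokens_per_speaker):
    """Attention cost when each speaker is decoded in a separate pass: S * N^2."""
    return num_speakers * tokens_per_speaker ** 2

for speakers in (2, 4, 8):
    joint = joint_decoding_cost(speakers, 100)
    separate = per_speaker_decoding_cost(speakers, 100)
    print(speakers, joint, separate, joint // separate)
```

Under these assumptions the ratio between the two costs is exactly $S$, so the saving grows with the number of speakers while the per-speaker cost itself still grows linearly in $S$.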
The evaluation is comprehensive, covering four standard multi-speaker ASR datasets (AMI, NOTSOFAR-1, LibriSpeechMix, Mixer6) and a novel long-form QA/Summarization benchmark (NSF-QA). Dixtral demonstrates significant improvements over strong baselines, including Gemini 3.0 Flash, VibeVoice, and Voxtral Mini Transcribe V2, with absolute cpWER reductions of 16-29%. The inclusion of a paralinguistic QA task (emotion/gender) is particularly strong, as it tests the model's ability to utilize audio features beyond text, which cascaded systems cannot do. The results are robust, showing that zero-shot Dixtral matches Gemini on content QA and surpasses it when fine-tuned. The out-of-domain performance on Mixer6 further validates generalization.
The authors provide open-source code and a new dataset (NSF-QA). Training details are well-specified, including hardware constraints (8x A5000), optimization settings, and data chunking strategies. The use of established backbones (Whisper, Ministral, DiariZen) and clear integration points (FDDT, MLP adapter) ensures high reproducibility. The release of the benchmark dataset is a significant contribution to reproducibility in this niche.
The performance is inherently dependent on the quality of the external diarization system (DiariZen). Errors in diarization will propagate directly to the transcription and reasoning tasks. The paper acknowledges this but does not extensively analyze the sensitivity to diarization errors. Additionally, the current implementation requires separate inference passes for each target speaker, which, while more efficient than joint decoding, still scales linearly with the number of speakers. The fine-tuning for QA/Summarization slightly degrades pure ASR performance, indicating a trade-off that requires careful multi-task optimization in future work.
This work significantly advances the field of spoken language understanding by enabling end-to-end, multi-speaker reasoning in far-field audio. It bridges the gap between modular ASR pipelines and unified SLMs, offering a path towards more robust and capable voice assistants and meeting transcription tools. The ability to handle paralinguistic information (emotion, gender) in a multi-speaker context opens new avenues for affective computing and human-computer interaction.
Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional multi-step flow-matching decoders offer superior quality but suffer from high inference latency due to iterative sampling, creating a severe quality-speed trade-off. In this paper, we propose a novel Token2Wav architecture that overcomes this limitation by applying MeanFlow in a highly compressed latent space. By modeling the average velocity rather than the instantaneous velocity field, MeanFlow enables true one-step generation. Operating in the latent domain mitigates the memory and stability issues of waveform-level flows, yielding up to a 17$\times$ improvement in Real-Time Factor (RTF) compared to multi-step baselines with negligible quality degradation. Furthermore, we introduce refinement strategies that mitigate latent mismatch, including decoder-only fine-tuning with the MeanFlow generator frozen and end-to-end joint fine-tuning, improving fidelity without increasing inference-time cost. Code and demo are publicly available.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, LIGHTSPEED, Tencent, The Hong Kong University of Science and Technology
This paper presents a significant advancement in low-latency neural audio synthesis by successfully adapting MeanFlow to a latent space for one-step Token-to-Waveform generation, achieving a substantial reduction in inference latency while maintaining high perceptual quality through innovative refinement strategies.
The paper proposes a novel architecture for Token-to-Waveform (Token2Wav) generation by combining MeanFlow (a one-step flow matching variant) with a latent space representation. The core innovation lies in applying MeanFlow in a compressed latent space rather than waveform space, which addresses the stability and memory issues typically associated with one-step generation at high resolutions. The methodology includes a lightweight VAE for latent encoding/decoding and a DiT-1D model for the one-step latent generation. Crucially, the authors address the "latent mismatch" problem—where generated latents deviate from the VAE's training distribution—through two refinement strategies: decoder-only fine-tuning and end-to-end joint fine-tuning. This is a technically sound approach that effectively bridges the gap between the efficiency of autoregressive/LLM-based token generation and the quality of continuous flow-based vocoders. The use of MeanFlow to eliminate iterative sampling is a significant methodological contribution to low-latency audio synthesis.
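To illustrate why modelling the average velocity permits one-step sampling, the following toy compares multi-step Euler integration of an instantaneous velocity with a single jump along the average velocity. It is a sketch under strong assumptions: the "dataset" is one scalar point, the path is the linear interpolation between data and noise, and both velocity functions are closed-form stand-ins for what a trained network would predict. None of the names come from the paper.

```python
def instantaneous_velocity(z, t, data_point):
    """Velocity of the linear path z_t = (1 - t) * x + t * noise for a single data point x."""
    return (z - data_point) / t

def average_velocity(z_at_one, data_point):
    """Average velocity over [0, 1]: displacement (z_1 - z_0) divided by the interval length 1."""
    return z_at_one - data_point

def sample_multi_step(noise, data_point, num_steps):
    """Euler integration from t = 1 down to t = 0; needs `num_steps` velocity evaluations."""
    z = noise
    step_size = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * step_size
        z = z - step_size * instantaneous_velocity(z, t, data_point)
    return z

def sample_one_step(noise, data_point):
    """Single jump along the average velocity; needs one evaluation."""
    return noise - average_velocity(noise, data_point)

x_star, eps = 0.3, -1.7  # hypothetical data point and noise sample
print(sample_multi_step(eps, x_star, num_steps=10))
print(sample_one_step(eps, x_star))
```

Both samplers land on the same point in this toy, but the one-step variant uses a single function evaluation, which is the source of the latency reduction discussed above.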
The experimental setup is rigorous, utilizing LibriTTS for training and LibriSpeech test-clean for evaluation. The authors compare their method against CosyVoice2's Token2Wav module, a strong multi-step flow baseline. They report Real-Time Factor (RTF), Word Error Rate (WER), Speaker Similarity (SpkSim), and perceptual metrics (UTMOS, MOS). The results demonstrate a substantial speedup (17x RTF improvement) with negligible degradation in quality metrics. The ablation studies are particularly valuable, analyzing the impact of latent dimensionality, model capacity, and the specific refinement strategies. The finding that larger models do not necessarily perform better in one-step generation is an insightful empirical observation. The inclusion of both objective and subjective metrics strengthens the evaluation.
The paper provides clear descriptions of the model architectures, loss functions, and training procedures. The code and demo are publicly available on GitHub, which significantly enhances reproducibility. The authors specify the tokenization scheme (CosyVoice2 tokenizer), speaker embedding extractor (CAM++), and evaluation protocols. The detailed breakdown of RTF and the specific hyperparameters for the VAE and DiT models allow for faithful reproduction of the results.
The primary limitation is the reliance on a pre-trained VAE, which introduces a fixed bottleneck in terms of information loss during encoding. While the refinement strategies mitigate this, they do not eliminate the fundamental trade-off between compression ratio and fidelity. Additionally, the performance is evaluated primarily on LibriSpeech, which consists of read speech; performance on conversational or noisy speech is not reported. The "one-step" nature also implies that the model cannot easily refine outputs iteratively, which might be a limitation for applications requiring high-fidelity correction of errors. The paper notes that larger models did not improve performance, suggesting that the one-step constraint is difficult to scale without further architectural or training innovations.
This work has significant implications for the deployment of LLM-based TTS systems, particularly in latency-sensitive applications like real-time voice assistants, gaming, and on-device AI. By enabling high-quality, one-step audio generation, it reduces computational costs and energy consumption, making advanced speech synthesis more accessible. The approach could also be extended to other modalities where low-latency generation is critical. However, as with all audio generation technologies, there are potential misuse cases regarding voice cloning and deepfakes, necessitating responsible deployment practices.
This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level BERT into a phoneme-level BERT. This design provides prior word- and sentence-level information, reducing the amount of pre-training required by the phoneme encoder. Subjective listening evaluations show that CraBERT achieves MOS values comparable to existing PPEncs after approximately one epoch of pre-training, whereas the baselines in our comparison are pre-trained for approximately ten epochs. These results demonstrate that CraBERT can efficiently learn representations suitable for improving the perceived naturalness and prosody of synthesized speech.
Primary: The University of Tokyo
All Institutions: The University of Tokyo
This paper introduces CraBERT, an efficient phoneme encoder that significantly reduces pre-training time while maintaining high-quality speech synthesis. The innovative integration of subword representations and the development of a new alignment algorithm mark a notable advancement in the field of text-to-speech technologies.
The methodology presented in this paper is innovative in its use of a cascade-fusion architecture that integrates subword representations from a pre-trained BERT model into a phoneme-level BERT. This approach addresses the inefficiencies of traditional phoneme encoders by leveraging existing word- and sentence-level information, significantly reducing the pre-training time required for effective phoneme representation. The introduction of a data-driven subword-phoneme alignment algorithm based on dynamic time warping (DTW) further enhances the methodology, providing a systematic way to fuse these representations.
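As a rough illustration of DTW-based alignment between a subword sequence and a phoneme sequence, the sketch below runs textbook dynamic time warping on a small cost matrix. The cost values, the example tokens, and the function name are hypothetical; the paper's data-driven cost and any constraints it imposes on the path are not described in this review.

```python
def dtw_align(cost):
    """Textbook DTW: return (total_cost, path) for a rows-by-columns cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
            acc[i][j] = cost[i - 1][j - 1] + best_prev
    # Backtrack from the bottom-right corner to recover the alignment path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return acc[n][m], path

subwords = ["play", "##ing"]
phonemes = ["P", "L", "EY", "IH", "NG"]
# Hypothetical costs: 0 where a phoneme plausibly belongs to a subword, 1 otherwise.
cost = [
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
]
total, path = dtw_align(cost)
for s_idx, p_idx in path:
    print(subwords[s_idx], "<-", phonemes[p_idx])
```

Each phoneme ends up paired with one subword, which is the mapping a fusion layer would need in order to add subword-level representations to phoneme-level ones.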
The experimental evaluation is robust, employing subjective listening tests to assess the quality of synthesized speech using mean opinion scores (MOS). The results indicate that CraBERT achieves comparable performance to existing phoneme encoders after a fraction of the pre-training time, demonstrating its efficiency. The use of a multi-speaker dataset from the LibriTTS corpus adds credibility to the findings, although more diverse datasets could strengthen the generalizability of the results.
The paper provides detailed descriptions of the architecture, pre-training processes, and experimental setups, which are essential for reproducibility. However, the lack of publicly available code or a project repository limits the ease with which other researchers can replicate the results. The authors should consider releasing their implementation to enhance reproducibility.
One limitation is the reliance on a single pre-trained model (DistilBERT) for subword representations, which may not generalize across all languages or phonetic systems. Additionally, while the subjective evaluations show promising results, they are limited to a specific dataset and may not reflect performance across different languages or dialects. The paper also does not explore the potential for further optimization of the alignment algorithm.
The implications of this research are significant for the field of text-to-speech synthesis, particularly in improving the efficiency of phoneme encoders. The advancements in pre-training methodologies could lead to more accessible and faster TTS systems, which can be beneficial in various applications, including virtual assistants, audiobooks, and language learning tools. The approach could also inspire further research into efficient representation learning in other domains.
Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.
Primary: KTH Royal Institute of Technology
All Institutions: KTH Royal Institute of Technology
The main contribution of this paper is the introduction of MuVAP, a causal multimodal framework that effectively predicts turn-taking in multiparty conversations using a single audio stream and a single camera view. This innovative approach, combined with the creation of the AVCC dataset, addresses critical limitations in existing methods and has the potential to advance the field of conversational AI significantly.
The paper presents MuVAP, a novel multimodal framework that integrates audio and visual data to predict turn-taking in multiparty conversations. The methodology is well-structured, introducing Role-Relative Projection to simplify the complexity of multiparty interactions by focusing on the current and next speaker. The use of a single audio channel and a single camera view is a significant departure from traditional methods, which often require complex setups. The introduction of the Audio-Visual Conversation Corpus (AVCC) is a crucial contribution, as it provides a dataset specifically designed for this type of analysis, addressing the limitations of existing datasets.
The experiments are comprehensive, comparing MuVAP against strong baselines across various tasks, including Shift-Hold and Next Speaker Prediction. The results demonstrate that MuVAP outperforms these baselines, showcasing its effectiveness in real-world scenarios. The evaluation metrics used, such as Macro-F1 for Shift-Hold Prediction and accuracy for Next Speaker Prediction, are appropriate for the tasks at hand. The paper provides detailed results and analysis, indicating a thorough evaluation process.
The paper includes sufficient implementation details, including the architecture of the model, training procedures, and the datasets used. However, the lack of a public demo or clear access to the trained models may hinder full reproducibility. The GitHub repository provides some resources, but additional documentation would enhance reproducibility.
The paper acknowledges limitations, such as the class imbalance introduced by the Role-Relative Projection and the reliance on visual tracking that may miss subtle facial cues. Additionally, the model's performance is evaluated primarily on two- and three-speaker settings, which may not fully represent larger group dynamics. The authors also note the potential for improved performance with more advanced visual encoders.
The implications of this research are significant for human-robot interaction and conversational AI, as it enables more natural and responsive interactions in multiparty settings. The ability to predict turn-taking dynamics using standard hardware makes this approach accessible for various applications, including social robotics, virtual assistants, and interactive media.
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
Primary: Carnegie Mellon University Africa
All Institutions: Carnegie Mellon University Africa
The main contribution of this paper is the identification of training configuration as the primary cause of quality degradation in neural audio codecs at low frame rates, challenging previous assumptions about the inherent limitations of such systems. This work offers a novel perspective on codec design, emphasizing the importance of training methodologies in achieving efficient and intelligible audio synthesis.
The authors employ a controlled ablation study to investigate the effects of low frame rates on neural audio codecs, specifically focusing on the training configurations that lead to performance degradation. They systematically analyze potential causes for the observed quality cliff, such as phonemic collisions and codebook saturation, and identify a training misconfiguration as the primary issue. This approach is methodologically sound, as it combines theoretical analysis with empirical validation, allowing for clear conclusions about the limitations of current training practices.
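The training misconfiguration is easy to see with a short calculation: with a fixed clip duration, the number of tokens the decoder sees per clip shrinks in proportion to the frame rate. The 2-second clip below is a hypothetical value, not the paper's setting, and scaling the clip length is shown only as one possible way to hold the token count constant.

```python
def tokens_per_clip(clip_seconds, frame_rate_hz):
    """Number of codec frames (tokens per codebook) in one training clip."""
    return clip_seconds * frame_rate_hz

def clip_seconds_for_tokens(target_tokens, frame_rate_hz):
    """Clip duration needed to present a fixed number of tokens to the decoder."""
    return target_tokens / frame_rate_hz

clip_seconds = 2.0  # hypothetical fixed training clip length
for rate in (12.5, 6.25, 3.125):
    print(rate, "Hz ->", tokens_per_clip(clip_seconds, rate), "tokens per clip")

# Duration that would restore the 12.5 Hz token count at 6.25 Hz.
print(clip_seconds_for_tokens(25, 6.25), "seconds")
```

Halving the frame rate halves the inter-token context available in each clip unless the clip duration is doubled, which is consistent with the review's account of the decoder being starved of context at low frame rates.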
The experiments are well-structured, utilizing a range of frame rates and comparing the performance of various codecs on established benchmarks such as WER, STOI, and SPK-SIM. The use of a comprehensive dataset (LibriSpeech) and the evaluation of multiple metrics provide a robust assessment of codec performance across different configurations. The results clearly illustrate the impact of training configuration on codec intelligibility at low frame rates, thus contributing valuable insights into the design of future codecs.
The paper provides sufficient details regarding the training process, model architectures, and evaluation metrics, which would allow other researchers to replicate the experiments. However, the lack of publicly available code or models limits the ease of reproducibility. Including a project URL with the code would enhance this aspect significantly.
One limitation is the focus on a specific dataset (LibriSpeech), which may not generalize to all audio synthesis tasks or languages. Additionally, while the authors identify training configuration as a key factor, they do not explore other potential architectural modifications that could further improve performance at low frame rates. The paper also lacks a discussion on the computational costs associated with training and inference at these low frame rates.
The findings have significant implications for the design of neural audio codecs, particularly in applications where inference efficiency is critical, such as real-time speech synthesis and low-latency communication systems. By demonstrating that low frame rates can be utilized effectively with appropriate training strategies, this work paves the way for more efficient audio processing technologies in various domains.
Codecfakes (CFs) are a type of speech deepfake generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, causing detectors trained on vocoder data to generalize poorly to CF detection. Although this has led to the development of CF detection benchmarks, existing resources are largely confined to English, and to a limited extent Chinese, leaving South-East Asian (SEA) languages unexplored. To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages, diverse speaker profiles, and a wide range of NAC architectures. SEA-CF is constructed by synthesizing publicly available real speech corpora. Our experiments show that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to SEA speech due to language-specific phonetic structures, tonal variations, and rich prosodic diversity. We further conduct a comprehensive zero-shot and fine-tuned evaluation of recent SOTA ALMs on SEA-CF. Fine-tuning the ALMs improves performance; however, their scale makes them impractical for real-world applications, particularly in low-resource and latency-constrained settings. To address this limitation, we propose a novel Small-ALM, GARUDA, tailored for CF detection, which delivers strong performance while remaining lightweight. Extensive evaluations demonstrate that the proposed Small-ALM outperforms strong end-to-end and ALM-based baselines, establishing a new, practical direction for robust CF detection in SEA languages and beyond.
Primary: IIIT-Delhi
All Institutions: IIIT-Delhi, UPES, VBSPU
This paper introduces SEA-CF, the first large-scale benchmark for CF detection in SEA languages, and proposes GARUDA, a lightweight Small-ALM that outperforms existing models while addressing practical deployment challenges. The technical contributions are significant, with a strong focus on methodology and experimental validation, positioning this work as a valuable asset in the field of audio deepfake detection.
The methodology presented in the paper is robust, introducing the SEA-CF benchmark for CF detection in SEA languages, which is a significant advancement given the lack of resources in this area. The authors propose GARUDA, a lightweight Small-ALM that effectively combines dual-encoder architectures to capture semantic and prosodic features, which is innovative. The use of JS divergence as a loss function for aligning representations is a novel approach that enhances the model's performance. Overall, the methodology is well-structured and addresses practical deployment challenges.
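The review mentions Jensen-Shannon (JS) divergence as the alignment loss but does not spell it out. As a reminder of the quantity itself, here is a minimal sketch for two discrete probability distributions. How GARUDA turns encoder representations into distributions and weights this term is not described here, so the function names and inputs below are illustrative assumptions, not the authors' implementation.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats, treating terms with p_i == 0 as contributing 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    Assumes p and q are same-length sequences that each sum to 1.
    The result is symmetric and bounded by log(2).
    """
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical usage: identical distributions give 0,
# disjoint distributions give the maximum, log(2).
same = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike plain KL divergence, this quantity is symmetric and finite even when the two distributions do not share support, which is one common reason to prefer it for aligning two representation streams.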
The experimental evaluation is comprehensive, utilizing both zero-shot and fine-tuned settings to assess the performance of GARUDA against existing SOTA models. The results demonstrate significant improvements over baselines, with rigorous statistical testing (McNemar’s test) validating the findings. The paper effectively highlights the necessity of in-domain training and the limitations of existing models when applied to SEA languages, underscoring the importance of the proposed SEA-CF benchmark.
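For readers unfamiliar with McNemar's test, the sketch below shows the exact (binomial) two-sided form for comparing two detectors on the same test items. It is an illustration under assumptions: the review does not say whether the paper uses the exact or the chi-square variant, and the counts in the usage lines are made up.

```python
import math


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: number of items detector A classified correctly and detector B did not.
    c: number of items detector B classified correctly and detector A did not.
    Items where both agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: a balanced split gives no evidence of a difference,
# a fully one-sided split gives a small p-value.
p_balanced = mcnemar_exact(5, 5)
p_one_sided = mcnemar_exact(0, 10)
```

The test only looks at items on which the two systems disagree, which is why it suits paired comparisons on a shared evaluation set.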
The paper provides sufficient details regarding the dataset construction, model architecture, and training procedures, which enhances reproducibility. The authors mention the use of publicly available datasets and provide a project URL for accessing the SEA-CF benchmark, which is crucial for other researchers looking to replicate or build upon this work.
While the paper makes significant contributions, it acknowledges limitations such as the incomplete coverage of all SEA languages and the current restriction of evaluations to available benchmarks. Future work is needed to expand the dataset and improve generalization across diverse generators.
The work has substantial implications for enhancing security against audio deepfakes in low-resource language contexts, addressing a critical gap in the current landscape of speech technology. By focusing on SEA languages, the research promotes inclusivity and provides tools that can be vital for protecting vulnerable communities against audio fraud.
A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with the privacy variable, such as the speaker identity, violating the independence assumption held by the disentanglement-based approaches, causing leakage of private information and the loss of useful information for downstream tasks. To tackle this challenge, we propose a general framework, DDPO-VC, for speaker de-identification through reinforcement learning-based post-training with diffusion models. Learning from reward signals combining knowledge from privacy-focused and utility-focused teachers, our method outperforms various strong de-identification methods in both privacy preservation and cognitive utility on two commonly used dementia speech benchmarks. Please check out our code (https://github.com/cactuswiththoughts/DDPO-VC) and demo (https://cactuswiththoughts.github.io/SpeakerDeID-Demo/).
Primary: MIT CSAIL
All Institutions: MIT CSAIL, Boston University
The main contribution of this paper is the introduction of DDPO-VC, a novel framework for speaker de-identification that balances privacy and utility through reinforcement learning and diffusion models. This work represents a significant advancement in the field, addressing critical challenges in the intersection of privacy and cognitive utility in speech processing.
The proposed DDPO-VC framework effectively integrates reinforcement learning with diffusion models to address the dual challenge of privacy and utility in speaker de-identification. The methodology is well-structured, leveraging a conditional diffusion model and a novel reward mechanism that utilizes both privacy and utility teachers. This innovative approach allows for a more nuanced optimization of the privacy-utility tradeoff, which is critical in sensitive applications such as healthcare. The use of reinforcement learning to navigate complex correlations between variables is a significant advancement over traditional disentanglement methods.
The experiments are robust, utilizing two dementia speech benchmarks that are relevant and challenging. The results demonstrate clear superiority over existing methods in both privacy preservation and cognitive utility, with well-defined metrics such as AUC and EER. The comprehensive evaluation across multiple settings (zero-shot and fine-tuned) adds credibility to the findings. However, further details on the datasets and the specific configurations used in experiments would enhance the clarity of the evaluation.
The paper provides a GitHub repository and demo link, which is a positive aspect for reproducibility. However, the implementation details could be more explicit, particularly regarding hyperparameters and training procedures, to ensure that other researchers can replicate the results accurately.
One limitation noted is the potential for reward hacking due to the fixed nature of the privacy teacher. Additionally, the reliance on pretrained models for the privacy and utility teachers may limit the generalizability of the approach to other domains. The paper also acknowledges the need for more diverse evaluation metrics beyond naturalness and speaker similarity, indicating room for improvement in the evaluation framework.
The implications of this research are significant, particularly in fields where privacy is paramount, such as healthcare. By improving speaker de-identification methods, the framework can help protect sensitive information while still allowing for the utility of speech data in applications like dementia diagnosis and monitoring. The potential for broader applications in other audio domains and utility variables further enhances its relevance.
Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), rendering the watermark vulnerable to generative attacks (e.g., neural codec and vocoder). To address this, we propose DuraMark, a robust information-level watermarking framework. It utilizes syllable duration editing to achieve watermark embedding. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model to edit syllable durations during synthesis, coupled with a duration extractor to extract these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at https://muzw.github.io/duramark_demo/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Institute of Forensic Science, Ministry of Public Security, The Hong Kong Polytechnic University
The main contribution of this paper is the introduction of DuraMark, a novel generative watermarking framework that embeds watermarks into synthesized speech by editing syllable durations, significantly improving robustness against generative attacks while preserving speech quality. This work represents a meaningful advancement in the field of audio processing and watermarking, addressing critical concerns related to deepfake technologies and the integrity of synthesized speech.
The proposed DuraMark framework introduces a novel approach to watermarking in LLM-based TTS systems by embedding watermarks at the information level through syllable duration editing. This method is innovative as it leverages a duration-controllable TTS model and a duration extractor, which allows for precise control over the watermarking process while maintaining the naturalness of the synthesized speech. The integration of these components is well-structured, and the methodology is clearly articulated, allowing for a thorough understanding of the process.
The experiments conducted are robust, utilizing a substantial dataset and comparing DuraMark against established signal-level watermarking methods. The evaluation metrics include True Positive Rate (TPR) under various attack scenarios, which is a relevant measure of robustness. The results demonstrate DuraMark's superior performance, particularly against generative attacks, which is a critical aspect of the paper's claims. The use of both objective and subjective metrics to assess speech naturalness further strengthens the experimental evaluation.
The paper provides sufficient detail regarding the experimental setup, including the datasets used and the training parameters. However, the absence of a public code repository limits reproducibility. While the methodology is clearly described, access to the code would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on a specific language (Mandarin Chinese) for the experiments, which may affect the generalizability of the findings to other languages or dialects. Additionally, while the paper demonstrates robustness against various attacks, it does not explore the performance of DuraMark under more extreme or novel attack scenarios that may arise in real-world applications.
The implications of this research are significant, particularly in the context of combating deepfake technologies and ensuring the integrity of synthesized speech. The DuraMark framework could be applied in various fields, including media, security, and digital forensics, where the authenticity of audio content is crucial. The potential for this technology to enhance trust in AI-generated content is noteworthy.
Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient style learning and thus limiting speaker similarity in synthesized speech. To this end, we investigate the prosody learning conditioned on the synthesized speech, and propose to predict the prosody of the current syllable based on previously predicted speech. Experimental results obtained on three datasets demonstrated the efficacy of the proposed dynamic prosody prediction method in enhancing the prosody learning capability, thereby improving the speaker similarity of the generated speech. Audio samples are available at https://muzw.github.io/dynapros/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, iFLYTEK
The main contribution of this paper is the introduction of a dynamic prosody prediction method that enhances speaker similarity in personalized TTS systems. This innovative approach, supported by comprehensive experimental validation, addresses key limitations in existing TTS technologies and has the potential to significantly impact the field of speech synthesis.
The proposed dynamic prosody prediction method represents a significant advancement in TTS technology by allowing for syllable-level prosody prediction based on previously generated speech. This approach addresses the limitations of existing methods that typically rely on static prosody modeling. The integration of prosody prediction into the speech generation process is well-justified and demonstrates a clear understanding of the challenges in personalized TTS systems. The methodology is sound, with a clear architecture and loss function defined, although the paper could benefit from more detailed explanations of the equations presented.
The experiments are comprehensive, utilizing three diverse datasets that cover a range of emotional and stylistic variations. The results are presented clearly, showing improvements in speaker similarity and prosody modeling capabilities. The use of both objective metrics (e.g., CER, emotion similarity) and subjective evaluations (e.g., MOS, preference tests) adds robustness to the findings. However, the paper could enhance its credibility by providing more detailed statistical analyses of the results, such as confidence intervals or significance testing.
The paper provides sufficient details regarding the experimental setup, including the datasets used, model architectures, and training procedures. The availability of the CosyVoice implementation and audio samples supports reproducibility. However, the lack of specific hyperparameter settings and training configurations for the proposed model could hinder complete reproducibility.
One limitation of the study is its focus on Mandarin Chinese, which may restrict the applicability of the findings to other languages or dialects. Additionally, while the proposed method shows promise in improving speaker similarity, the paper does not address potential challenges in real-world applications, such as the computational efficiency of the model during inference.
The proposed method has significant implications for the development of personalized TTS systems, particularly in applications such as virtual assistants, audiobooks, and entertainment. By improving speaker similarity, the approach could enhance user experience and engagement in various audio-related applications. Furthermore, the findings may inspire further research into dynamic prosody modeling in other languages and contexts.
Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme-guided cross-attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior $P(\text{spoofed}\mid X, W)$, conditioned on the acoustic representation $X$ and the phonetic posteriorgram $W$. The resulting factorization can be written as $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, where $M$ denotes the number of phonetic classes, $P(\text{spoofed} \mid X, Z = z_i)$ is the spoofing probability for the $i$-th phonetic class $z_i$ conditioned on $X$, and each $w_i$ is the prevalence of phonetic class $z_i$ in the utterance. Our transformer-based architecture instantiates this through a cross-attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax-normalized pooling supplying explicit phone-presence weights. Unlike prior approaches that rely heavily on post-hoc explainability methods, our framework offers phonetic-explainability-by-design. We evaluate the framework on an LJSpeech-derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per-phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully. Stops, fricatives, affricates, nasals, and silence-boundary closures rank most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per-articulatory category breakdown of the final verdict.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland
This paper presents a novel phoneme-guided cross-attention framework for speech deepfake detection, significantly enhancing interpretability and performance. The methodology effectively integrates phonetic structures into the detection process, providing a clear basis for understanding model decisions and contributing valuable insights to the field of audio processing and explainable AI.
The proposed methodology introduces a phoneme-guided cross-attention framework that significantly enhances the interpretability of speech deepfake detection systems. By leveraging phonetic posteriorgrams (PPGs) as a structural interface, the framework allows for a detailed analysis of the contribution of each phonetic class to the detection decision. This contrasts with traditional models that produce a single score without insight into the phonetic structure. The probabilistic factorization of the spoofing posterior into per-phone contributions is a novel approach that provides a clear, interpretable mechanism for understanding model behavior, which is a significant advancement in the field of explainable AI in speech processing.
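The factorization in the abstract, P(spoofed | X, W) = sum_i w_i * P(spoofed | X, Z = z_i), can be illustrated with a few lines of arithmetic. This is a minimal sketch, not the authors' model: in the paper the weights and per-phone probabilities come from the cross-attention block, whereas here they are hand-picked numbers, and the function name and three-class example are hypothetical.

```python
def spoof_posterior(weights, per_phone_probs, tol=1e-6):
    """Combine per-phone spoofing probabilities into one utterance-level score.

    weights: prevalence w_i of each phonetic class in the utterance (sums to 1).
    per_phone_probs: P(spoofed | X, Z = z_i) for each phonetic class.
    Returns the overall posterior and the per-class contributions w_i * p_i.
    """
    if len(weights) != len(per_phone_probs):
        raise ValueError("weights and per_phone_probs must have the same length")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to 1")
    contributions = [w * p for w, p in zip(weights, per_phone_probs)]
    return sum(contributions), contributions


# Hypothetical three-class example (e.g. stops, vowels, silence):
# 0.5 * 0.9 + 0.3 * 0.2 + 0.2 * 0.5 = 0.61
score, parts = spoof_posterior([0.5, 0.3, 0.2], [0.9, 0.2, 0.5])
```

The per-class contributions are what make the verdict inspectable: each term shows how much a phonetic category added to the final score.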
The experimental evaluation is robust, utilizing three datasets of varying complexity, including a controlled corpus and standard benchmarks like ASVspoof 2019 LA. The results demonstrate competitive performance while also providing insights into the discriminative power of different phonetic categories. The targeted phoneme-group ablation study further validates the importance of articulatory categories, confirming the model's ability to isolate and rank the contributions of different phonetic classes effectively.
The paper lacks explicit details regarding the implementation and availability of the code or models, which raises concerns about reproducibility. While the methodology is well-documented, the absence of a publicly accessible repository or demo limits the ability for other researchers to validate and build upon the findings.
One limitation is the reliance on the quality of the phonetic posteriorgrams, which may introduce noise or inaccuracies if the phoneme extraction process is not robust. Additionally, while the model shows promise in structured interpretability, it may still struggle with complex, real-world scenarios where the phonetic structure is less clear. The paper does not address potential biases in the datasets used for training and evaluation.
The implications of this work are significant, particularly in the context of forensic voice analysis and anti-spoofing measures in security systems. By enhancing the interpretability of deepfake detection, the framework could facilitate more reliable applications in legal and security settings, where understanding the basis of decisions is crucial. Furthermore, the integration of phonetic structures into detection systems may inspire new research avenues in both speech synthesis and recognition.
Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness critically depends on the quality and diversity of training data. However, existing audio-language datasets often contain substantial redundancy, where many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. Based on these unified annotations, we leverage Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Based on this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology, Korea Advanced Institute of Science and Technology, Shanghai Jiao Tong University
The main contribution of this paper is the introduction of AudioDER, a reasoning-oriented dataset designed to enhance the post-training of large audio-language models through a novel redundancy-aware construction pipeline. This work significantly advances the field by addressing the challenges of dataset redundancy and providing a comprehensive resource for improving audio reasoning capabilities in LALMs.
The proposed methodology is robust, focusing on a redundancy-aware data construction pipeline that effectively enhances the quality and diversity of training data for LALMs. The multi-stage process, which includes acoustic similarity-based deduplication, integration of existing annotations, and generation of CoT rationales, is well-structured and addresses key challenges in audio reasoning. The use of Qwen3-30B for rationale generation is particularly innovative, as it combines language understanding with audio processing to create a comprehensive dataset. The methodology is clearly articulated, with a logical flow from data collection to final dataset construction.
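The deduplication stage can be pictured as a greedy filter over audio embeddings. The sketch below is an assumption-laden illustration: the review does not state which embedding model, similarity measure, or threshold the authors use, so cosine similarity, the 0.95 cutoff, and the function names are all hypothetical choices.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def deduplicate(embeddings, threshold=0.95):
    """Greedy near-duplicate filter.

    Keeps a clip only if its similarity to every previously kept clip
    is below the threshold. Returns the indices of the kept clips.
    """
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


# Hypothetical 2-D embeddings: the second clip is nearly identical to the first.
kept_indices = deduplicate([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
```

This greedy pass is quadratic in the number of kept clips; at the scale of a 191k-sample corpus an approximate nearest-neighbour index would likely be needed, which is outside the scope of this sketch.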
The experimental evaluation is thorough, demonstrating the effectiveness of the AudioDER dataset through extensive post-training experiments on multiple audio reasoning benchmarks. The results show consistent improvements in performance across various models, indicating that the dataset is not only well-constructed but also impactful in enhancing reasoning capabilities. The benchmarks chosen (MMAU-mini, MMSU, and MMAR) are relevant and challenging, providing a solid basis for evaluating the dataset's effectiveness.
The paper provides sufficient implementation details, including the architecture used (Qwen2-Audio-7B-Instruct), training parameters, and the experimental setup. However, the lack of a publicly available demo or interactive component limits the ease of reproducibility for external researchers. The open-source nature of the dataset is a positive aspect that encourages further exploration and validation by the community.
One limitation is the reliance on existing datasets for annotations, which may introduce biases inherent in those sources. Additionally, while the redundancy filtering process is beneficial, it may inadvertently remove samples that could contribute valuable diversity. The paper does not address potential scalability issues related to the dataset size or the computational resources required for post-training on larger models.
The AudioDER dataset has significant potential for advancing research in audio reasoning and LALMs. By providing a high-quality, structured dataset, it can facilitate the development of more capable audio understanding systems, which could have applications in various fields such as accessibility, education, and entertainment. The emphasis on reducing redundancy also highlights a critical area for improvement in dataset construction practices across machine learning.
Turn-taking in multi-party spoken conversations remains a fundamental challenge for voice-based agents, particularly under dynamic floor competition and varying user expectations. We propose ModeratorLM, a role-playing voice agent that conditions turn-taking behavior on an explicitly assigned role in multi-party settings. The system is built on a speech large language model operating in a chunk-wise streaming manner. We further introduce a reasoning-augmented variant that incorporates chain-of-thought reasoning over conversational context and the assigned role. We construct RolePlayConv, a large-scale synthetic dataset of spoken multi-party conversations with diverse assistant roles. Experiments on real-world meeting data and RolePlayConv show improved turn-taking precision by over 40% and recall by more than 70%, while substantially reducing false-positive interruptions compared to non-role-conditioned baselines.
Primary: Amazon AGI
All Institutions: Amazon AGI, IIT Kharagpur
The main contribution of this paper is the introduction of ModeratorLM, a role-playing voice agent that enhances turn-taking in multi-party conversations through role conditioning and reasoning. This work represents a significant advancement in the field of conversational AI, addressing a critical challenge in multi-party interactions and providing a novel dataset for future research.
The proposed methodology introduces ModeratorLM, a role-playing voice agent that utilizes a speech large language model (LLM) to manage turn-taking in multi-party conversations. The approach is innovative in its use of role conditioning to influence turn-taking behavior, which is a significant advancement over traditional models that do not consider role dynamics. The integration of chain-of-thought reasoning in the ModeratorLM-Think variant adds an additional layer of sophistication, allowing the model to better interpret conversational context. The construction of the RolePlayConv dataset is also a notable contribution, as it provides a tailored resource for training and evaluating role-conditioned agents in multi-party settings. However, the reliance on synthetic data may raise questions about the generalizability of the findings.
The experiments conducted demonstrate a clear improvement in turn-taking precision and recall when using the ModeratorLM models compared to non-role-conditioned baselines. The use of both real-world meeting data and the synthetic RolePlayConv dataset strengthens the evaluation. The metrics reported, including precision, recall, F1-score, and reactive miss rate, provide a comprehensive view of the model's performance. The ablation studies further validate the importance of dynamic chunking and the role of reasoning in enhancing model performance. However, the lack of extensive human evaluations beyond the small-scale study may limit the robustness of the claims regarding role fidelity.
The paper provides a detailed description of the training and evaluation setup, including the architecture of the models, the dataset construction process, and the evaluation metrics. However, there is no mention of code or data availability, which is crucial for reproducibility in machine learning research. The absence of a demo or project URL also hinders the ability for others to replicate the work.
One significant limitation is the reliance on synthetic data for training the RolePlayConv dataset, which may not fully capture the complexities of real-world multi-party conversations. Additionally, while the model shows improved performance in turn-taking, it remains conservative, missing some valid response opportunities, which could affect user experience in practical applications. The paper does not address potential biases in the dataset or the model's performance across diverse demographics.
The development of role-conditioned voice agents has the potential to significantly enhance the usability of conversational AI in various applications, such as virtual assistants, customer service, and collaborative tools. By improving turn-taking behavior, these agents can facilitate more natural and effective interactions in multi-party settings. However, ethical considerations regarding the deployment of such technology, especially in sensitive contexts, must be carefully evaluated.
While low-latency interaction is critical for spoken dialogue, cascaded architectures are often bottlenecked by reactive turn-completion detection. We propose Endpoint Anticipation, shifting from reactive detection to proactive forecasting of end-of-turn signals. Our speech-based model anticipates endpoints up to 2.56 seconds in advance, enabling speculative execution of LLM and TTS pipelines on partial context. We introduce metrics to quantify the trade-off between realized latency reduction and computational redundancy. Evaluation across conversational and task-oriented datasets shows our model consistently outperforms competitive VAP-based baselines. Integration with the Unmute framework demonstrates a 505 ms average latency reduction with a 28.4% increase in speculative computation, effectively masking sequential bottlenecks to enable complex reasoning in real-time speech-to-speech interaction.
Primary: Brno University of Technology
All Institutions: Brno University of Technology, Carnegie Mellon University
The paper presents a significant advancement in low-latency spoken dialogue systems through the introduction of endpoint anticipation, which allows for proactive processing of user speech. This innovative approach, combined with a robust evaluation framework, positions the work as a valuable contribution to the field of audio and machine learning.
The paper introduces a novel approach to endpoint anticipation in spoken dialogue systems, shifting from reactive to proactive detection of end-of-turn signals. The dual-stream audio representation and the use of independent binary classification tasks for different anticipation horizons are well-structured and innovative. The proposed metrics for evaluating the trade-off between latency reduction and computational redundancy are a significant contribution to the field, allowing for a more nuanced understanding of system performance. The integration with the Unmute framework demonstrates practical applicability, although the paper could benefit from clearer explanations of the model architecture and training procedures.
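To make the idea of independent binary classification per anticipation horizon concrete, the following is a minimal sketch of how such training targets might be built. It is not the authors' implementation: the function name `horizon_labels`, the 80 ms frame hop, and the intermediate horizons of 640 ms and 1280 ms are all assumptions chosen for illustration; only the 2.56 s maximum horizon comes from the abstract.

```python
from typing import List


def horizon_labels(
    num_frames: int,
    endpoint_frame: int,
    horizons_ms: List[int],
    frame_ms: int = 80,  # hypothetical frame hop, not taken from the paper
) -> List[List[int]]:
    """Build one binary target per (frame, horizon).

    A target is 1 when the end of turn lies at most `horizon` milliseconds
    ahead of the frame, and 0 otherwise (including frames after the endpoint).
    Each horizon is treated as its own independent binary task.
    """
    labels = []
    for t in range(num_frames):
        # Integer milliseconds avoid floating-point comparison issues.
        time_to_end_ms = (endpoint_frame - t) * frame_ms
        labels.append(
            [1 if 0 <= time_to_end_ms <= h else 0 for h in horizons_ms]
        )
    return labels


# Example: a 40-frame utterance whose turn ends at frame 32.
labels = horizon_labels(
    num_frames=40, endpoint_frame=32, horizons_ms=[640, 1280, 2560]
)
```

Under these assumptions, a frame 2.56 s before the endpoint is positive only for the longest horizon, while frames close to the endpoint are positive for every horizon, so each classifier head sees a differently shifted version of the same event.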
The evaluation is thorough, utilizing two diverse datasets (SpokenWOZ and Switchboard) to assess the model's performance across various anticipation horizons. The results show a consistent improvement over the VAP baseline, with a notable average latency reduction of 505 ms. The introduction of specific metrics like Median Realized Anticipation and Expected Redundant Computation provides valuable insights into the model's efficiency and effectiveness. However, the paper could enhance its experimental rigor by including more comprehensive ablation studies to analyze the impact of different components of the model.
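The latency-versus-redundancy trade-off can be illustrated with a toy calculation. The sketch below does not reproduce the paper's Median Realized Anticipation or Expected Redundant Computation, whose exact definitions are not given here; `median_lead_ms` and `redundant_fraction` are hypothetical stand-ins, and the rule that a speculative launch counts as confirmed when the true endpoint falls within the horizon after it is an assumption.

```python
from statistics import median
from typing import Dict, List


def tradeoff_summary(turns: List[Dict], horizon_ms: int) -> Dict[str, float]:
    """Summarise lead time gained versus speculative work discarded.

    Each turn is a dict with `endpoint_ms` (true end of turn) and
    `launch_ms` (times at which a speculative LLM/TTS run was started).
    Assumed rule: a launch is confirmed if the endpoint occurs within
    `horizon_ms` after it; every other launch is counted as wasted.
    """
    lead_times = []
    launches = 0
    wasted = 0
    for turn in turns:
        confirmed = []
        for t in turn["launch_ms"]:
            launches += 1
            lead = turn["endpoint_ms"] - t
            if 0 <= lead <= horizon_ms:
                confirmed.append(lead)
            else:
                wasted += 1
        # The earliest confirmed launch gives the largest head start;
        # with no confirmed launch the system falls back to reactive detection.
        lead_times.append(max(confirmed) if confirmed else 0)
    return {
        "median_lead_ms": median(lead_times) if lead_times else 0,
        "redundant_fraction": wasted / launches if launches else 0.0,
    }


# Made-up turns for illustration only; these are not results from the paper.
turns = [
    {"endpoint_ms": 5000, "launch_ms": [2000, 4500]},
    {"endpoint_ms": 3000, "launch_ms": [1000]},
    {"endpoint_ms": 4000, "launch_ms": []},
]
result = tradeoff_summary(turns, horizon_ms=2560)
```

In this toy setting, triggering earlier raises the potential lead time but also raises the share of launches that must be thrown away, which is the tension the reported 505 ms reduction and 28.4% extra speculative computation describe.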
The authors mention that they will open-source their implementation, which is a positive step towards reproducibility. However, the paper lacks detailed information on hyperparameter tuning, model training specifics, and the exact configurations used in experiments, which could hinder replication efforts by other researchers.
One limitation is the reliance on specific datasets, which may not capture the full variability of real-world conversational speech. Additionally, while the model shows promise in structured dialogues, its performance in more spontaneous, open-domain conversations remains uncertain. The trade-off between latency reduction and computational redundancy, while quantified, may still lead to inefficiencies in certain scenarios, especially in longer dialogues.
The proposed framework has significant implications for real-time spoken dialogue systems, particularly in applications requiring low-latency interactions, such as virtual assistants and customer service bots. By enabling speculative execution of downstream processes, the model could enhance user experience in conversational AI, making interactions feel more natural and responsive. The open-source nature of the project may also foster further research and development in this area.