Turn-taking modeling is fundamental to spoken dialogue systems, yet its evaluation remains fragmented and often limited to binary boundary detection under narrow interaction settings. Such protocols hinder systematic comparison and obscure model weaknesses across conversational conditions. We present CoDeTT, a context-aware decision benchmark for turn-taking evaluation. CoDeTT formulates turn-taking as a structured decision problem and constructs a multi-scenario dataset with fine-grained decision categories and controlled context variations. Under a unified evaluation protocol, we assess representative existing models and observe substantial performance disparities across decision types and interaction scenarios. CoDeTT provides a standardized benchmark for systematic and context-aware evaluation of turn-taking systems. The benchmark dataset and evaluation toolkit are available at https://github.com/YingaoWang-casia/CoDeTT.github.io.
Primary: BRVoice Team
All Institutions: BRVoice Team
The main contribution of this paper is the introduction of CoDeTT, a context-aware decision benchmark for turn-taking evaluation that systematically addresses the limitations of existing evaluation protocols. This work is significant as it enhances the understanding of model performance in conversational systems and provides a foundation for future research in turn-taking dynamics.
The proposed methodology introduces CoDeTT, a structured decision benchmark for turn-taking evaluation that effectively captures the complexities of conversational interactions. By formulating turn-taking as a structured decision problem and creating a multi-scenario dataset with fine-grained decision categories, the authors provide a robust framework for evaluating existing models. The use of a hierarchical taxonomy and a Two-Stage Funnel Evaluation Protocol enhances the depth of analysis, allowing for both coarse and fine-grained assessments of model performance.
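To make the funnel idea concrete, here is a minimal sketch of one plausible reading of a two-stage evaluation: fine-grained accuracy is scored only on samples whose coarse-level decision was already correct. The category names (`take_turn`, `respond`, etc.) are hypothetical placeholders, not the actual CoDeTT taxonomy, and the real protocol may differ in detail.

```python
# Hypothetical coarse/fine decision taxonomy; the actual CoDeTT
# categories may differ.
COARSE = {"take_turn": {"interrupt", "respond"},
          "hold_turn": {"backchannel", "wait"}}

def coarse_of(fine):
    """Map a fine-grained decision label to its coarse parent."""
    return next(c for c, fines in COARSE.items() if fine in fines)

def funnel_eval(preds, golds):
    """Two-stage funnel: stage 1 scores the coarse decision; stage 2
    scores the fine-grained label only on stage-1 survivors."""
    stage1 = [coarse_of(p) == coarse_of(g) for p, g in zip(preds, golds)]
    stage1_acc = sum(stage1) / len(golds)
    survivors = [(p, g) for ok, (p, g)
                 in zip(stage1, zip(preds, golds)) if ok]
    stage2_acc = (sum(p == g for p, g in survivors) / len(survivors)
                  if survivors else 0.0)
    return stage1_acc, stage2_acc
```

Conditioning stage 2 on stage 1 separates "wrong kind of decision" errors from "right kind, wrong subtype" errors, which is what makes the funnel diagnostic rather than a single flat accuracy.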
The experiments are comprehensive, utilizing a large bilingual dataset of over 300 hours of dialogue, which is well-annotated and balanced across various decision scenarios. The evaluation of existing models under the CoDeTT benchmark reveals significant performance disparities, highlighting the utility of the benchmark in exposing model weaknesses. The introduction of the Semantic Misalignment Rate (SMR) as a diagnostic metric is particularly noteworthy, as it provides insights into the underlying reasoning of models.
The paper provides a clear description of the dataset construction process and the evaluation protocol, which enhances reproducibility. The availability of the dataset and evaluation toolkit on GitHub further supports this aspect. However, the paper could benefit from more detailed implementation guidelines for replicating the experiments.
One limitation is the reliance on synthetic data generation for part of the dataset, which may introduce biases or artifacts that could affect model performance. Additionally, while the benchmark exposes decision-specific performance variations, it may not fully account for all contextual nuances present in real-world conversations.
The CoDeTT benchmark has the potential to significantly influence the development of more sophisticated and context-aware conversational agents, improving user experience in spoken dialogue systems. By providing a standardized evaluation framework, it encourages further research into turn-taking modeling and may lead to advancements in human-computer interaction.
Speech LLM-based ASR often struggles with named entities and long-tail words due to strong internal language-model priors. Retrieval-augmented biasing can help, but its effectiveness depends on accurate hotword localization in full-utterance speech under weak supervision. We propose CLAR, a dual-encoder speech-text retriever that uses Continuous Integrate-and-Fire (CIF) to learn monotonic token-level alignments without timestamps. With length-aware localized matching, CLAR anchors short-entity acoustic cues and reduces representation dilution and attention drift. The retriever is trained with a multi-granularity objective combining global and local segment-level contrastive losses with a CIF quantity constraint. At inference, top-ranked hotwords are injected as contextual prompts for the Speech LLM, improving recognition without shallow fusion. Experiments show that CLAR significantly improves hotword retrieval and reduces both CER and B-WER relative to strong contextual ASR baselines.
Primary: BRVoice Team
All Institutions: BRVoice Team
The paper presents CLAR, a novel dual-encoder retrieval system that enhances contextual ASR by effectively localizing hotwords without timestamp supervision. The innovative methodology and significant experimental results position CLAR as a valuable contribution to the field of speech recognition and retrieval-augmented systems.
The paper introduces CLAR, a dual-encoder speech-text retriever that utilizes a Continuous Integrate-and-Fire (CIF) mechanism for monotonic token-level alignment without timestamps. The methodology is innovative, particularly in its approach to localized matching and the multi-granularity training objective that combines various contrastive losses. The use of CIF for hotword retrieval under weak supervision is a significant advancement in addressing the challenges of named entity recognition in ASR systems.
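For readers unfamiliar with CIF, the core mechanism is easy to state: each acoustic frame carries a small firing weight, weights are accumulated left to right, and a token boundary "fires" whenever the accumulator crosses a threshold, with the boundary frame's weight split between the finishing and the starting token. The sketch below illustrates that general mechanism only; CLAR's actual encoder, weight predictor, and training details are not reproduced here.

```python
import numpy as np

def cif(encoder_frames, alphas, threshold=1.0):
    """Continuous Integrate-and-Fire aggregation (illustrative).

    encoder_frames: (T, D) acoustic frame embeddings
    alphas: (T,) per-frame firing weights in (0, 1)
    Emits one token embedding each time the accumulated weight
    crosses `threshold`, yielding a monotonic frame-to-token alignment
    without explicit timestamps.
    """
    tokens = []
    accum = 0.0
    integrated = np.zeros(encoder_frames.shape[1])
    for h, a in zip(encoder_frames, alphas):
        if accum + a < threshold:
            # keep integrating this token
            accum += a
            integrated += a * h
        else:
            # fire: split the frame's weight at the boundary
            used = threshold - accum
            tokens.append(integrated + used * h)
            remainder = a - used
            accum = remainder
            integrated = remainder * h
    return (np.stack(tokens) if tokens
            else np.empty((0, encoder_frames.shape[1])))
```

Because the number of fired tokens equals (approximately) the sum of the weights, a "quantity constraint" tying that sum to the reference token count is the standard way to supervise the alphas weakly, which matches the timestamp-free training the review describes.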
The experiments are well-structured, utilizing relevant datasets (AISHELL-1 and AISHELL-2) and metrics (CER, B-WER, recall, F1 score) to evaluate the performance of CLAR against strong baselines. The results demonstrate significant improvements in hotword retrieval and ASR accuracy, validating the effectiveness of the proposed method. However, the paper could benefit from additional comparative analyses with more recent state-of-the-art methods.
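As a reference for the metrics mentioned above, CER is the character-level Levenshtein distance between reference and hypothesis, normalized by reference length; B-WER applies the same edit-distance computation but scores only the words in the biasing (hotword) list. A minimal CER implementation:

```python
def edit_distance(ref, hyp):
    """Classic single-row dynamic-programming Levenshtein distance."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,      # deletion
                        dp[j - 1] + 1,  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)
```

CER is the standard metric for Mandarin ASR corpora such as AISHELL-1/2, where word segmentation is ambiguous, which is presumably why the paper reports it alongside B-WER for the biased words.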
The paper provides sufficient details regarding the architecture, training procedures, and evaluation metrics, which would allow for reproducibility. However, the lack of a publicly available code repository or demo limits accessibility for further validation by the research community.
The paper does not address potential limitations in terms of scalability to larger datasets or multilingual settings, which could affect the generalizability of the findings. Additionally, the reliance on weak supervision may introduce challenges in alignment accuracy, particularly in noisy environments.
The advancements presented in this paper have significant implications for improving ASR systems in real-world applications, particularly in domains requiring accurate recognition of low-frequency words and named entities. The modular nature of CLAR allows for integration with various Speech LLMs, potentially enhancing user interactions in conversational AI systems.