Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture and a visual-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania
The main contribution of this paper is the introduction of AV-Twin, a system that allows for the creation of editable audio-visual digital twins using smartphones, combining innovative acoustic modeling with user-friendly interfaces. This work represents a significant step forward in the integration of audio and visual data for realistic digital environments, with implications for multiple industries.
The methodology presented in this paper is innovative, combining mobile room impulse response (RIR) capture with a visual-assisted acoustic field model. The use of commodity smartphones for constructing audio-visual digital twins is a significant advancement, as it democratizes access to advanced acoustic modeling techniques. The differentiable acoustic rendering for recovering surface material properties is a notable technical contribution, allowing for real-time modifications and updates to both audio and visual components. However, the paper could benefit from a more detailed explanation of the underlying algorithms and their computational efficiency.
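Since the paper does not spell out the underlying algorithm, the following is a minimal, hypothetical sketch of what differentiable acoustic rendering for material recovery can look like: per-surface absorption coefficients are treated as learnable parameters and fitted by gradient descent so that a toy energy-decay model matches a measured RIR's energy envelope. The path geometry, data, and model here are placeholders, not AV-Twin's renderer.

```python
# Hypothetical sketch of differentiable acoustic rendering for material recovery.
# The toy energy-decay model, path geometry, and "measured" RIR are placeholders.
import torch

torch.manual_seed(0)
n_surfaces, n_paths, n_bins = 6, 500, 200

# Per-path geometry, assumed to come from a room model: how many times each
# path bounces off each surface, and which time bin it arrives in.
hits = torch.randint(0, 3, (n_paths, n_surfaces)).float()
arrival_bin = torch.randint(0, n_bins, (n_paths,))

# Measured RIR energy envelope (placeholder standing in for a phone capture).
measured_energy = torch.rand(n_bins)

# Learnable per-surface absorption coefficients (the "material" parameters).
raw = torch.zeros(n_surfaces, requires_grad=True)
opt = torch.optim.Adam([raw], lr=0.05)

for step in range(300):
    absorption = torch.sigmoid(raw)                       # keep in (0, 1)
    # Each path keeps (1 - absorption) of its energy per bounce on a surface.
    path_energy = torch.prod((1.0 - absorption) ** hits, dim=1)
    # Accumulate path energies into their arrival-time bins (differentiable).
    simulated = torch.zeros(n_bins).index_add(0, arrival_bin, path_energy)
    loss = torch.nn.functional.mse_loss(simulated, measured_energy)
    opt.zero_grad()
    loss.backward()
    opt.step()

print("estimated absorption:", torch.sigmoid(raw).detach())
```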
The experimental evaluation is thorough, showcasing the effectiveness of the AV-Twin system in various scenarios. The authors provide quantitative metrics for the accuracy of acoustic reconstructions and the fidelity of the visual outputs. However, the datasets used for evaluation are not extensively described, which raises questions about the generalizability of the results. More diverse environments and material types could enhance the robustness of the findings.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the authors mention the use of smartphones, they do not provide specifics on the hardware or software configurations used in their experiments. Additionally, the absence of a public code repository or demo URL limits the ability of other researchers to validate the findings independently.
One limitation of the study is the reliance on commodity smartphones, which may introduce variability in the quality of the captured data. Furthermore, the system's performance may be constrained by the physical limitations of the devices used, such as microphone sensitivity and processing power. The paper also does not address potential challenges in real-world applications, such as varying environmental conditions and user expertise.
The potential applications of AV-Twin are vast, ranging from virtual reality environments to architectural design and acoustic engineering. By enabling users to create and modify audio-visual digital twins easily, this work could significantly enhance user interaction and experience in various fields. The approach could also inspire further research into integrating acoustics with other sensory modalities in digital twin technologies.
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
Primary: Peking University
All Institutions: Peking University, University of Chinese Academy of Sciences
The main contribution of this paper is the introduction of BRACE, a benchmark for evaluating audio caption quality in a reference-free setting, which addresses critical gaps in the assessment of audio captioning metrics. This work significantly advances the field by providing a structured approach to evaluate model performance and identify areas for improvement, thereby fostering future research in audio-language understanding.
The paper introduces BRACE, a benchmark specifically designed for evaluating audio captioning metrics in a reference-free setting. It comprises two sub-benchmarks, BRACE-Main and BRACE-Hallucination, which assess fine-grained caption comparisons and hallucination detection, respectively. The methodology is robust, utilizing a combination of high-quality filtering, LLM-based corruption, and human annotation to construct datasets. The dual focus on both the quality of audio-caption alignment and the detection of hallucinations presents a comprehensive approach to addressing existing gaps in audio caption evaluation metrics. The use of diverse models and evaluation strategies enhances the credibility of the findings.
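To illustrate how a CLAPScore-style reference-free ACEM would be exercised on BRACE-Main-style items, here is a hedged sketch that scores two candidate captions against an audio clip by cosine similarity in a shared embedding space and reports an F1 over the pairwise decisions. The embedding functions, items, and labels are placeholders, not the benchmark's actual data or a real CLAP model.

```python
# Hedged sketch of reference-free pairwise caption scoring, CLAPScore-style.
# `embed` is a placeholder for a CLAP audio/text encoder; items are dummy data.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def embed(x):                      # placeholder for a CLAP audio/text encoder
    return rng.normal(size=512)

def clap_score(audio_emb, text_emb):
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(a @ t)            # cosine similarity in the shared space

# BRACE-Main style items: (audio, caption_A, caption_B, label), where the label
# says whether caption_A is the better-aligned caption.
items = [("clip.wav", "a dog barks twice", "a piano plays softly", bool(rng.random() < 0.5))
         for _ in range(100)]

preds, labels = [], []
for audio, cap_a, cap_b, a_is_better in items:
    a_emb = embed(audio)
    preds.append(clap_score(a_emb, embed(cap_a)) > clap_score(a_emb, embed(cap_b)))
    labels.append(a_is_better)

print("F1:", f1_score(labels, preds))
```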
The experiments conducted on the BRACE benchmark reveal significant insights into the performance of CLAP-based ACEMs and LALMs. The results indicate that even the best-performing models struggle to achieve high scores, highlighting the challenges in audio caption evaluation. The evaluation metrics are well-defined, and the performance of various models is systematically compared, providing a clear understanding of their limitations. The rigorous testing across different model architectures adds depth to the experimental evaluation.
The authors have taken steps to ensure reproducibility by providing access to the evaluation code and benchmark datasets. Detailed descriptions of the experimental configurations, including model settings and evaluation strategies, are included. However, the paper could benefit from more explicit instructions on how to replicate the experiments, particularly regarding the specific prompts and configurations used in LALM evaluations.
The paper acknowledges certain limitations, particularly regarding the performance of existing models on the benchmark. However, it could further elaborate on potential biases in the dataset construction process and the implications of using LLMs for generating and corrupting captions. Additionally, the computational constraints faced during experiments limit the ability to conduct extensive evaluations, which could affect the generalizability of the results.
The development of BRACE has significant implications for the field of audio understanding, particularly in enhancing accessibility and content indexing. By providing a reliable benchmark for evaluating audio captioning metrics, it can drive improvements in model development and evaluation practices. However, the potential for misuse of audio captioning technologies, such as generating misleading or inaccurate captions, should be considered, and appropriate safeguards should be discussed.
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and streaming modes. While each ASR architecture offers distinct advantages and trade-offs depending on the application, maintaining separate models for each scenario incurs substantial development and deployment costs. To address this issue, we introduce a multi-mode joiner that enables seamless integration of various ASR modes within a single unified model. Experiments show that All-in-One ASR significantly reduces the total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Furthermore, joint decoding leverages the complementary strengths of different ASR modes, yielding additional improvements in recognition accuracy.
Primary: NTT, Inc.
All Institutions: NTT, Inc.
The paper presents a novel framework that unifies multiple ASR paradigms into a single model, significantly reducing complexity and enhancing performance. The comprehensive methodology and rigorous experimental validation highlight its potential to advance the state of the art in automatic speech recognition.
The proposed All-in-One ASR framework introduces a multi-mode joiner that effectively integrates CTC, AED, and Transducer models into a single architecture. This unification is significant as it reduces the model footprint and computational overhead while maintaining or improving recognition performance. The methodology is well-structured, leveraging joint training and decoding strategies to exploit the strengths of different ASR paradigms without the need for separate decoder branches. The use of a shared encoder and the innovative joiner mechanism are noteworthy contributions that address the challenges of model complexity and resource efficiency in ASR systems.
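One way to picture a shared encoder feeding CTC, AED, and Transducer-style outputs through a single joiner is sketched below. This is an illustrative reading of the "multi-mode joiner" idea only; the layer sizes, mode embedding, and combination rule are assumptions, not the paper's architecture.

```python
# Illustrative sketch of a single encoder serving CTC, AED, and Transducer-style
# outputs via a shared joiner. Dimensions and the mode embedding are assumptions.
import torch
import torch.nn as nn

class MultiModeJoiner(nn.Module):
    def __init__(self, d_enc=256, d_pred=256, d_join=320, vocab=1000, n_modes=3):
        super().__init__()
        self.mode_emb = nn.Embedding(n_modes, d_join)   # 0: CTC, 1: AED, 2: Transducer
        self.proj_enc = nn.Linear(d_enc, d_join)
        self.proj_pred = nn.Linear(d_pred, d_join)
        self.out = nn.Linear(d_join, vocab)

    def forward(self, enc, pred, mode):
        # enc: (B, T, d_enc); pred: (B, U, d_pred) or None for encoder-only CTC.
        x = self.proj_enc(enc).unsqueeze(2)             # (B, T, 1, d_join)
        if pred is not None:
            x = x + self.proj_pred(pred).unsqueeze(1)   # (B, T, U, d_join)
        x = x + self.mode_emb(torch.tensor(mode))       # condition on the ASR mode
        return self.out(torch.tanh(x))                  # logits per (t, u) position

enc = torch.randn(2, 50, 256)                           # shared encoder output
pred = torch.randn(2, 10, 256)                          # prediction-network output
logits = MultiModeJoiner()(enc, pred, mode=2)           # Transducer-style lattice
print(logits.shape)                                     # torch.Size([2, 50, 10, 1000])
```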
The experimental evaluation is robust, utilizing well-established datasets such as TED-LIUM and LibriSpeech to demonstrate the effectiveness of the All-in-One ASR framework. The results indicate that the proposed model not only matches but often surpasses the performance of individually optimized models across various ASR tasks. The paper provides detailed comparisons and ablation studies that substantiate the claims of improved performance and reduced model size, showcasing the framework's versatility in both offline and streaming modes.
While the paper outlines the architecture and training procedures in detail, it lacks specific URLs or repositories for code and datasets, which could hinder reproducibility. The absence of a public demo or project page further limits the ability of other researchers to replicate the results. However, the comprehensive description of the methodologies and experimental setups provides a solid foundation for future implementations.
One limitation is the potential complexity introduced by the multi-mode joiner, which may require careful tuning of hyperparameters to achieve optimal performance across different ASR tasks. Additionally, the paper does not address the implications of scaling this framework to more complex or diverse ASR tasks beyond those tested. The reliance on specific datasets may also limit the generalizability of the findings.
The All-in-One ASR framework has significant implications for the deployment of ASR systems in resource-constrained environments, such as mobile devices or embedded systems, where model size and computational efficiency are critical. By unifying multiple ASR paradigms, this approach could streamline the development process and reduce costs, making advanced speech recognition technology more accessible across various applications.
This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai), the Zhipu Qingyan app/web (chatglm.cn).
Primary: Zhipu AI
All Institutions: Zhipu AI, Tsinghua University
GLM-TTS presents a robust framework for efficient and high-quality text-to-speech synthesis, effectively addressing critical challenges in the field. The innovative use of reinforcement learning and hybrid input mechanisms positions it as a significant contribution to advancing TTS technology, particularly for languages with complex phonetic structures.
The methodology of GLM-TTS is well-structured, utilizing a two-stage architecture that effectively combines autoregressive and diffusion models for TTS. The introduction of a multi-reward reinforcement learning framework is particularly innovative, addressing common challenges in TTS systems such as pronunciation accuracy and emotional expressiveness. The use of a hybrid phoneme-text input scheme and optimized speech tokenizer enhances the system's controllability and adaptability, especially for languages with complex phonetic structures like Chinese. The detailed data processing pipeline and the enhancements made to the speech tokenizer demonstrate a thorough understanding of the underlying challenges in TTS.
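To make the multi-reward reinforcement learning idea concrete, the sketch below shows a GRPO-style computation of group-relative advantages: several candidate utterances for the same prompt are scored by multiple reward models, rewards are combined with weights, and advantages are normalized within the group. The reward functions and weights are placeholders, not GLM-TTS internals.

```python
# Hedged sketch of GRPO-style group-relative advantages from multiple rewards.
# The reward functions and their weights are placeholders, not GLM-TTS internals.
import numpy as np

def pronunciation_reward(sample):       # e.g. 1 - CER from an ASR model (placeholder)
    return np.random.rand()

def speaker_sim_reward(sample):         # e.g. speaker-embedding cosine sim (placeholder)
    return np.random.rand()

def prosody_reward(sample):             # e.g. an expressiveness scorer (placeholder)
    return np.random.rand()

WEIGHTS = {"pron": 1.0, "sim": 0.5, "prosody": 0.5}

def group_advantages(samples):
    rewards = np.array([
        WEIGHTS["pron"] * pronunciation_reward(s)
        + WEIGHTS["sim"] * speaker_sim_reward(s)
        + WEIGHTS["prosody"] * prosody_reward(s)
        for s in samples
    ])
    # Group-relative normalization: each candidate is scored against its own group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

group = [f"candidate_{i}.wav" for i in range(8)]   # sampled candidates for one prompt
print(group_advantages(group))                     # would weight the policy update
```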
The experiments conducted are comprehensive, comparing GLM-TTS against state-of-the-art models across various benchmarks. The results indicate that GLM-TTS achieves competitive performance with significantly less training data, which is a notable achievement. The evaluation metrics used, including CER, WER, and SIM, provide a clear picture of the system's capabilities. However, the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper provides a link to the code repository and demo, which is a positive aspect for reproducibility. However, the details regarding the training process, hyperparameters, and specific datasets used are somewhat limited. More explicit information on the experimental setup would enhance reproducibility.
One limitation is the reliance on proprietary datasets, which may hinder the generalizability of the results. Additionally, while the system shows promise in emotional expressiveness, the paper acknowledges that the performance may vary across different emotional contexts, indicating potential areas for improvement. The complexity of the model may also pose challenges for deployment in resource-constrained environments.
The GLM-TTS system has significant implications for various applications, including virtual assistants, educational tools, and content creation. Its ability to generate high-fidelity, expressive speech with reduced training data makes it accessible for low-resource scenarios, potentially democratizing TTS technology. The focus on controllability and customization also opens avenues for personalized applications in diverse linguistic contexts.
Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from limitations, including unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies both acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes: (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations and (ii) audio-audio contrastive learning, via Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on word discrimination task while flexibly supporting both STD and KWS. To our knowledge, this is the first comprehensive approach of its kind.
Primary: Indian Institute of Technology Hyderabad
All Institutions: Indian Institute of Technology Hyderabad
The paper introduces a novel joint multimodal contrastive learning framework for robust spoken term detection and keyword spotting, demonstrating significant improvements over existing methods. The comprehensive methodology and rigorous experimental evaluation highlight its potential impact on the field of audio processing and machine learning.
The proposed joint multimodal contrastive learning framework effectively integrates audio and text modalities into a unified embedding space, addressing significant limitations of existing Acoustic Word Embedding (AWE) methods. The dual optimization of audio-text and audio-audio contrastive learning is innovative, leveraging the strengths of both modalities while enhancing intra-class compactness and inter-class separation. The methodology is well-structured, with clear explanations of the loss functions and training regime, although further details on hyperparameter tuning could enhance clarity.
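The sketch below shows one way the two objectives could be combined: a CLAP-style symmetric audio-text InfoNCE term plus an audio-audio term that pulls embeddings of the same word together. The batch structure, temperature, and weighting are assumptions for illustration; this is not the paper's exact DWD formulation.

```python
# Illustrative joint loss: CLAP-style audio-text InfoNCE plus an audio-audio term
# over same-word pairs. Assumed batch structure; not the paper's exact DWD loss.
import torch
import torch.nn.functional as F

def joint_contrastive_loss(audio_emb, text_emb, word_ids, tau=0.07, lam=0.5):
    a = F.normalize(audio_emb, dim=-1)              # (B, D) acoustic word embeddings
    t = F.normalize(text_emb, dim=-1)               # (B, D), paired row-wise with a
    # Audio-text: symmetric InfoNCE with in-batch negatives.
    logits = a @ t.T / tau
    targets = torch.arange(a.size(0))
    l_at = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
    # Audio-audio: other exemplars of the same word act as positives.
    sim = a @ a.T / tau
    eye = torch.eye(len(word_ids), dtype=torch.bool)
    pos = (word_ids[:, None] == word_ids[None, :]) & ~eye
    log_prob = sim.masked_fill(eye, -1e9)
    log_prob = log_prob - torch.logsumexp(log_prob, dim=1, keepdim=True)
    l_aa = -log_prob[pos].mean()
    return l_at + lam * l_aa

audio = torch.randn(16, 128)
text = torch.randn(16, 128)
words = torch.randint(0, 5, (16,))                  # word identity per exemplar
print(joint_contrastive_loss(audio, text, words))
```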
The experiments are robust, utilizing the LibriSpeech corpus to evaluate the proposed model against multiple baselines. The performance metrics, including Average Precision (AP) and Equal Error Rates (EER), provide a comprehensive view of the model's capabilities in both Spoken Term Detection (STD) and Keyword Spotting (KWS). The results demonstrate consistent improvements over existing methods, particularly in challenging conditions, which underscores the effectiveness of the proposed approach.
The authors emphasize reproducibility by releasing a standardized evaluation framework and the trial generation recipe alongside their codebase. This commitment to transparency is commendable and facilitates further research in the field. However, more detailed documentation on the training process and hyperparameter settings would be beneficial for full reproducibility.
While the paper presents a significant advancement, it does not extensively discuss the potential computational costs associated with the proposed model, particularly in real-time applications. Additionally, the reliance on the LibriSpeech dataset may limit the generalizability of the findings to other languages or dialects.
The proposed framework has the potential to significantly improve spoken content retrieval systems, making them more robust to variations in speaker and background noise. This advancement could enhance accessibility in various applications, such as voice-activated systems and automated transcription services, thereby contributing to the broader adoption of speech technologies.
Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence-arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence-arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross-attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo labels employing curriculum-based temperature scheduling and consistency quantification using Jensen-Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
Primary: South China University of Technology
All Institutions: South China University of Technology, Guangdong Provincial Key Laboratory of AI Large Model and Intelligent Cognition, Engineering Research Centre of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human
The paper presents a novel framework for music emotion recognition that combines a large-scale expert-annotated dataset with an innovative dual-view adaptive learning method. This work significantly contributes to addressing the challenges of data scarcity and feature drift in the field, showcasing the potential for improved emotion recognition in music.
The proposed methodology introduces a comprehensive framework for music emotion recognition, addressing critical challenges such as data scarcity and feature drift. The use of a large-scale, expert-annotated dataset (Memo2496) is a significant advancement, ensuring high-quality annotations through rigorous protocols. The Dual-View Adaptive Music Emotion Recogniser (DAMER) employs innovative modules like Dual Stream Attention Fusion (DSAF) for effective feature interaction, Progressive Confidence Labelling (PCL) for reliable pseudo-label generation, and Style Anchored Memory Learning (SAML) to mitigate cross-track feature drift. This multi-faceted approach demonstrates a thoughtful integration of various techniques, enhancing the robustness of the model.
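As an illustration of the consistency idea behind Progressive Confidence Labelling, the sketch below accepts a pseudo-label only when the Jensen-Shannon divergence between the two views' predicted distributions falls under a threshold that tightens over training. The predicted distributions, binning of the valence-arousal space, and threshold schedule are illustrative assumptions, not DAMER's exact module.

```python
# Hedged sketch of JS-divergence-based pseudo-label filtering, in the spirit of
# Progressive Confidence Labelling. Distributions, binning, and the threshold
# schedule are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def accept_pseudo_label(p_mel, p_cochlea, epoch, max_epochs, tau0=0.4, tau1=0.1):
    # Curriculum: the acceptance threshold tightens as training progresses.
    tau = tau0 + (tau1 - tau0) * epoch / max_epochs
    # scipy returns the JS distance (the square root of the divergence).
    consistency = jensenshannon(p_mel, p_cochlea) ** 2
    return consistency < tau

# Predicted distributions over a discretized valence-arousal grid, one per view.
p_mel     = np.array([0.05, 0.10, 0.60, 0.20, 0.05])
p_cochlea = np.array([0.10, 0.10, 0.55, 0.20, 0.05])
print(accept_pseudo_label(p_mel, p_cochlea, epoch=5, max_epochs=50))
```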
The experiments conducted on multiple datasets (Memo2496, 1000songs, and PMEmo) validate the effectiveness of the DAMER framework. The reported improvements in arousal dimension accuracy across different datasets underscore the model's generalizability and robustness. The ablation studies provide valuable insights into the contributions of each module, reinforcing the significance of the proposed methods. However, the paper could benefit from more extensive comparisons with a wider range of contemporary methods to fully contextualize its contributions.
The paper provides a clear description of the dataset, methodology, and experimental setup, which facilitates reproducibility. The availability of the dataset and source code on Figshare is a positive aspect, promoting transparency and enabling other researchers to replicate the findings. However, the paper lacks detailed hyperparameter settings and training configurations, which could further enhance reproducibility.
While the paper addresses several key challenges in music emotion recognition, it does not thoroughly discuss potential limitations of the proposed methods. For instance, the reliance on expert annotators, while beneficial for quality, may introduce biases that could affect the generalizability of the dataset. Additionally, the performance improvements, although significant, may not be sufficient for real-world applications where diverse and complex emotional expressions in music are encountered.
The advancements in music emotion recognition have potential applications in various fields, including personalized music recommendation systems, mental health interventions, and immersive entertainment experiences. The introduction of a high-quality dataset and a robust recognition framework can significantly enhance the accuracy and reliability of emotion-based applications in music, contributing to the broader field of affective computing.
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.
Primary: Tsinghua University
All Institutions: Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua University, Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University
The paper introduces HQ-MPSD, a high-quality multilingual dataset for partial deepfake speech detection, addressing critical gaps in existing datasets and providing a rigorous benchmark for evaluating detection models. The comprehensive methodology and experimental evaluations demonstrate significant contributions to the field, paving the way for advancements in robust detection systems.
The methodology for constructing the HQ-MPSD dataset is robust and innovative. It employs a three-stage process for generating partial deepfake speech that emphasizes linguistic coherence and acoustic fidelity. The use of fine-grained forced alignment for splice points and the normalization of loudness and spectral characteristics are noteworthy techniques that enhance the quality of the dataset. Additionally, the incorporation of background effects to simulate real-world conditions is a significant improvement over existing datasets. The careful design choices made to minimize artifacts and ensure natural transitions contribute to the dataset's overall quality and applicability for training detection models.
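A hedged sketch of the splicing idea described above: given a word boundary from forced alignment, a synthetic segment is inserted with a short crossfade so the seam leaves no abrupt discontinuity. The alignment format, file names, and fade length are assumptions, not the HQ-MPSD pipeline itself.

```python
# Hedged sketch of boundary-aware splicing with a short crossfade. The alignment
# source, signals, and 10 ms fade are illustrative assumptions.
import numpy as np

SR = 16000
FADE = int(0.01 * SR)                      # 10 ms crossfade

def splice_at_word(real, fake, word_end_sec):
    cut = int(word_end_sec * SR)           # splice point from forced alignment
    ramp = np.linspace(0.0, 1.0, FADE)
    head, tail = real[:cut].copy(), fake.copy()
    # Crossfade the last FADE samples of the bona fide head into the fake segment.
    head[-FADE:] = head[-FADE:] * (1.0 - ramp) + tail[:FADE] * ramp
    return np.concatenate([head, tail[FADE:], real[cut:]])

real = np.random.randn(3 * SR).astype(np.float32)   # stands in for bona fide audio
fake = np.random.randn(SR).astype(np.float32)       # stands in for a TTS segment
mixed = splice_at_word(real, fake, word_end_sec=1.50)
print(mixed.shape)
```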
The experiments conducted using HQ-MPSD are comprehensive and well-structured. The cross-language and cross-dataset evaluations provide valuable insights into the generalization capabilities of state-of-the-art detection models. The performance drop observed in existing models when tested on HQ-MPSD highlights the dataset's effectiveness in revealing the limitations of current methodologies. The use of metrics such as Equal Error Rate (EER) and Area Under the Curve (AUC) for evaluation is appropriate and provides a clear understanding of model performance.
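For reference, the Equal Error Rate used in these evaluations is the operating point where the false-positive and false-negative rates coincide; a standard computation from detection scores with scikit-learn is sketched below (scores and labels are dummy values).

```python
# Standard EER computation from detection scores (dummy scores and labels).
import numpy as np
from sklearn.metrics import roc_curve

labels = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])       # 1 = partially spoofed
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.7, 0.2, 0.55, 0.3])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]               # rate where FPR is closest to FNR
print(f"EER = {eer:.3f}")
```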
The paper provides sufficient detail regarding the dataset generation process and experimental setup, which aids in reproducibility. However, the lack of a publicly available code repository limits the ability for others to fully replicate the experiments. The dataset itself is accessible, which is a positive aspect for researchers looking to build upon this work.
While the dataset is a significant advancement, it may still have limitations regarding the diversity of accents and dialects within the eight languages represented. Additionally, the reliance on forced alignment may introduce its own biases, particularly if the alignment tools are not perfectly accurate. The paper does not address potential ethical concerns related to the misuse of deepfake technology, which is an important consideration in this field.
The development of HQ-MPSD has the potential to significantly advance the field of deepfake detection by providing a high-quality, multilingual benchmark that can improve the robustness of detection models. The dataset's design encourages the exploration of genuine manipulation cues rather than superficial artifacts, which can lead to more effective solutions in real-world applications. This work is particularly relevant in the context of misinformation and security, where the ability to detect partial deepfake speech can have substantial societal implications.
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, comprises two stages: 1) Tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) Fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DisCo-Speech, a novel framework for zero-shot controllable speech generation that achieves independent control over speaker timbre and speaking prosody through a disentangled speech codec. This work represents a significant step forward in the field of text-to-speech synthesis, addressing critical challenges in disentanglement and control, and providing a robust foundation for future research and applications.
The proposed methodology of DisCo-Speech is innovative, focusing on disentangling speech attributes into content, prosody, and timbre through a two-stage training paradigm. The tri-factor disentanglement approach is a significant advancement over existing methods, allowing for independent control over speech generation. The use of hybrid losses and parallel encoders is well-justified, addressing the disentanglement-reconstruction trade-off effectively. The integration of a standard LM for prosodic continuation and a specialized decoder for waveform synthesis is a thoughtful design choice that enhances the flexibility of the system.
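A hedged sketch of the tri-factor idea: three parallel encoders map speech features to content, prosody, and timbre representations, content and prosody are fused into the tokens an LM would predict, and the timbre vector is injected only at decoding. Module types, sizes, and the fusion operator are assumptions, not DisCodec's actual design.

```python
# Hedged sketch of tri-factor disentanglement with content-prosody fusion.
# Encoder choices and sizes are assumptions, not DisCodec itself.
import torch
import torch.nn as nn

class TriFactorEncoder(nn.Module):
    def __init__(self, d_in=80, d_lat=128):
        super().__init__()
        self.content = nn.GRU(d_in, d_lat, batch_first=True)
        self.prosody = nn.GRU(d_in, d_lat, batch_first=True)
        self.timbre = nn.Sequential(nn.Linear(d_in, d_lat), nn.Tanh())
        self.fuse = nn.Linear(2 * d_lat, d_lat)    # content-prosody tokens for the LM

    def forward(self, mel):                        # mel: (B, T, 80)
        c, _ = self.content(mel)
        p, _ = self.prosody(mel)
        z_timbre = self.timbre(mel).mean(dim=1)    # utterance-level speaker vector
        cp_tokens = self.fuse(torch.cat([c, p], dim=-1))
        return cp_tokens, z_timbre                 # LM predicts cp_tokens; the
                                                   # decoder re-injects z_timbre

enc = TriFactorEncoder()
cp, spk = enc(torch.randn(2, 200, 80))
print(cp.shape, spk.shape)                         # (2, 200, 128) and (2, 128)
```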
The experimental evaluation is thorough, utilizing a diverse dataset and comparing DisCo-Speech against state-of-the-art models. The results demonstrate competitive performance in voice cloning and prosody control, with clear metrics provided for reconstruction quality and controllability. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings. However, more extensive comparisons with a broader range of existing methods could provide deeper insights into its relative performance.
The paper provides sufficient detail regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. The authors also mention plans to release code and weights, which is essential for enabling other researchers to validate the findings and build upon the work. However, the absence of specific details about the training data and preprocessing steps could hinder full reproducibility.
The paper acknowledges limitations, including lower speaker similarity compared to multi-stage systems and potential instability in generating exaggerated prosody. The delicate balance between disentanglement and reconstruction fidelity is also highlighted as an ongoing challenge. These limitations suggest areas for future improvement, particularly in enhancing the expressive range and fidelity of the generated speech.
The advancements presented in DisCo-Speech have significant implications for applications in human-computer interaction, entertainment, and accessibility technologies. The ability to generate speech with controlled prosody and timbre could enhance user experience in virtual assistants, audiobooks, and language learning tools. Furthermore, the framework's potential for zero-shot learning could democratize access to high-quality speech synthesis across diverse languages and dialects.
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses variable-length polyphonic note sequences into compact 64-dimensional phrase-level representations with high reconstruction fidelity, allowing efficient training and a well-structured latent space. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of a novel latent diffusion framework for full-song multitrack symbolic music generation, which addresses significant limitations in existing models. The methodology and results indicate a promising direction for future research in symbolic music generation, although improvements in reproducibility and evaluation metrics are necessary for broader adoption and validation in the field.
The paper introduces PhraseVAE and PhraseLDM, which leverage latent diffusion for symbolic music generation. The methodology is innovative as it compresses polyphonic note sequences into a structured latent space, allowing for efficient training and generation. The use of phrase-level representations instead of note-attribute tokens is a significant shift that addresses limitations in existing models. However, the details on the training process and the specific architecture of the latent diffusion model could be elaborated further to enhance understanding.
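To make the phrase-level compression concrete, here is a hedged sketch of a VAE interface that maps a padded phrase of note events to a 64-dimensional latent and back. The note-event features and network layers are placeholders, not PhraseVAE's actual architecture.

```python
# Hedged sketch of a phrase-level VAE interface: a padded phrase of note events
# is compressed to a 64-d latent and decoded back. Features and layers are
# placeholders, not PhraseVAE's actual architecture.
import torch
import torch.nn as nn

class PhraseVAESketch(nn.Module):
    def __init__(self, max_notes=64, d_note=4, d_lat=64):   # pitch, onset, dur, velocity
        super().__init__()
        d_in = max_notes * d_note
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, 2 * d_lat))
        self.dec = nn.Sequential(nn.Linear(d_lat, 256), nn.ReLU(), nn.Linear(256, d_in))

    def forward(self, phrase):                     # phrase: (B, max_notes, d_note)
        h = self.enc(phrase.flatten(1))
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        recon = self.dec(z).view_as(phrase)
        return recon, z, mu, logvar                # z is the 64-d phrase latent

vae = PhraseVAESketch()
recon, z, mu, logvar = vae(torch.randn(8, 64, 4))
print(z.shape)                                     # torch.Size([8, 64])
```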
The experiments demonstrate the framework's ability to generate full songs with coherent structure and idiomatic instrument patterns. The evaluation metrics used to assess musical quality and generation diversity are not explicitly detailed, which could limit the assessment of the model's performance. The ability to generate 128 bars of music in a single pass is a notable achievement, indicating a strong technical contribution.
The paper does not provide sufficient details on the implementation or datasets used for training and evaluation, which raises concerns about reproducibility. Including a code repository or supplementary materials would greatly enhance the reproducibility of the results.
One limitation is the lack of detailed evaluation metrics and comparisons with existing state-of-the-art models. Additionally, while the model can generate music quickly, the paper does not discuss potential challenges in ensuring the musicality and creativity of the generated pieces over longer sequences.
The proposed framework has the potential to significantly advance the field of symbolic music generation, encouraging researchers to explore phrase-level modeling. This could lead to more sophisticated music generation systems that better capture the nuances of musical composition. The approach may also inspire applications in interactive music systems and automated composition tools.
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses an arbitrary variable-length polyphonic note sequence into a single compact 64-dimensional phrase-level latent representation with high reconstruction fidelity, allowing a well-structured latent space and efficient generative modeling. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PhraseVAE and PhraseLDM, which provide a novel latent diffusion approach for full-song multitrack symbolic music generation. This work represents a significant advancement in addressing the limitations of existing models, particularly in handling long sequences and maintaining musical coherence, thereby paving the way for future research in the domain.
The methodology introduced in this paper is innovative, leveraging a latent diffusion framework specifically tailored for symbolic music generation. The PhraseVAE component effectively compresses polyphonic note sequences into a compact latent representation, addressing the challenges of long sequences and limited context. The PhraseLDM builds on this representation to generate full songs in a single pass, which is a significant departure from traditional autoregressive models. The approach is well-structured, and the authors provide a clear rationale for their design choices, although further details on the training process and hyperparameter tuning would enhance the understanding of the methodology.
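To ground the single-pass generation claim, the sketch below samples all of a song's phrase latents at once with a simple DDIM-style reverse loop over a (phrases x 64) latent grid. The denoiser and noise schedule are placeholders, not PhraseLDM; the result would then be decoded phrase by phrase with the VAE.

```python
# Hedged sketch of single-pass sampling over a song's phrase-latent grid with a
# DDIM-style reverse loop. The denoiser and schedule are placeholders.
import torch
import torch.nn as nn

n_phrases, d_lat, steps = 128, 64, 50              # e.g. up to 128 bars/phrases
denoiser = nn.Sequential(nn.Linear(d_lat + 1, 256), nn.ReLU(), nn.Linear(256, d_lat))

betas = torch.linspace(1e-4, 0.02, steps)
alphas = torch.cumprod(1.0 - betas, dim=0)

x = torch.randn(1, n_phrases, d_lat)               # all phrase latents at once
for t in reversed(range(steps)):
    t_feat = torch.full((1, n_phrases, 1), t / steps)
    eps = denoiser(torch.cat([x, t_feat], dim=-1)) # predicted noise for every phrase
    a_t = alphas[t]
    a_prev = alphas[t - 1] if t > 0 else torch.tensor(1.0)
    x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt() # estimate of the clean latents
    x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps   # deterministic (eta = 0) step

print(x.shape)                                     # (1, 128, 64): decode with the VAE
```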
The experimental section demonstrates the capabilities of the proposed models through various metrics, including musical quality and generation diversity. The authors report that their framework can generate complete songs quickly while maintaining coherence and structure, which is a notable achievement. However, the paper would benefit from a more extensive comparison with existing state-of-the-art models to substantiate the claims of superiority in musical quality and efficiency.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. Key aspects such as the dataset used, specific training procedures, and evaluation metrics are not thoroughly detailed. Providing a code repository or supplementary materials would significantly improve the reproducibility of the results.
One limitation of the study is the lack of a comprehensive evaluation against a broader range of existing symbolic music generation models. Additionally, the reliance on a single latent representation may not capture all the nuances of music composition, potentially limiting the expressiveness of the generated pieces. The authors also do not address how the model performs with different genres or styles of music.
The proposed framework has the potential to significantly influence the field of symbolic music generation, encouraging researchers to explore phrase-level modeling over traditional note-attribute approaches. This could lead to advancements in music composition tools, AI-assisted music creation, and educational applications in music theory. The implications of generating high-quality music efficiently could also extend to various entertainment and media industries.
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models, a major class of modern generative music models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) by constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo; samples are provided on our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MR-FlowDPO, a novel framework that enhances flow-matching-based music generation through Direct Preference Optimization with multiple musical rewards, significantly improving alignment with human preferences. This work represents a meaningful advancement in music generation, combining innovative methodologies with practical applications, although it could benefit from clearer experimental details and a deeper exploration of limitations.
The methodology presented in MR-FlowDPO is innovative, leveraging Direct Preference Optimization (DPO) to align music generation with human preferences. The approach of using multiple musical rewards to evaluate text alignment, audio production quality, and semantic consistency is well-structured. The integration of scalable off-the-shelf models for reward prediction is a practical choice that enhances the model's applicability. However, the paper could benefit from a more detailed explanation of the scoring mechanism and how it specifically improves rhythmic stability.
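To make the preference-construction step concrete, the sketch below shows one way candidate generations could be ranked by multiple reward axes and turned into (chosen, rejected) pairs for DPO. This is an illustrative reading of the described pipeline, not the authors' implementation: the reward names, equal weighting, and data layout are assumptions.

```python
# Hypothetical sketch: building DPO preference pairs from multiple per-axis rewards.
# Reward names, weights, and data layout are assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class Candidate:
    audio_id: str            # identifier of a generated music clip
    text_alignment: float    # e.g., a CLAP-style text-audio similarity score
    production_quality: float
    semantic_consistency: float

def aggregate(c: Candidate, weights=(1.0, 1.0, 1.0)) -> float:
    """Combine the three reward axes into a single scalar score."""
    w_t, w_q, w_s = weights
    return w_t * c.text_alignment + w_q * c.production_quality + w_s * c.semantic_consistency

def build_preference_pair(candidates: list[Candidate]) -> tuple[Candidate, Candidate]:
    """Pick the highest- and lowest-scoring candidates for one prompt
    as the (chosen, rejected) pair fed to the DPO objective."""
    ranked = sorted(candidates, key=aggregate, reverse=True)
    return ranked[0], ranked[-1]

# Usage: score a batch of generations per prompt with the off-the-shelf
# reward predictors, then keep one (chosen, rejected) pair per prompt.
chosen, rejected = build_preference_pair([
    Candidate("clip_a", 0.71, 0.55, 0.62),
    Candidate("clip_b", 0.45, 0.80, 0.58),
    Candidate("clip_c", 0.30, 0.41, 0.39),
])
print(chosen.audio_id, rejected.audio_id)
```

A single weighted sum is the simplest aggregation; the paper's multi-reward setup may instead keep the axes separate or filter pairs by margin, which this sketch does not attempt to reproduce.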
The experiments conducted are extensive, utilizing both objective metrics and human evaluations to assess the effectiveness of the proposed model. The results indicate a significant improvement over competitive baselines, which strengthens the claims made in the paper. However, the paper lacks a detailed description of the datasets used, which is crucial for understanding the generalizability of the findings.
The authors provide links to their code and demo page, which is a positive aspect for reproducibility. However, the paper does not sufficiently detail the experimental setup, including hyperparameters and training procedures, which may hinder full reproducibility by other researchers.
One limitation is the potential subjectivity in human evaluations, which can vary widely among individuals. Additionally, the reliance on off-the-shelf models for reward prediction may introduce biases based on the limitations of those models. The paper could also explore the scalability of the approach in real-world applications beyond the experimental settings.
The implications of this research are significant for the field of music generation, as it addresses the subjective nature of music evaluation and aims to create models that better align with human preferences. This could lead to more personalized music generation applications, enhancing user experience in various domains such as entertainment and therapy.
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text overlap. Experiments on Clotho v2 demonstrate that confidence-guided beam search with semantic evaluation achieves dramatically improved calibration (CLAP-based ECE of 0.071) compared to greedy decoding baselines (ECE of 0.488), while simultaneously improving caption quality across standard metrics. Our results establish that semantic similarity provides a more meaningful foundation for confidence calibration in audio captioning than traditional n-gram metrics.
Primary: Northeastern University
All Institutions: Northeastern University
The paper presents a framework for confidence-calibrated audio captioning that redefines correctness through semantic similarity. The contributions are significant, as they advance the state of the art in audio captioning by addressing overconfidence and improving the reliability of model predictions through innovative methodologies.
The paper introduces a novel framework for confidence calibration in automated audio captioning that integrates a learned confidence prediction head with a Whisper-based model. This approach is innovative as it shifts the focus from traditional n-gram overlap metrics to semantic similarity for evaluating correctness, which is a significant advancement in the field. The architecture is well-defined, with clear descriptions of the confidence prediction head, temperature scaling, and confidence-guided beam search. The methodology is robust and addresses existing limitations in audio captioning systems effectively.
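For readers unfamiliar with such a head, the following is a minimal sketch of a learned confidence predictor over decoder hidden states with a temperature parameter, in the spirit of what the paper describes. The layer sizes, mean pooling, and single-temperature calibration are assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a confidence head on caption-decoder hidden states.
# Dimensions, pooling, and temperature handling are illustrative assumptions.
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )
        # Log-temperature for post-hoc calibration (would be fit on a held-out split).
        self.log_temperature = nn.Parameter(torch.zeros(1))

    def forward(self, decoder_hidden: torch.Tensor) -> torch.Tensor:
        """decoder_hidden: (batch, seq_len, hidden_dim) from the caption decoder.
        Returns one confidence value in [0, 1] per caption."""
        pooled = decoder_hidden.mean(dim=1)           # average over generated tokens
        logit = self.mlp(pooled).squeeze(-1)          # (batch,)
        return torch.sigmoid(logit / self.log_temperature.exp())

# Usage with dummy hidden states: 4 captions, 20 tokens each, dim 768.
head = ConfidenceHead()
confidence = head(torch.randn(4, 20, 768))
print(confidence.shape)  # torch.Size([4])
```

A per-caption score of this kind could also be used to re-rank beam candidates, which is presumably how the confidence-guided beam search mentioned above operates.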
The experiments conducted on the Clotho v2 dataset are comprehensive, demonstrating substantial improvements in both calibration and caption quality metrics. The results are compelling, with a dramatic reduction in Expected Calibration Error (ECE) from 0.488 to 0.071, showcasing the effectiveness of the proposed method. Additionally, the paper provides quantitative results across multiple evaluation metrics (BLEU, CIDEr, CLAP similarity), which strengthens the validity of the findings.
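To clarify what a semantically grounded ECE measures, the sketch below bins predicted confidences and compares them against a correctness signal defined by thresholded CLAP (or FENSE) similarity. The threshold of 0.5 and the ten bins are illustrative assumptions; the similarities themselves would be computed by an external embedding model.

```python
# Sketch of Expected Calibration Error with semantic correctness:
# a caption counts as "correct" if its audio-text similarity exceeds a threshold.
# Threshold and bin count are assumptions for illustration only.
import numpy as np

def semantic_ece(confidences, similarities, threshold=0.5, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(similarities, dtype=float) >= threshold).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()        # fraction semantically correct in this bin
        conf = confidences[mask].mean()   # mean predicted confidence in this bin
        ece += mask.mean() * abs(acc - conf)
    return ece

# Toy usage: overconfident predictions on semantically wrong captions inflate ECE.
print(semantic_ece([0.9, 0.8, 0.3, 0.2], [0.7, 0.6, 0.3, 0.2]))
```

The reported gap between greedy decoding (ECE 0.488) and confidence-guided beam search (0.071) would correspond, under this formulation, to bin-level confidences tracking the CLAP-based correctness rate far more closely.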
The implementation details are adequately described, including the model architecture, training parameters, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits reproducibility. Future work should consider making the code accessible to facilitate validation of results by the research community.
The paper acknowledges several limitations, including the somewhat arbitrary threshold for semantic correctness and the evaluation being limited to the Clotho dataset. The authors also note that the confidence head may not capture all sources of uncertainty, suggesting areas for future exploration. These limitations are important to consider for the generalization of the findings.
The proposed framework has significant implications for real-world applications of automated audio captioning, particularly in accessibility technologies and content indexing. By improving the reliability of predictions, this work could enhance user trust in automated systems, leading to broader adoption in various domains.