Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question ``who speaks what''. By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve speech classification accuracy of up to $0.99$ with several participants in a meeting room. In addition, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
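To make the clustering idea concrete, the following is a minimal sketch, not the authors' pipeline: speech-active windows of a radar-derived vibration signal are described by their log-magnitude spectra and grouped with a density-based clustering step that does not require the number of speakers in advance. The window length, the energy gate standing in for voice activity detection, and the DBSCAN parameters are illustrative assumptions.

# Minimal sketch (not the paper's method): cluster speech-active segments of a
# radar-derived vibration signal by their frequency-domain signature, without
# assuming the number of speakers. Parameters below are illustrative assumptions.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

def segment_spectra(vib, fs, win_s=0.5, hop_s=0.25, n_fft=1024):
    """Split a 1-D vibration signal into windows and return log-magnitude spectra."""
    win, hop = int(win_s * fs), int(hop_s * fs)
    frames = [vib[i:i + win] for i in range(0, len(vib) - win, hop)]
    specs = [np.abs(np.fft.rfft(f * np.hanning(win), n_fft)) for f in frames]
    return np.log1p(np.array(specs))

def cluster_speakers(vib, fs, energy_quantile=0.6):
    specs = segment_spectra(vib, fs)
    energy = specs.sum(axis=1)
    # Keep only speech-active windows (simple energy gate as a stand-in for VAD).
    active = specs[energy > np.quantile(energy, energy_quantile)]
    feats = normalize(active)                      # unit-norm spectra, cosine-like distances
    return DBSCAN(eps=0.8, min_samples=5).fit_predict(feats)   # cluster id per window, -1 = noise

# Synthetic example: two "speakers" exciting different vibration bands.
if __name__ == "__main__":
    fs = 2000
    t = np.arange(0, 10, 1 / fs)
    spk_a = np.sin(2 * np.pi * 180 * t[: len(t) // 2])
    spk_b = np.sin(2 * np.pi * 320 * t[len(t) // 2:])
    vib = np.concatenate([spk_a, spk_b]) + 0.05 * np.random.randn(len(t))
    print(np.unique(cluster_speakers(vib, fs)))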
Primary: Florida State University
All Institutions: Florida State University, University of Tennessee at Chattanooga
The paper presents a novel attack that enables remote eavesdropping on in-person conversations via mmWave sensing. The comprehensive analysis of the technical contribution, methodology, and significance to the field underscores the potential for both innovative applications and ethical considerations in machine learning and privacy.
The paper presents an innovative unsupervised approach to eavesdropping on in-person conversations using mmWave sensing technology. The methodology is well-structured, addressing significant challenges such as low resolution of vibration signals and interference from static objects. The authors propose a multi-module system that includes a speech-aware calibration scheme, a noise-robust signal processing pipeline, and a deep learning framework for signal enhancement. The unsupervised clustering method for speaker distinction is particularly noteworthy as it allows for effective speaker attribution without prior knowledge of the number of speakers or their identities.
The experimental validation is extensive, demonstrating the effectiveness of the proposed attack in various real-world scenarios, including different object materials and speaker arrangements. The reported success rates for speaker distinction (up to 0.99) and consistent signal enhancement across setups provide strong evidence of the method's robustness. However, the reliance on synthetic datasets for training may raise questions about the generalizability of the results.
While the paper provides detailed descriptions of the experimental setup and methodology, there is a lack of publicly available code or datasets, which could hinder reproducibility. The authors mention using a specific radar system and a synthetic dataset, but without access to these resources, independent verification of results may be challenging.
The study does not address potential ethical concerns associated with the proposed eavesdropping technique. Additionally, while the method shows promise in controlled environments, its effectiveness in more complex, real-world scenarios with varying noise levels and object types remains uncertain. The performance may degrade significantly in less ideal conditions, which is not thoroughly explored in the paper.
The implications of this research are significant, as it highlights potential privacy risks associated with passive sensing technologies in shared environments. The ability to eavesdrop on sensitive conversations raises ethical concerns regarding surveillance and data privacy. This work could inform future regulations and security measures to protect against such vulnerabilities in various settings, including corporate and healthcare environments.
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EER) of 0.00\%, 4.60\%, and 4.80\% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00\%, 3.52\%, and 4.38\% across the same partitions.
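The feature-domain augmentation is described only at a high level; the sketch below illustrates one common way to realize distribution-uncertainty-style augmentation, by perturbing per-utterance channel statistics with noise scaled by their batch-level variability. Shapes, the application probability, and the exact formulation are assumptions rather than the authors' implementation.

# Hedged sketch in the spirit of distribution uncertainty modeling: per-channel
# feature statistics are perturbed with noise whose scale is estimated from the batch.
import torch

def uncertainty_augment(x: torch.Tensor, p: float = 0.5, eps: float = 1e-6) -> torch.Tensor:
    """x: (batch, time, channels) front-end features, batch size > 1 assumed."""
    if torch.rand(()) > p:                          # apply stochastically during training
        return x
    mu = x.mean(dim=1, keepdim=True)                # (B, 1, C) per-utterance channel means
    sig = x.std(dim=1, keepdim=True) + eps          # (B, 1, C) per-utterance channel stds
    sig_mu = mu.std(dim=0, keepdim=True) + eps      # batch-level uncertainty of the means
    sig_sig = sig.std(dim=0, keepdim=True) + eps    # batch-level uncertainty of the stds
    mu_new = mu + torch.randn_like(mu) * sig_mu
    sig_new = sig + torch.randn_like(sig) * sig_sig
    return (x - mu) / sig * sig_new + mu_new        # re-standardize with perturbed statistics

# feats = uncertainty_augment(ssl_features)          # ssl_features: (B, T, C)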
Primary: Brno University of Technology
All Institutions: The Hong Kong Polytechnic University, Brno University of Technology, Johns Hopkins University
This paper makes a significant contribution to the field of audio deepfake detection by introducing an innovative ensemble framework that effectively utilizes self-supervised learning models and advanced attention mechanisms. The methodology is well-founded, and the experimental results indicate a strong potential for real-world applications in audio security and verification.
The paper presents a robust ensemble framework for Environmental Sound Deepfake Detection, leveraging diverse Self-Supervised Learning (SSL) models and a Multi-Head Factorized Attention (MHFA) mechanism. The integration of feature domain augmentation based on distribution uncertainty modeling is particularly innovative, enhancing the model's robustness against unseen spectral distortions. The systematic comparison between general audio and speech-specific SSL models is well-structured, providing valuable insights into their respective performances in the context of deepfake detection.
The experimental setup is rigorous, utilizing the official EnvSDD dataset exclusively, which ensures that the results are directly relevant to the challenge. The reported Equal Error Rates (EER) demonstrate significant improvements over baseline models, particularly with the ensemble system achieving an EER of 3.52%. The results are well-documented, and the analysis of the impact of different SSL models and augmentation strategies is thorough.
The paper provides sufficient implementation details, including the training configuration, optimizer settings, and model architecture. The use of an open-source framework (WeDefense) for implementation enhances the reproducibility of the results. However, the lack of access to the dataset used for training may limit full reproducibility for external researchers.
While the approach shows promising results, the reliance on a single dataset may limit the generalizability of the findings. Additionally, the performance metrics primarily focus on EER, which may not capture all aspects of model performance, such as precision and recall in real-world applications.
The proposed methods have significant implications for the field of audio deepfake detection, particularly in applications related to security and content verification. By addressing the challenges of generalization to unseen generators, this work contributes to the ongoing efforts to develop robust anti-spoofing technologies in audio processing.
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they do not support interactive, natural-language queries and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, and loudness) in both single-ended (degraded only) and double-ended (degraded plus clean reference) setups. Instead of directly regressing scores, our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics; on held-out NISQA clips, the double-ended model attains a MOS mean absolute error (MAE) of 0.41 with Pearson correlation of 0.86, with competitive performance on dimension-wise tasks. Beyond these quantitative gains, it offers a flexible natural-language interface in which the language model acts as an audio quality expert: practitioners can query arbitrary aspects of degradations, prompt the model to emulate different listener profiles to capture human variability and produce diverse but plausible judgments rather than a single deterministic score, and thereby reduce reliance on large-scale crowdsourced tests and their monetary cost.
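The evaluation protocol, generating a textual answer, parsing a numeric score from it, and scoring with MAE and Pearson correlation, can be illustrated with a short sketch; the answer format and the parsing regex below are assumptions, not the paper's templates.

# Minimal sketch of the evaluation path: parse a numeric MOS from a generated textual
# answer, then score parsed predictions with MAE and Pearson r.
import re
import numpy as np
from scipy.stats import pearsonr

def parse_mos(answer: str):
    """Extract the first number in the MOS range [1, 5] from a free-form answer."""
    for tok in re.findall(r"\d+(?:\.\d+)?", answer):
        val = float(tok)
        if 1.0 <= val <= 5.0:
            return val
    return None

def evaluate(answers, targets):
    pairs = [(parse_mos(a), t) for a, t in zip(answers, targets)]
    pairs = [(p, t) for p, t in pairs if p is not None]       # drop unparseable answers
    preds, refs = (np.array(v, dtype=float) for v in zip(*pairs))
    mae = float(np.mean(np.abs(preds - refs)))
    r, _ = pearsonr(preds, refs)
    return mae, float(r)

# parse_mos("The overall MOS is about 3.2 due to mild background noise.")  ->  3.2
print(evaluate(["I would rate the overall MOS at 3.2.",
                "Quality is poor, roughly 1.8.",
                "Sounds clean; I'd say 4.5 overall."],
               [3.5, 2.0, 4.3]))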
Primary: UNC Chapel Hill
All Institutions: UNC Chapel Hill
The main contribution of this paper is the introduction of SpeechQualityLLM, a multimodal speech quality assessment system that leverages a language model to provide interactive, interpretable evaluations of audio quality. This work represents a significant advancement in the field by integrating natural language processing with audio quality assessment, offering a novel approach that enhances both flexibility and user engagement in evaluating speech quality.
The methodology presented in the paper is innovative as it combines audio encoding with a language model to create a multimodal QA system for speech quality assessment. The use of template-based question-answer pairs to train the model on the NISQA corpus is a thoughtful approach that allows for flexibility in querying and understanding audio quality metrics. The model's ability to generate textual answers rather than fixed scores enhances interpretability and user interaction, which is a significant advancement over traditional methods.
The experiments conducted on held-out NISQA clips demonstrate solid performance, with a mean absolute error of 0.41 and a Pearson correlation of 0.86, indicating a strong correlation with human ratings. The competitive performance on dimension-wise tasks further validates the effectiveness of the proposed system. However, the paper could benefit from a more extensive evaluation across different datasets and real-world scenarios to fully establish its robustness.
The authors provide a GitHub repository with code, model weights, and experimental results, which is a positive aspect for reproducibility. However, the paper could enhance reproducibility by including more detailed descriptions of hyperparameters, training procedures, and evaluation metrics used in the experiments.
One limitation is the reliance on the NISQA corpus, which may not encompass all possible audio quality scenarios encountered in real-world applications. Additionally, while the model offers flexibility in querying, it may still struggle with edge cases or highly subjective audio quality assessments that require nuanced human judgment.
The potential applications of SpeechQualityLLM are significant, particularly in telephony, VoIP, and streaming services, where audio quality is paramount. By reducing reliance on costly human evaluations and enabling interactive queries, this system could streamline quality assessment processes and improve user experiences across various platforms.
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and streaming modes. While each ASR architecture offers distinct advantages and trade-offs depending on the application, maintaining separate models for each scenario incurs substantial development and deployment costs. To address this issue, we introduce a multi-mode joiner that enables seamless integration of various ASR modes within a single unified model. Experiments show that All-in-One ASR significantly reduces the total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Furthermore, joint decoding leverages the complementary strengths of different ASR modes, yielding additional improvements in recognition accuracy.
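The multi-mode joiner is the key component; the following is a hedged sketch of one way a single joiner could expose CTC, Transducer, and AED outputs from shared projections. Dimensions, the mode interface, and the combination rule are assumptions and do not reproduce the paper's architecture.

# Hedged sketch, not the All-in-One ASR joiner: shared projections produce logits for
# three ASR modes from one acoustic encoder and one label-side network.
import torch
import torch.nn as nn

class MultiModeJoiner(nn.Module):
    def __init__(self, enc_dim: int, pred_dim: int, vocab: int, hidden: int = 512):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, hidden)
        self.pred_proj = nn.Linear(pred_dim, hidden)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, enc, pred, mode):
        # enc:  (B, T, enc_dim) acoustic encoder states shared across modes
        # pred: (B, U, pred_dim) label-side states (prediction network or decoder), or None
        if mode == "ctc":                 # frame-wise logits from the encoder alone
            return self.out(torch.tanh(self.enc_proj(enc)))          # (B, T, V)
        if mode == "transducer":          # combine every (frame, label) pair
            e = self.enc_proj(enc).unsqueeze(2)                      # (B, T, 1, H)
            p = self.pred_proj(pred).unsqueeze(1)                    # (B, 1, U, H)
            return self.out(torch.tanh(e + p))                       # (B, T, U, V)
        if mode == "aed":                 # label-synchronous logits; pred is assumed to
            return self.out(torch.tanh(self.pred_proj(pred)))        # already attend to enc
        raise ValueError(f"unknown mode: {mode}")

# joiner = MultiModeJoiner(enc_dim=256, pred_dim=256, vocab=5000)
# logits = joiner(torch.randn(2, 100, 256), torch.randn(2, 20, 256), mode="transducer")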
Primary: NTT, Inc.
All Institutions: NTT, Inc.
The paper presents a novel framework that unifies multiple ASR paradigms into a single model, significantly reducing complexity and enhancing performance. The comprehensive methodology and rigorous experimental validation highlight its potential to advance the state of the art in automatic speech recognition.
The proposed All-in-One ASR framework introduces a multi-mode joiner that effectively integrates CTC, AED, and Transducer models into a single architecture. This unification is significant as it reduces the model footprint and computational overhead while maintaining or improving recognition performance. The methodology is well-structured, leveraging joint training and decoding strategies to exploit the strengths of different ASR paradigms without the need for separate decoder branches. The use of a shared encoder and the innovative joiner mechanism are noteworthy contributions that address the challenges of model complexity and resource efficiency in ASR systems.
The experimental evaluation is robust, utilizing well-established datasets such as TED-LIUM and LibriSpeech to demonstrate the effectiveness of the All-in-One ASR framework. The results indicate that the proposed model not only matches but often surpasses the performance of individually optimized models across various ASR tasks. The paper provides detailed comparisons and ablation studies that substantiate the claims of improved performance and reduced model size, showcasing the framework's versatility in both offline and streaming modes.
While the paper outlines the architecture and training procedures in detail, it lacks specific URLs or repositories for code and datasets, which could hinder reproducibility. The absence of a public demo or project page further limits the ability of other researchers to replicate the results. However, the comprehensive description of the methodologies and experimental setups provides a solid foundation for future implementations.
One limitation is the potential complexity introduced by the multi-mode joiner, which may require careful tuning of hyperparameters to achieve optimal performance across different ASR tasks. Additionally, the paper does not address the implications of scaling this framework to more complex or diverse ASR tasks beyond those tested. The reliance on specific datasets may also limit the generalizability of the findings.
The All-in-One ASR framework has significant implications for the deployment of ASR systems in resource-constrained environments, such as mobile devices or embedded systems, where model size and computational efficiency are critical. By unifying multiple ASR paradigms, this approach could streamline the development process and reduce costs, making advanced speech recognition technology more accessible across various applications.
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses variable-length polyphonic note sequences into compact 64-dimensional phrase-level representations with high reconstruction fidelity, allowing efficient training and a well-structured latent space. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
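As a rough illustration of the phrase-level bottleneck, the sketch below shows a generic VAE encoder that pools a variable-length sequence of note embeddings into a 64-dimensional latent; the architecture and dimensions are assumptions, not the released PhraseVAE.

# Assumed architecture for illustration only: a phrase encoder that summarizes a
# variable-length note sequence into a 64-dimensional latent via VAE reparameterization.
import torch
import torch.nn as nn

class PhraseEncoder(nn.Module):
    def __init__(self, note_dim: int = 128, latent_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(note_dim, 256, batch_first=True, bidirectional=True)
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)

    def forward(self, notes):
        # notes: (B, N, note_dim) embedded note events of one phrase (N varies per batch)
        _, h = self.rnn(notes)                       # h: (2, B, 256) final states per direction
        h = torch.cat([h[0], h[1]], dim=-1)          # (B, 512) summary of the phrase
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return z, mu, logvar                         # z: (B, 64) phrase-level latent

# z, mu, logvar = PhraseEncoder()(torch.randn(4, 37, 128))   # e.g. a 37-note phrase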
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of a novel latent diffusion framework for full-song multitrack symbolic music generation, which addresses significant limitations in existing models. The methodology and results indicate a promising direction for future research in symbolic music generation, although improvements in reproducibility and evaluation metrics are necessary for broader adoption and validation in the field.
The paper introduces PhraseVAE and PhraseLDM, which leverage latent diffusion for symbolic music generation. The methodology is innovative as it compresses polyphonic note sequences into a structured latent space, allowing for efficient training and generation. The use of phrase-level representations instead of note-attribute tokens is a significant shift that addresses limitations in existing models. However, the details on the training process and the specific architecture of the latent diffusion model could be elaborated further to enhance understanding.
The experiments demonstrate the framework's ability to generate full songs with coherent structure and idiomatic instrument patterns. The evaluation metrics used to assess musical quality and generation diversity are not explicitly detailed, which could limit the assessment of the model's performance. The ability to generate 128 bars of music in a single pass is a notable achievement, indicating a strong technical contribution.
The paper does not provide sufficient details on the implementation or datasets used for training and evaluation, which raises concerns about reproducibility. Including a code repository or supplementary materials would greatly enhance the reproducibility of the results.
One limitation is the lack of detailed evaluation metrics and comparisons with existing state-of-the-art models. Additionally, while the model can generate music quickly, the paper does not discuss potential challenges in ensuring the musicality and creativity of the generated pieces over longer sequences.
The proposed framework has the potential to significantly advance the field of symbolic music generation, encouraging researchers to explore phrase-level modeling. This could lead to more sophisticated music generation systems that better capture the nuances of musical composition. The approach may also inspire applications in interactive music systems and automated composition tools.
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
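For readers unfamiliar with reference-free scoring, the sketch below shows the comparison a CLAPScore-style metric performs on a BRACE-Main-like pair: each candidate caption is scored by cosine similarity with the audio embedding, and the higher-scoring caption is preferred. Embedding extraction is left to whichever CLAP implementation is used; the inputs are assumed to be precomputed vectors.

# Sketch of a CLAPScore-style, reference-free pairwise comparison on precomputed embeddings.
import numpy as np

def clap_score(audio_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between one audio embedding and one caption embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(a @ t)

def prefer(audio_emb, cap_a_emb, cap_b_emb) -> str:
    """Return which caption the metric prefers for this clip ('A' or 'B')."""
    return "A" if clap_score(audio_emb, cap_a_emb) >= clap_score(audio_emb, cap_b_emb) else "B"

# Aggregating these choices against human labels over many pairs yields the F1-scores
# reported above (about 0.70 for the best CLAP-based metric).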
Primary: Peking University
All Institutions: Peking University, University of Chinese Academy of Sciences
The main contribution of this paper is the introduction of BRACE, a benchmark for evaluating audio caption quality in a reference-free setting, which addresses critical gaps in the assessment of audio captioning metrics. This work significantly advances the field by providing a structured approach to evaluate model performance and identify areas for improvement, thereby fostering future research in audio-language understanding.
The paper introduces BRACE, a benchmark specifically designed for evaluating audio captioning metrics in a reference-free setting. It comprises two sub-benchmarks, BRACE-Main and BRACE-Hallucination, which assess fine-grained caption comparisons and hallucination detection, respectively. The methodology is robust, utilizing a combination of high-quality filtering, LLM-based corruption, and human annotation to construct datasets. The dual focus on both the quality of audio-caption alignment and the detection of hallucinations presents a comprehensive approach to addressing existing gaps in audio caption evaluation metrics. The use of diverse models and evaluation strategies enhances the credibility of the findings.
The experiments conducted on the BRACE benchmark reveal significant insights into the performance of CLAP-based ACEMs and LALMs. The results indicate that even the best-performing models struggle to achieve high scores, highlighting the challenges in audio caption evaluation. The evaluation metrics are well-defined, and the performance of various models is systematically compared, providing a clear understanding of their limitations. The rigorous testing across different model architectures adds depth to the experimental evaluation.
The authors have taken steps to ensure reproducibility by providing access to the evaluation code and benchmark datasets. Detailed descriptions of the experimental configurations, including model settings and evaluation strategies, are included. However, the paper could benefit from more explicit instructions on how to replicate the experiments, particularly regarding the specific prompts and configurations used in LALM evaluations.
The paper acknowledges certain limitations, particularly regarding the performance of existing models on the benchmark. However, it could further elaborate on potential biases in the dataset construction process and the implications of using LLMs for generating and corrupting captions. Additionally, the computational constraints faced during experiments limit the ability to conduct extensive evaluations, which could affect the generalizability of the results.
The development of BRACE has significant implications for the field of audio understanding, particularly in enhancing accessibility and content indexing. By providing a reliable benchmark for evaluating audio captioning metrics, it can drive improvements in model development and evaluation practices. However, the potential for misuse of audio captioning technologies, such as generating misleading or inaccurate captions, should be considered, and appropriate safeguards should be discussed.
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture and a visual-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.
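The material-recovery step can be illustrated with a toy that is far simpler than AV-Twin's differentiable renderer: per-surface absorption coefficients are fit by gradient descent so that a Sabine-style reverberation-time model matches a measured RT60. The room volume, surface areas, and target RT60 below are made-up numbers.

# Toy illustration (not AV-Twin's renderer): recover per-surface absorption by fitting
# Sabine's RT60 formula to a measured value with gradient descent.
import torch

volume = 60.0                                    # m^3, assumed room volume
areas = torch.tensor([30.0, 30.0, 20.0, 20.0])   # m^2 per surface group (walls, floor, ...)
rt60_measured = torch.tensor(0.45)               # seconds, e.g. estimated from the captured RIR

raw = torch.zeros(4, requires_grad=True)         # unconstrained parameters
opt = torch.optim.Adam([raw], lr=0.05)

for step in range(300):
    alpha = torch.sigmoid(raw)                   # keep absorption coefficients in (0, 1)
    rt60_pred = 0.161 * volume / (areas * alpha).sum()   # Sabine's formula
    loss = (rt60_pred - rt60_measured) ** 2
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.sigmoid(raw).detach())               # estimated per-surface absorption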
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania
The main contribution of this paper is the introduction of AV-Twin, a system that allows for the creation of editable audio-visual digital twins using smartphones, combining innovative acoustic modeling with user-friendly interfaces. This work represents a significant step forward in the integration of audio and visual data for realistic digital environments, with implications for multiple industries.
The methodology presented in this paper is innovative, combining mobile room impulse response (RIR) capture with a visual-assisted acoustic field model. The use of commodity smartphones for constructing audio-visual digital twins is a significant advancement, as it democratizes access to advanced acoustic modeling techniques. The differentiable acoustic rendering for recovering surface material properties is a notable technical contribution, allowing for real-time modifications and updates to both audio and visual components. However, the paper could benefit from a more detailed explanation of the underlying algorithms and their computational efficiency.
The experimental evaluation is thorough, showcasing the effectiveness of the AV-Twin system in various scenarios. The authors provide quantitative metrics for the accuracy of acoustic reconstructions and the fidelity of the visual outputs. However, the datasets used for evaluation are not extensively described, which raises questions about the generalizability of the results. More diverse environments and material types could enhance the robustness of the findings.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the authors mention the use of smartphones, they do not provide specifics on the hardware or software configurations used in their experiments. Additionally, the absence of a public code repository or demo URL limits the ability of other researchers to validate the findings independently.
One limitation of the study is the reliance on commodity smartphones, which may introduce variability in the quality of the captured data. Furthermore, the system's performance may be constrained by the physical limitations of the devices used, such as microphone sensitivity and processing power. The paper also does not address potential challenges in real-world applications, such as varying environmental conditions and user expertise.
The potential applications of AV-Twin are vast, ranging from virtual reality environments to architectural design and acoustic engineering. By enabling users to create and modify audio-visual digital twins easily, this work could significantly enhance user interaction and experience in various fields. The approach could also inspire further research into integrating acoustics with other sensory modalities in digital twin technologies.
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models, a major class of modern generative music models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) by constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo; samples are provided on our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.
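Two ingredients named in the abstract, building preference pairs from multiple rewards and optimizing a DPO objective, are sketched below. The reward weighting and the beta value are illustrative assumptions, and the paper's flow-matching variant replaces exact log-likelihoods with model-specific surrogates.

# Hedged sketch: (i) select a preferred/rejected pair from multi-reward scores and
# (ii) the standard DPO objective on that pair.
import torch
import torch.nn.functional as F

def pick_pair(candidates, rewards, weights=(1.0, 1.0, 1.0)):
    """candidates: list of generations; rewards: per-candidate (text_align, quality, semantic)."""
    scores = [sum(w * r for w, r in zip(weights, rs)) for rs in rewards]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return candidates[order[-1]], candidates[order[0]]      # (preferred, rejected)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Log-prob tensors for preferred (w) and rejected (l) samples under the trained
    policy and a frozen reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()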
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MR-FlowDPO, a novel framework that enhances flow-matching-based music generation through Direct Preference Optimization with multiple musical rewards, significantly improving alignment with human preferences. This work represents a meaningful advancement in music generation, combining innovative methodologies with practical applications, although it could benefit from clearer experimental details and a deeper exploration of limitations.
The methodology presented in MR-FlowDPO is innovative, leveraging Direct Preference Optimization (DPO) to align music generation with human preferences. The approach of using multiple musical rewards to evaluate text alignment, audio production quality, and semantic consistency is well-structured. The integration of scalable off-the-shelf models for reward prediction is a practical choice that enhances the model's applicability. However, the paper could benefit from a more detailed explanation of the scoring mechanism and how it specifically improves rhythmic stability.
The experiments conducted are extensive, utilizing both objective metrics and human evaluations to assess the effectiveness of the proposed model. The results indicate a significant improvement over competitive baselines, which strengthens the claims made in the paper. However, the paper lacks a detailed description of the datasets used, which is crucial for understanding the generalizability of the findings.
The authors provide links to their code and demo page, which is a positive aspect for reproducibility. However, the paper does not sufficiently detail the experimental setup, including hyperparameters and training procedures, which may hinder full reproducibility by other researchers.
One limitation is the potential subjectivity in human evaluations, which can vary widely among individuals. Additionally, the reliance on off-the-shelf models for reward prediction may introduce biases based on the limitations of those models. The paper could also explore the scalability of the approach in real-world applications beyond the experimental settings.
The implications of this research are significant for the field of music generation, as it addresses the subjective nature of music evaluation and aims to create models that better align with human preferences. This could lead to more personalized music generation applications, enhancing user experience in various domains such as entertainment and therapy.
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text overlap. Experiments on Clotho v2 demonstrate that confidence-guided beam search with semantic evaluation achieves dramatically improved calibration (CLAP-based ECE of 0.071) compared to greedy decoding baselines (ECE of 0.488), while simultaneously improving caption quality across standard metrics. Our results establish that semantic similarity provides a more meaningful foundation for confidence calibration in audio captioning than traditional n-gram metrics.
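The calibration measurement can be made concrete with a short sketch: a caption counts as correct when its audio-text similarity clears a threshold, and Expected Calibration Error is the sample-weighted gap between average confidence and empirical correctness per confidence bin. The threshold and bin count below are assumptions; the paper derives similarity from CLAP and FENSE embeddings.

# Minimal ECE sketch with semantic correctness defined by a similarity threshold.
import numpy as np

def expected_calibration_error(confidences, similarities, threshold=0.5, n_bins=10):
    conf = np.asarray(confidences, dtype=float)
    correct = (np.asarray(similarities, dtype=float) >= threshold).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf >= lo) & ((conf <= hi) if hi >= 1.0 else (conf < hi))
        if in_bin.any():
            gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap           # weight each bin by its share of samples
    return float(ece)

# e.g. expected_calibration_error([0.9, 0.8, 0.3], [0.7, 0.4, 0.2])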
Primary: Northeastern University
All Institutions: Northeastern University
The paper presents a framework for confidence-calibrated audio captioning that redefines correctness through semantic similarity. The contributions are significant, as they advance the state of the art in audio captioning by addressing overconfidence and improving the reliability of model predictions through innovative methodologies.
The paper introduces a novel framework for confidence calibration in automated audio captioning that integrates a learned confidence prediction head with a Whisper-based model. This approach is innovative as it shifts the focus from traditional n-gram overlap metrics to semantic similarity for evaluating correctness, which is a significant advancement in the field. The architecture is well-defined, with clear descriptions of the confidence prediction head, temperature scaling, and confidence-guided beam search. The methodology is robust and addresses existing limitations in audio captioning systems effectively.
The experiments conducted on the Clotho v2 dataset are comprehensive, demonstrating substantial improvements in both calibration and caption quality metrics. The results are compelling, with a dramatic reduction in Expected Calibration Error (ECE) from 0.488 to 0.071, showcasing the effectiveness of the proposed method. Additionally, the paper provides quantitative results across multiple evaluation metrics (BLEU, CIDEr, CLAP similarity), which strengthens the validity of the findings.
The implementation details are adequately described, including the model architecture, training parameters, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits reproducibility. Future work should consider making the code accessible to facilitate validation of results by the research community.
The paper acknowledges several limitations, including the somewhat arbitrary threshold for semantic correctness and the evaluation being limited to the Clotho dataset. The authors also note that the confidence head may not capture all sources of uncertainty, suggesting areas for future exploration. These limitations are important to consider for the generalization of the findings.
The proposed framework has significant implications for real-world applications of automated audio captioning, particularly in accessibility technologies and content indexing. By improving the reliability of predictions, this work could enhance user trust in automated systems, leading to broader adoption in various domains.
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline (frozen Whisper encoder features, time-frequency pooling, and label-free PCA) yields strong zero-shot performance. However, VocSim also uncovers a consistent generalization gap. On blind, low-resource speech, local retrieval drops sharply. While performance remains statistically distinguishable from chance, the absolute geometric structure collapses, indicating a failure to generalize to unseen phonotactics. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We posit that the intrinsic geometric quality measured here proxies utility in unlisted downstream applications. We release data, code, and a public leaderboard to standardize the evaluation of intrinsic audio geometry.
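The local-purity metric is straightforward to state in code: Precision@k is the fraction of each clip's k nearest cosine neighbors that share its class label. The sketch below assumes row-wise embeddings and a small k; GSR and the permutation-baseline lift are not reproduced here.

# Precision@k over cosine neighbors of frozen embeddings.
import numpy as np

def precision_at_k(embeddings: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)                  # exclude each clip itself
    nn_idx = np.argsort(-sims, axis=1)[:, :k]        # indices of the k nearest neighbors
    hits = (labels[nn_idx] == labels[:, None]).mean(axis=1)
    return float(hits.mean())

# precision_at_k(np.random.randn(100, 64), np.random.randint(0, 5, 100))  # ~0.2 at chance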
Primary: Institute of Neuroinformatics, University of Zurich and ETH Zurich
All Institutions: Institute of Neuroinformatics, University of Zurich and ETH Zurich, Institute for the Interdisciplinary Study of Language Evolution, University of Zurich
The paper introduces VocSim, a training-free benchmark for evaluating zero-shot content identity in audio representations, significantly contributing to the field by providing a rigorous framework for assessing the intrinsic quality of audio embeddings. The comprehensive methodology and experimental validation enhance its relevance and potential impact on future research in audio processing and machine learning.
The authors present a novel benchmark, VocSim, designed to evaluate the intrinsic geometric alignment of audio embeddings in a zero-shot setting. The methodology is robust, employing a large dataset of 125k single-source audio clips aggregated from diverse corpora, which allows for rigorous testing of generalization capabilities. The use of training-free metrics, such as Precision@k and Global Separation Rate (GSR), is innovative and addresses the limitations of existing benchmarks that rely on supervised learning paradigms. The transductive PCA approach to mitigate anisotropy in embedding spaces is a thoughtful addition that enhances the evaluation process.
The experiments are comprehensive, evaluating multiple foundation models and providing a detailed analysis of their performance across different audio domains. The results reveal significant insights into the generalization capabilities of these models, particularly highlighting a notable performance drop on blind, low-resource speech datasets. The external validation of the embeddings' utility in predicting avian perceptual similarity and achieving state-of-the-art results on the HEAR benchmark further underscores the practical implications of the findings.
The paper includes a clear description of the experimental setup, data preprocessing, and evaluation metrics, which facilitates reproducibility. The authors have made their code and dataset publicly available, enhancing the transparency and accessibility of their research. However, the reliance on transductive PCA may complicate strict adherence to zero-shot evaluation protocols, which could be a point of contention for some researchers.
The primary limitation noted is the generalization gap observed in low-resource speech, indicating that current models may not generalize well across different phonotactics. Additionally, the benchmark's focus on single-source audio excludes polyphonic scenarios, which may limit its applicability in real-world contexts. The ethical considerations regarding data sovereignty and the handling of indigenous language data are also acknowledged, which is commendable but reflects the complexities involved in such research.
The implications of this research are significant, particularly in addressing biases in low-resource languages and the potential perpetuation of digital divides in audio processing technologies. By highlighting the performance disparities of state-of-the-art models on underrepresented languages, the authors advocate for more equitable advancements in machine learning applications. The benchmark also serves as a tool for future research to improve audio representation models, fostering advancements in bioacoustics and environmental sound classification.
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
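The chained guidance idea can be sketched as follows: starting from the unconditional prediction, correction terms are added as conditions accumulate (content, then timbre, then style), each with its own weight. The condition ordering and the model call signature are assumptions, not the paper's code.

# Hedged sketch of a chained classifier-free guidance step in the spirit of cCFG;
# `model` is a hypothetical denoiser accepting optional condition arguments.
import torch

def chained_cfg(model, x_t, t, content, timbre, style, w_c=2.0, w_spk=1.5, w_sty=3.0):
    eps_null = model(x_t, t, content=None,    timbre=None,   style=None)
    eps_c    = model(x_t, t, content=content, timbre=None,   style=None)
    eps_cs   = model(x_t, t, content=content, timbre=timbre, style=None)
    eps_all  = model(x_t, t, content=content, timbre=timbre, style=style)
    return (eps_null
            + w_c   * (eps_c   - eps_null)    # push toward the text content
            + w_spk * (eps_cs  - eps_c)       # then toward the target timbre
            + w_sty * (eps_all - eps_cs))     # then toward the prompted style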
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China
The paper presents DMP-TTS, a controllable TTS framework that effectively disentangles speaker timbre and speaking style through innovative methodologies. The contributions are significant, addressing key challenges in the field and providing a solid foundation for future research and applications in controllable speech synthesis.
The paper introduces DMP-TTS, a novel framework for controllable TTS that utilizes a latent Diffusion Transformer architecture, which is innovative in its approach to disentangling speaker timbre and speaking style. The use of a CLAP-based style encoder (Style-CLAP) for aligning audio and text cues is a significant methodological advancement. Additionally, the introduction of chained classifier-free guidance (cCFG) allows for independent control of multiple attributes, which is a notable improvement over existing methods. The methodology is well-structured, with clear explanations of the components and their interactions, although some technical details could benefit from further elaboration.
The experiments conducted are thorough, utilizing a high-quality dataset of approximately 300 hours of Chinese speech. The evaluation metrics are appropriate, including both objective measures (WER, speaker similarity) and subjective measures (NMOS, QMOS). The results demonstrate that DMP-TTS outperforms existing baselines in terms of style controllability while maintaining competitive intelligibility and naturalness. However, the paper could have benefited from a more detailed discussion of the statistical significance of the results.
The implementation details are provided, including the architecture, training procedure, and hyperparameters, which enhance reproducibility. The authors mention that code and demos will be available, which is essential for the community to validate and build upon their work. However, the absence of a direct link to a code repository limits immediate access to the implementation.
While the paper presents a strong framework, it does not address potential limitations in terms of scalability to larger datasets or multilingual capabilities. Additionally, the reliance on a specific pre-trained model (Whisper) for representation alignment may limit the generalizability of the approach. The paper also notes that speaker similarity metrics are lower than some baselines, suggesting room for improvement in that area.
The advancements in controllable TTS have significant implications for applications in virtual assistants, audiobooks, and any domain requiring personalized speech synthesis. The ability to independently manipulate style and timbre enhances user experience and could lead to more engaging human-computer interactions. The work may also inspire further research into disentangled representations in other domains of machine learning.
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EER) of 0.00\%, 4.60\%, and 4.80\% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00\%, 3.52\%, and 4.38\% across the same partitions.
Primary: Brno University of Technology
All Institutions: The Hong Kong Polytechnic University, Brno University of Technology, Johns Hopkins University
This paper makes a significant contribution to the field of audio deepfake detection by introducing an innovative ensemble framework that effectively utilizes self-supervised learning models and advanced attention mechanisms. The methodology is well-founded, and the experimental results indicate a strong potential for real-world applications in audio security and verification.
The paper presents a robust ensemble framework for Environmental Sound Deepfake Detection, leveraging diverse Self-Supervised Learning (SSL) models and a Multi-Head Factorized Attention (MHFA) mechanism. The integration of feature domain augmentation based on distribution uncertainty modeling is particularly innovative, enhancing the model's robustness against unseen spectral distortions. The systematic comparison between general audio and speech-specific SSL models is well-structured, providing valuable insights into their respective performances in the context of deepfake detection.
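The distribution-uncertainty augmentation is described only at a high level here; a minimal sketch in the spirit of perturbing per-channel feature statistics with noise scaled by their batch-level uncertainty (DSU-style augmentation) is shown below. Tensor shapes and the exact formulation are assumptions and may differ from the submission.

```python
import torch

def uncertainty_feature_aug(x, p=0.5, eps=1e-6):
    """Resample per-channel mean/std of SSL features, with the noise scale
    set by the batch-level uncertainty (std) of those statistics.

    x: front-end features of shape (batch, time, channels).
    """
    if torch.rand(1).item() > p:
        return x
    mu = x.mean(dim=1, keepdim=True)                # (B, 1, C)
    sig = x.std(dim=1, keepdim=True) + eps          # (B, 1, C)
    # Uncertainty of the statistics, estimated across the batch.
    sig_mu = mu.std(dim=0, keepdim=True) + eps      # (1, 1, C)
    sig_sig = sig.std(dim=0, keepdim=True) + eps    # (1, 1, C)
    new_mu = mu + torch.randn_like(mu) * sig_mu
    new_sig = sig + torch.randn_like(sig) * sig_sig
    return (x - mu) / sig * new_sig + new_mu

if __name__ == "__main__":
    feats = torch.randn(8, 200, 768)  # e.g. frame-level features from a BEATs/EAT front-end
    print(uncertainty_feature_aug(feats).shape)  # torch.Size([8, 200, 768])
```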
The experimental setup is rigorous, utilizing the official EnvSDD dataset exclusively, which ensures that the results are directly relevant to the challenge. The reported Equal Error Rates (EER) demonstrate significant improvements over baseline models, particularly with the fusion system achieving an EER of 3.52% on the Progress set and 4.38% on the Final Evaluation set. The results are well-documented, and the analysis of the impact of different SSL models and augmentation strategies is thorough.
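For reference, EER and a simple score-level fusion can be computed as in the sketch below; the actual fusion weights and calibration used in the BUT system are not specified here, so the equal-weight average is only an assumption.

```python
import numpy as np

def compute_eer(scores, labels):
    """EER from detection scores (higher = more likely bona fide) and
    binary labels (1 = bona fide, 0 = fake)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    fars, frrs = [], []
    for th in np.sort(np.unique(scores)):
        decisions = scores >= th
        fars.append(np.mean(decisions[labels == 0]))   # fakes accepted
        frrs.append(np.mean(~decisions[labels == 1]))  # bona fides rejected
    fars, frrs = np.array(fars), np.array(frrs)
    idx = np.argmin(np.abs(fars - frrs))
    return (fars[idx] + frrs[idx]) / 2

def fuse_scores(score_lists, weights=None):
    """Weighted-average score-level fusion of several systems."""
    mat = np.vstack(score_lists)
    w = np.ones(mat.shape[0]) / mat.shape[0] if weights is None else np.asarray(weights)
    return w @ mat

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, 1000)
    sys_a = labels + rng.normal(0, 0.8, 1000)
    sys_b = labels + rng.normal(0, 1.0, 1000)
    fused = fuse_scores([sys_a, sys_b])
    print(f"EER A: {compute_eer(sys_a, labels):.3f}, fused: {compute_eer(fused, labels):.3f}")
```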
The paper provides sufficient implementation details, including the training configuration, optimizer settings, and model architecture. The use of an open-source framework (WeDefense) for implementation enhances the reproducibility of the results. However, the lack of access to the dataset used for training may limit full reproducibility for external researchers.
While the approach shows promising results, the reliance on a single dataset may limit the generalizability of the findings. Additionally, the performance metrics primarily focus on EER, which may not capture all aspects of model performance, such as precision and recall in real-world applications.
The proposed methods have significant implications for the field of audio deepfake detection, particularly in applications related to security and content verification. By addressing the challenges of generalization to unseen generators, this work contributes to the ongoing efforts to develop robust anti-spoofing technologies in audio processing.
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they do not support interactive, natural-language queries and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, and loudness) in both single-ended (degraded only) and double-ended (degraded plus clean reference) setups. Instead of directly regressing scores, our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics; on held-out NISQA clips, the double-ended model attains a MOS mean absolute error (MAE) of 0.41 with Pearson correlation of 0.86, with competitive performance on dimension-wise tasks. Beyond these quantitative gains, it offers a flexible natural-language interface in which the language model acts as an audio quality expert: practitioners can query arbitrary aspects of degradations, prompt the model to emulate different listener profiles to capture human variability and produce diverse but plausible judgments rather than a single deterministic score, and thereby reduce reliance on large-scale crowdsourced tests and their monetary cost.
Primary: UNC Chapel Hill
All Institutions: UNC Chapel Hill
The main contribution of this paper is the introduction of SpeechQualityLLM, a multimodal speech quality assessment system that leverages a language model to provide interactive, interpretable evaluations of audio quality. This work represents a significant advancement in the field by integrating natural language processing with audio quality assessment, offering a novel approach that enhances both flexibility and user engagement in evaluating speech quality.
The methodology presented in the paper is innovative as it combines audio encoding with a language model to create a multimodal QA system for speech quality assessment. The use of template-based question-answer pairs to train the model on the NISQA corpus is a thoughtful approach that allows for flexibility in querying and understanding audio quality metrics. The model's ability to generate textual answers rather than fixed scores enhances interpretability and user interaction, which is a significant advancement over traditional methods.
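The template-based supervision can be pictured as turning each labelled clip into several question-answer pairs, one per perceptual dimension, in single-ended or double-ended form. The templates and answer phrasing below are hypothetical stand-ins, not the ones used to train SpeechQualityLLM.

```python
# Hypothetical templates; the actual templates used to train SpeechQualityLLM
# are not reproduced here.
TEMPLATES = {
    "mos": "What is the overall quality (MOS) of this clip on a 1-5 scale?",
    "noisiness": "How would you rate the noisiness of this clip from 1 to 5?",
    "coloration": "Rate the coloration of this clip on a 1-5 scale.",
    "discontinuity": "How discontinuous does this clip sound, from 1 to 5?",
    "loudness": "Rate the loudness quality of this clip from 1 to 5.",
}

def make_qa_pairs(clip_id, ratings, reference_id=None):
    """Turn one labelled clip into (question, answer) training pairs.
    `ratings` maps dimension name -> float label from a NISQA-style corpus;
    passing `reference_id` yields the double-ended variant."""
    pairs = []
    for dim, question in TEMPLATES.items():
        if dim not in ratings:
            continue
        prefix = (f"[degraded: {clip_id}] " if reference_id is None
                  else f"[degraded: {clip_id}, reference: {reference_id}] ")
        answer = f"The {dim} score is approximately {ratings[dim]:.1f}."
        pairs.append((prefix + question, answer))
    return pairs

if __name__ == "__main__":
    labels = {"mos": 3.2, "noisiness": 2.8, "loudness": 4.1}
    for q, a in make_qa_pairs("clip_0001.wav", labels):
        print(q, "->", a)
```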
The experiments conducted on held-out NISQA clips demonstrate solid performance, with a mean absolute error of 0.41 and a Pearson correlation of 0.86, indicating a strong correlation with human ratings. The competitive performance on dimension-wise tasks further validates the effectiveness of the proposed system. However, the paper could benefit from a more extensive evaluation across different datasets and real-world scenarios to fully establish its robustness.
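Since the model emits text rather than scores, evaluation hinges on parsing a number from each generated answer and scoring it with standard regression metrics; a minimal sketch of that parse-then-score step (MAE and Pearson correlation) follows. The regex and answer formats are assumptions.

```python
import re
import numpy as np

def parse_score(answer_text):
    """Extract the first number from a generated answer, e.g.
    'The MOS is approximately 3.4.' -> 3.4; returns None if absent."""
    match = re.search(r"[-+]?\d+(?:\.\d+)?", answer_text)
    return float(match.group()) if match else None

def mae_and_pearson(preds, targets):
    preds, targets = np.asarray(preds, float), np.asarray(targets, float)
    mae = np.mean(np.abs(preds - targets))
    pearson = np.corrcoef(preds, targets)[0, 1]
    return mae, pearson

if __name__ == "__main__":
    answers = ["The MOS is approximately 3.4.", "I would rate this clip 2.1 out of 5."]
    labels = [3.6, 2.0]
    preds = [parse_score(a) for a in answers]
    print(mae_and_pearson(preds, labels))
```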
The authors provide a GitHub repository with code, model weights, and experimental results, which is a positive aspect for reproducibility. However, the paper could enhance reproducibility by including more detailed descriptions of hyperparameters, training procedures, and evaluation metrics used in the experiments.
One limitation is the reliance on the NISQA corpus, which may not encompass all possible audio quality scenarios encountered in real-world applications. Additionally, while the model offers flexibility in querying, it may still struggle with edge cases or highly subjective audio quality assessments that require nuanced human judgment.
The potential applications of SpeechQualityLLM are significant, particularly in telephony, VoIP, and streaming services, where audio quality is paramount. By reducing reliance on costly human evaluations and enabling interactive queries, the system could streamline quality assessment processes and improve user experiences across various platforms.
Real-time speech communication over wireless networks remains challenging, as conventional channel protection mechanisms cannot effectively counter packet loss under stringent bandwidth and latency constraints. Semantic communication has emerged as a promising paradigm for enhancing the robustness of speech transmission by means of joint source-channel coding (JSCC). However, its cross-layer design hinders practical deployment because it is incompatible with existing digital communication systems, so in practice the robustness of speech communication is evaluated primarily by its resilience to packet loss over wireless networks. To address these challenges, we propose \emph{Glaris}, a generative latent-prior-based resilient speech semantic communication framework that performs resilient speech coding in the generative latent space. Generative latent priors enable high-quality packet loss concealment (PLC) at the receiver side, balancing semantic consistency and reconstruction fidelity. Additionally, an integrated error resilience mechanism is designed to mitigate error propagation and improve the effectiveness of PLC. Compared with traditional packet-level forward error correction (FEC) strategies, the proposed method achieves greater robustness over dynamic wireless networks while significantly reducing redundancy overhead. Experimental results on the LibriSpeech dataset demonstrate that \emph{Glaris} consistently outperforms existing error-resilient codecs, achieving JSCC-level robustness while maintaining seamless compatibility with existing systems, and striking a favorable balance between transmission efficiency and speech reconstruction quality.
Primary: Guilin University of Electronic Technology
All Institutions: Guilin University of Electronic Technology, Beijing University of Posts and Telecommunications
The main contribution of this paper is the introduction of Glaris, a generative latent-prior-based framework for resilient speech communication that effectively balances semantic consistency and reconstruction fidelity in the presence of packet loss. The comprehensive methodology and experimental validation demonstrate its potential to advance the field of error-resilient communication systems.
The paper introduces Glaris, a novel framework for error-resilient semantic communication in speech transmission over packet-loss networks. It employs a two-stage coding architecture utilizing generative latent priors to enhance both semantic consistency and reconstruction fidelity. The integration of in-band forward error correction (FEC) with packet loss concealment (PLC) is a significant methodological advancement, as it allows for adaptive redundancy control and improved resilience against dynamic channel conditions. The use of a VQ-VAE for encoding high-dimensional speech into a compact latent representation is well-justified and effectively addresses the challenges of maintaining quality under packet loss.
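As a rough illustration of how in-band redundancy and latent-domain PLC can interact, the toy sketch below piggybacks a coarse copy of the previous latent frame on each packet and falls back to a trivial repeat-and-fade predictor when both the packet and its redundancy are lost. Glaris's actual redundancy coding and generative latent prior are considerably more sophisticated and are not reproduced here.

```python
import numpy as np

def packetize(latent_frames, decimals=1):
    """Each packet carries the current latent frame plus a coarsely rounded
    copy of the previous frame as low-overhead in-band redundancy."""
    packets, prev = [], None
    for i, z in enumerate(latent_frames):
        redundant = None if prev is None else np.round(prev, decimals)
        packets.append({"seq": i, "latent": z, "redundant_prev": redundant})
        prev = z
    return packets

def naive_plc(history, fade=0.8):
    """Stand-in for the generative latent prior: repeat-and-fade the last
    reconstructed frame (a learned prior would predict it instead)."""
    return history[-1] * fade if history else np.zeros(8)

def receive(packets, lost, plc=naive_plc):
    """Use the frame if its packet arrived; otherwise the redundancy carried
    by the next packet; otherwise fall back to PLC."""
    out = []
    for i, pkt in enumerate(packets):
        if i not in lost:
            out.append(pkt["latent"])
        elif (i + 1 < len(packets) and i + 1 not in lost
              and packets[i + 1]["redundant_prev"] is not None):
            out.append(packets[i + 1]["redundant_prev"])
        else:
            out.append(plc(out))
    return np.stack(out)

if __name__ == "__main__":
    frames = [np.random.default_rng(i).standard_normal(8) for i in range(6)]
    recon = receive(packetize(frames), lost={2, 3})
    print(recon.shape)  # (6, 8)
```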
The experimental evaluation is thorough, utilizing the LibriSpeech dataset under various packet-loss conditions, including both simulated and real-world scenarios. The results demonstrate that Glaris outperforms existing codecs in terms of robustness and efficiency. The use of multiple metrics (PESQ, STOI, WER, and MOS) provides a comprehensive assessment of performance, and the subjective listening tests further validate the framework's effectiveness. However, the paper could benefit from a more detailed comparison with additional state-of-the-art methods to strengthen the claims of superiority.
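The simulated packet-loss conditions are not detailed in this summary; a two-state Gilbert model is a common way to generate bursty loss traces for such evaluations and is sketched below purely as an illustration, without implying it is the trace model used in the paper.

```python
import numpy as np

def gilbert_loss_trace(n_packets, p_g2b=0.05, p_b2g=0.3, loss_in_bad=0.8, seed=0):
    """Two-state Markov (Gilbert) model for bursty packet loss.
    p_g2b: P(good -> bad); p_b2g: P(bad -> good); loss_in_bad: loss prob. in
    the bad state. Returns a boolean array, True = packet lost."""
    rng = np.random.default_rng(seed)
    lost = np.zeros(n_packets, dtype=bool)
    bad = False
    for i in range(n_packets):
        bad = rng.random() < ((1 - p_b2g) if bad else p_g2b)
        lost[i] = bad and rng.random() < loss_in_bad
    return lost

if __name__ == "__main__":
    trace = gilbert_loss_trace(1000)
    print(f"simulated loss rate: {trace.mean():.2%}")
```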
The paper provides a detailed methodology and experimental setup, including training parameters and loss functions, which enhances reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for others to replicate the results directly. Including such resources would significantly improve the paper's reproducibility.
One limitation is the potential trade-off between redundancy and compression efficiency, particularly in low-bitrate scenarios where excessive redundancy may degrade quality. Additionally, while the framework shows promise, its performance in highly variable real-world conditions remains to be fully validated. The reliance on specific datasets may also limit the generalizability of the findings.
The proposed framework has significant implications for real-time speech communication applications, particularly in scenarios where packet loss is common, such as VoIP and online meetings. By improving the robustness and efficiency of speech transmission, Glaris could enhance the user experience in various latency-sensitive applications. The integration of semantic communication principles may also pave the way for future advancements in other modalities, such as video and text transmission.