Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question "who speaks what". By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve a speech classification accuracy of up to 0.99 with several participants in a meeting room. Meanwhile, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
Primary: Florida State University
All Institutions: Florida State University, University of Tennessee at Chattanooga
The paper presents a novel attack that enables remote eavesdropping on in-person conversations via mmWave sensing. The comprehensive analysis of the technical contribution, methodology, and significance to the field underscores the potential for both innovative applications and ethical considerations in machine learning and privacy.
The paper presents an innovative unsupervised approach to eavesdropping on in-person conversations using mmWave sensing technology. The methodology is well-structured, addressing significant challenges such as low resolution of vibration signals and interference from static objects. The authors propose a multi-module system that includes a speech-aware calibration scheme, a noise-robust signal processing pipeline, and a deep learning framework for signal enhancement. The unsupervised clustering method for speaker distinction is particularly noteworthy as it allows for effective speaker attribution without prior knowledge of the number of speakers or their identities.
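To make the speaker-distinction idea concrete, the following minimal sketch (not the authors' code) clusters speech segments by their frequency-domain vibration signatures without assuming the number of speakers; the sampling rate, Welch parameters, and DBSCAN settings are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code): cluster speech segments by
# frequency-domain vibration signatures without knowing the speaker count.
import numpy as np
from scipy.signal import welch
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

def segment_features(vib_segments, fs=1000):
    """Compute a power-spectral-density feature vector per vibration segment."""
    feats = []
    for seg in vib_segments:
        freqs, psd = welch(seg, fs=fs, nperseg=min(256, len(seg)))
        psd = psd / (psd.sum() + 1e-12)       # normalize to reduce amplitude effects
        feats.append(np.log(psd + 1e-12))     # log-spectrum is more noise-robust
    return np.asarray(feats)

def cluster_speakers(vib_segments, fs=1000, eps=3.0, min_samples=5):
    """Assign each segment a speaker label; -1 marks noise/outlier segments."""
    X = StandardScaler().fit_transform(segment_features(vib_segments, fs))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels  # number of clusters = inferred number of distinct speakers
```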
The experimental validation is extensive, demonstrating the effectiveness of the proposed attack in various real-world scenarios, including different object materials and speaker arrangements. The reported success rates for speaker distinction (up to 0.99) and consistent signal enhancement across setups provide strong evidence of the method's robustness. However, the reliance on synthetic datasets for training may raise questions about the generalizability of the results.
While the paper provides detailed descriptions of the experimental setup and methodology, there is a lack of publicly available code or datasets, which could hinder reproducibility. The authors mention using a specific radar system and a synthetic dataset, but without access to these resources, independent verification of results may be challenging.
The study does not address potential ethical concerns associated with the proposed eavesdropping technique. Additionally, while the method shows promise in controlled environments, its effectiveness in more complex, real-world scenarios with varying noise levels and object types remains uncertain. The performance may degrade significantly in less ideal conditions, which is not thoroughly explored in the paper.
The implications of this research are significant, as it highlights potential privacy risks associated with passive sensing technologies in shared environments. The ability to eavesdrop on sensitive conversations raises ethical concerns regarding surveillance and data privacy. This work could inform future regulations and security measures to protect against such vulnerabilities in various settings, including corporate and healthcare environments.
This paper describes the BUT submission to the ESDD 2026 Challenge, specifically focusing on Track 1: Environmental Sound Deepfake Detection with Unseen Generators. To address the critical challenge of generalizing to audio generated by unseen synthesis algorithms, we propose a robust ensemble framework leveraging diverse Self-Supervised Learning (SSL) models. We conduct a comprehensive analysis of general audio SSL models (including BEATs, EAT, and Dasheng) and speech-specific SSLs. These front-ends are coupled with a lightweight Multi-Head Factorized Attention (MHFA) back-end to capture discriminative representations. Furthermore, we introduce a feature domain augmentation strategy based on distribution uncertainty modeling to enhance model robustness against unseen spectral distortions. All models are trained exclusively on the official EnvSDD data, without using any external resources. Experimental results demonstrate the effectiveness of our approach: our best single system achieved Equal Error Rates (EER) of 0.00%, 4.60%, and 4.80% on the Development, Progress (Track 1), and Final Evaluation sets, respectively. The fusion system further improved generalization, yielding EERs of 0.00%, 3.52%, and 4.38% across the same partitions.
Primary: Brno University of Technology
All Institutions: The Hong Kong Polytechnic University, Brno University of Technology, Johns Hopkins University
This paper makes a significant contribution to the field of audio deepfake detection by introducing an innovative ensemble framework that effectively utilizes self-supervised learning models and advanced attention mechanisms. The methodology is well-founded, and the experimental results indicate a strong potential for real-world applications in audio security and verification.
The paper presents a robust ensemble framework for Environmental Sound Deepfake Detection, leveraging diverse Self-Supervised Learning (SSL) models and a Multi-Head Factorized Attention (MHFA) mechanism. The integration of feature domain augmentation based on distribution uncertainty modeling is particularly innovative, enhancing the model's robustness against unseen spectral distortions. The systematic comparison between general audio and speech-specific SSL models is well-structured, providing valuable insights into their respective performances in the context of deepfake detection.
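The paper's exact augmentation formulation is not reproduced here; the sketch below illustrates the general idea of distribution-uncertainty feature augmentation, in which per-sample feature statistics are perturbed with noise scaled by their batch-level variability. All names and default values are assumptions.

```python
# Minimal sketch of distribution-uncertainty feature augmentation; the exact
# formulation used in the paper may differ.
import torch

def uncertainty_augment(feats: torch.Tensor, p: float = 0.5, eps: float = 1e-6,
                        training: bool = True) -> torch.Tensor:
    """feats: (batch, time, dim) SSL features. Perturbs per-sample feature
    statistics with noise scaled by the batch-level uncertainty of those
    statistics; applied with probability p during training only."""
    if not training or torch.rand(()) > p:
        return feats
    mu = feats.mean(dim=1, keepdim=True)                 # per-sample mean, (B, 1, D)
    sigma = feats.std(dim=1, keepdim=True) + eps         # per-sample std,  (B, 1, D)
    sigma_mu = mu.std(dim=0, keepdim=True) + eps         # uncertainty of the means
    sigma_sigma = sigma.std(dim=0, keepdim=True) + eps   # uncertainty of the stds
    new_mu = mu + torch.randn_like(mu) * sigma_mu
    new_sigma = sigma + torch.randn_like(sigma) * sigma_sigma
    return (feats - mu) / sigma * new_sigma + new_mu
```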
The experimental setup is rigorous, utilizing the official EnvSDD dataset exclusively, which ensures that the results are directly relevant to the challenge. The reported Equal Error Rates (EER) demonstrate significant improvements over baseline models, particularly with the ensemble system achieving an EER of 3.52%. The results are well-documented, and the analysis of the impact of different SSL models and augmentation strategies is thorough.
The paper provides sufficient implementation details, including the training configuration, optimizer settings, and model architecture. The use of an open-source framework (WeDefense) for implementation enhances the reproducibility of the results. However, the lack of access to the dataset used for training may limit full reproducibility for external researchers.
While the approach shows promising results, the reliance on a single dataset may limit the generalizability of the findings. Additionally, the performance metrics primarily focus on EER, which may not capture all aspects of model performance, such as precision and recall in real-world applications.
The proposed methods have significant implications for the field of audio deepfake detection, particularly in applications related to security and content verification. By addressing the challenges of generalization to unseen generators, this work contributes to the ongoing efforts to develop robust anti-spoofing technologies in audio processing.
Objective speech quality assessment is central to telephony, VoIP, and streaming systems, where large volumes of degraded audio must be monitored and optimized at scale. Classical metrics such as PESQ and POLQA approximate human mean opinion scores (MOS) but require carefully controlled conditions and expensive listening tests, while learning-based models such as NISQA regress MOS and multiple perceptual dimensions from waveforms or spectrograms, achieving high correlation with subjective ratings yet remaining rigid: they do not support interactive, natural-language queries and do not natively provide textual rationales. In this work, we introduce SpeechQualityLLM, a multimodal speech quality question-answering (QA) system that couples an audio encoder with a language model and is trained on the NISQA corpus using template-based question-answer pairs covering overall MOS and four perceptual dimensions (noisiness, coloration, discontinuity, and loudness) in both single-ended (degraded only) and double-ended (degraded plus clean reference) setups. Instead of directly regressing scores, our system is supervised to generate textual answers from which numeric predictions are parsed and evaluated with standard regression and ranking metrics; on held-out NISQA clips, the double-ended model attains a MOS mean absolute error (MAE) of 0.41 with Pearson correlation of 0.86, with competitive performance on dimension-wise tasks. Beyond these quantitative gains, it offers a flexible natural-language interface in which the language model acts as an audio quality expert: practitioners can query arbitrary aspects of degradations, prompt the model to emulate different listener profiles to capture human variability and produce diverse but plausible judgments rather than a single deterministic score, and thereby reduce reliance on large-scale crowdsourced tests and their monetary cost.
Primary: UNC Chapel Hill
All Institutions: UNC Chapel Hill
The main contribution of this paper is the introduction of SpeechQualityLLM, a multimodal speech quality assessment system that leverages a language model to provide interactive, interpretable evaluations of audio quality. This work represents a significant advancement in the field by integrating natural language processing with audio quality assessment, offering a novel approach that enhances both flexibility and user engagement in evaluating speech quality.
The methodology presented in the paper is innovative as it combines audio encoding with a language model to create a multimodal QA system for speech quality assessment. The use of template-based question-answer pairs to train the model on the NISQA corpus is a thoughtful approach that allows for flexibility in querying and understanding audio quality metrics. The model's ability to generate textual answers rather than fixed scores enhances interpretability and user interaction, which is a significant advancement over traditional methods.
The experiments conducted on held-out NISQA clips demonstrate solid performance, with a mean absolute error of 0.41 and a Pearson correlation of 0.86, indicating a strong correlation with human ratings. The competitive performance on dimension-wise tasks further validates the effectiveness of the proposed system. However, the paper could benefit from a more extensive evaluation across different datasets and real-world scenarios to fully establish its robustness.
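As a concrete illustration of the evaluation protocol described in the abstract, the snippet below parses a numeric MOS from each generated answer and computes MAE and Pearson correlation against reference ratings; the regex and clipping range are assumptions rather than the authors' exact parser.

```python
# Illustrative scoring sketch: parse a numeric MOS from each generated answer
# and compute MAE and Pearson correlation against ground-truth ratings.
import re
import numpy as np
from scipy.stats import pearsonr

def parse_mos(answer: str, lo: float = 1.0, hi: float = 5.0):
    """Return the first number in the answer clipped to the MOS range, or None."""
    m = re.search(r"\d+(?:\.\d+)?", answer)
    return min(max(float(m.group()), lo), hi) if m else None

def score(answers, targets):
    preds = [parse_mos(a) for a in answers]
    pairs = [(p, t) for p, t in zip(preds, targets) if p is not None]
    p, t = map(np.array, zip(*pairs))
    return {"mae": float(np.abs(p - t).mean()),
            "pearson": float(pearsonr(p, t)[0]),
            "parse_rate": len(pairs) / len(answers)}
```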
The authors provide a GitHub repository with code, model weights, and experimental results, which is a positive aspect for reproducibility. However, the paper could enhance reproducibility by including more detailed descriptions of hyperparameters, training procedures, and evaluation metrics used in the experiments.
One limitation is the reliance on the NISQA corpus, which may not encompass all possible audio quality scenarios encountered in real-world applications. Additionally, while the model offers flexibility in querying, it may still struggle with edge cases or highly subjective audio quality assessments that require nuanced human judgment.
The potential applications of SpeechQualityLLM are significant, particularly in telephony, VoIP, and streaming services, where audio quality is paramount. By reducing reliance on costly human evaluations and enabling interactive queries, this system could streamline quality assessment processes and improve user experiences across various platforms.
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
Primary: Peking University
All Institutions: Peking University, University of Chinese Academy of Sciences
The main contribution of this paper is the introduction of BRACE, a benchmark for evaluating audio caption quality in a reference-free setting, which addresses critical gaps in the assessment of audio captioning metrics. This work significantly advances the field by providing a structured approach to evaluate model performance and identify areas for improvement, thereby fostering future research in audio-language understanding.
The paper introduces BRACE, a benchmark specifically designed for evaluating audio captioning metrics in a reference-free setting. It comprises two sub-benchmarks, BRACE-Main and BRACE-Hallucination, which assess fine-grained caption comparisons and hallucination detection, respectively. The methodology is robust, utilizing a combination of high-quality filtering, LLM-based corruption, and human annotation to construct datasets. The dual focus on both the quality of audio-caption alignment and the detection of hallucinations presents a comprehensive approach to addressing existing gaps in audio caption evaluation metrics. The use of diverse models and evaluation strategies enhances the credibility of the findings.
The experiments conducted on the BRACE benchmark reveal significant insights into the performance of CLAP-based ACEMs and LALMs. The results indicate that even the best-performing models struggle to achieve high scores, highlighting the challenges in audio caption evaluation. The evaluation metrics are well-defined, and the performance of various models is systematically compared, providing a clear understanding of their limitations. The rigorous testing across different model architectures adds depth to the experimental evaluation.
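A hedged sketch of how a pairwise caption-comparison benchmark of this kind could be scored with a CLAP-style metric: the scorer rates both candidate captions against the audio, predicts the higher-scoring one, and the predictions are compared to human labels with an F1 score. The `clap_score` callable and the macro-F1 choice are assumptions, not the paper's exact protocol.

```python
# Hedged sketch of a pairwise ACEM evaluation: a CLAP-style scorer rates each
# candidate caption against the audio, and F1 is computed against human labels.
# `clap_score(audio, caption) -> float` is assumed to be supplied by the user.
from sklearn.metrics import f1_score

def evaluate_acem(pairs, clap_score):
    """pairs: iterable of (audio, caption_a, caption_b, human_label),
    where human_label is 0 if caption_a is better, 1 if caption_b is better."""
    preds, labels = [], []
    for audio, cap_a, cap_b, label in pairs:
        preds.append(0 if clap_score(audio, cap_a) >= clap_score(audio, cap_b) else 1)
        labels.append(label)
    return f1_score(labels, preds, average="macro")
```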
The authors have taken steps to ensure reproducibility by providing access to the evaluation code and benchmark datasets. Detailed descriptions of the experimental configurations, including model settings and evaluation strategies, are included. However, the paper could benefit from more explicit instructions on how to replicate the experiments, particularly regarding the specific prompts and configurations used in LALM evaluations.
The paper acknowledges certain limitations, particularly regarding the performance of existing models on the benchmark. However, it could further elaborate on potential biases in the dataset construction process and the implications of using LLMs for generating and corrupting captions. Additionally, the computational constraints faced during experiments limit the ability to conduct extensive evaluations, which could affect the generalizability of the results.
The development of BRACE has significant implications for the field of audio understanding, particularly in enhancing accessibility and content indexing. By providing a reliable benchmark for evaluating audio captioning metrics, it can drive improvements in model development and evaluation practices. However, the potential for misuse of audio captioning technologies, such as generating misleading or inaccurate captions, should be considered, and appropriate safeguards should be discussed.
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture and a visual-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania
The main contribution of this paper is the introduction of AV-Twin, a system that allows for the creation of editable audio-visual digital twins using smartphones, combining innovative acoustic modeling with user-friendly interfaces. This work represents a significant step forward in the integration of audio and visual data for realistic digital environments, with implications for multiple industries.
The methodology presented in this paper is innovative, combining mobile room impulse response (RIR) capture with a visual-assisted acoustic field model. The use of commodity smartphones for constructing audio-visual digital twins is a significant advancement, as it democratizes access to advanced acoustic modeling techniques. The differentiable acoustic rendering for recovering surface material properties is a notable technical contribution, allowing for real-time modifications and updates to both audio and visual components. However, the paper could benefit from a more detailed explanation of the underlying algorithms and their computational efficiency.
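The differentiable acoustic renderer itself is not described in enough detail here to reproduce; the toy sketch below only illustrates the underlying fitting idea, recovering per-surface absorption coefficients by gradient descent so that a simplified Sabine-style RT60 prediction matches a measured value. The model, surface list, and target are illustrative assumptions, and a single RT60 constraint is of course underdetermined compared with matching full room impulse responses.

```python
# Toy sketch (not the AV-Twin renderer): recover per-surface absorption
# coefficients by gradient descent so a Sabine-style RT60 prediction matches a
# measured reverberation time; all names and values are illustrative assumptions.
import torch

def fit_absorption(surface_areas, room_volume, rt60_measured, steps=2000, lr=0.05):
    areas = torch.tensor(surface_areas, dtype=torch.float32)
    logit_alpha = torch.zeros_like(areas, requires_grad=True)    # unconstrained params
    opt = torch.optim.Adam([logit_alpha], lr=lr)
    target = torch.tensor(float(rt60_measured))
    for _ in range(steps):
        alpha = torch.sigmoid(logit_alpha)                       # keep in (0, 1)
        rt60_pred = 0.161 * room_volume / (areas * alpha).sum()  # Sabine formula
        loss = (rt60_pred - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(logit_alpha).detach()

# Example: four walls, floor, and ceiling of a 5 x 4 x 3 m room, measured RT60 of 0.6 s.
alphas = fit_absorption([15.0, 15.0, 12.0, 12.0, 20.0, 20.0], 60.0, 0.6)
```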
The experimental evaluation is thorough, showcasing the effectiveness of the AV-Twin system in various scenarios. The authors provide quantitative metrics for the accuracy of acoustic reconstructions and the fidelity of the visual outputs. However, the datasets used for evaluation are not extensively described, which raises questions about the generalizability of the results. More diverse environments and material types could enhance the robustness of the findings.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the authors mention the use of smartphones, they do not provide specifics on the hardware or software configurations used in their experiments. Additionally, the absence of a public code repository or demo URL limits the ability of other researchers to validate the findings independently.
One limitation of the study is the reliance on commodity smartphones, which may introduce variability in the quality of the captured data. Furthermore, the system's performance may be constrained by the physical limitations of the devices used, such as microphone sensitivity and processing power. The paper also does not address potential challenges in real-world applications, such as varying environmental conditions and user expertise.
The potential applications of AV-Twin are vast, ranging from virtual reality environments to architectural design and acoustic engineering. By enabling users to create and modify audio-visual digital twins easily, this work could significantly enhance user interaction and experience in various fields. The approach could also inspire further research into integrating acoustics with other sensory modalities in digital twin technologies.
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models, a major class of modern generative music models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) by constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo; samples are provided on our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MR-FlowDPO, a novel framework that enhances flow-matching-based music generation through Direct Preference Optimization with multiple musical rewards, significantly improving alignment with human preferences. This work represents a meaningful advancement in music generation, combining innovative methodologies with practical applications, although it could benefit from clearer experimental details and a deeper exploration of limitations.
The methodology presented in MR-FlowDPO is innovative, leveraging Direct Preference Optimization (DPO) to align music generation with human preferences. The approach of using multiple musical rewards to evaluate text alignment, audio production quality, and semantic consistency is well-structured. The integration of scalable off-the-shelf models for reward prediction is a practical choice that enhances the model's applicability. However, the paper could benefit from a more detailed explanation of the scoring mechanism and how it specifically improves rhythmic stability.
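To ground the preference-construction step, the sketch below combines multiple reward scores into a scalar, selects chosen/rejected pairs, and evaluates the standard DPO objective; the reward weighting and the adaptation of DPO to flow matching used in MR-FlowDPO are not reproduced, and the names are assumptions.

```python
# Sketch: build preference pairs from multiple reward scores and compute the
# standard DPO loss; the flow-matching-specific adaptation in the paper is not
# reproduced here, and the weighting scheme below is an assumption.
import torch
import torch.nn.functional as F

def build_preference_pairs(candidates, rewards, weights=(1.0, 1.0, 1.0)):
    """candidates: generated clips for one prompt; rewards: per-clip tuples of
    (text_alignment, audio_quality, semantic_consistency) scores."""
    scores = [sum(w * r for w, r in zip(weights, rs)) for rs in rewards]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return candidates[order[0]], candidates[order[-1]]   # (chosen, rejected)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on log-likelihoods under policy and reference models."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```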
The experiments conducted are extensive, utilizing both objective metrics and human evaluations to assess the effectiveness of the proposed model. The results indicate a significant improvement over competitive baselines, which strengthens the claims made in the paper. However, the paper lacks a detailed description of the datasets used, which is crucial for understanding the generalizability of the findings.
The authors provide links to their code and demo page, which is a positive aspect for reproducibility. However, the paper does not sufficiently detail the experimental setup, including hyperparameters and training procedures, which may hinder full reproducibility by other researchers.
One limitation is the potential subjectivity in human evaluations, which can vary widely among individuals. Additionally, the reliance on off-the-shelf models for reward prediction may introduce biases based on the limitations of those models. The paper could also explore the scalability of the approach in real-world applications beyond the experimental settings.
The implications of this research are significant for the field of music generation, as it addresses the subjective nature of music evaluation and aims to create models that better align with human preferences. This could lead to more personalized music generation applications, enhancing user experience in various domains such as entertainment and therapy.
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text overlap. Experiments on Clotho v2 demonstrate that confidence-guided beam search with semantic evaluation achieves dramatically improved calibration (CLAP-based ECE of 0.071) compared to greedy decoding baselines (ECE of 0.488), while simultaneously improving caption quality across standard metrics. Our results establish that semantic similarity provides a more meaningful foundation for confidence calibration in audio captioning than traditional n-gram metrics.
Primary: Northeastern University
All Institutions: Northeastern University
The paper presents a framework for confidence-calibrated audio captioning that redefines correctness through semantic similarity. The contributions are significant, as they advance the state of the art in audio captioning by addressing overconfidence and improving the reliability of model predictions through innovative methodologies.
The paper introduces a novel framework for confidence calibration in automated audio captioning that integrates a learned confidence prediction head with a Whisper-based model. This approach is innovative as it shifts the focus from traditional n-gram overlap metrics to semantic similarity for evaluating correctness, which is a significant advancement in the field. The architecture is well-defined, with clear descriptions of the confidence prediction head, temperature scaling, and confidence-guided beam search. The methodology is robust and addresses existing limitations in audio captioning systems effectively.
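A minimal sketch of the semantic calibration metric described above, assuming correctness is defined as a CLAP audio-text similarity above a fixed threshold and ECE is computed over equal-width confidence bins; the threshold and bin count are illustrative, not the paper's settings.

```python
# Sketch of semantic-similarity-based calibration: a caption counts as "correct"
# if its CLAP audio-text similarity clears a threshold, and ECE is computed over
# confidence bins. Threshold and bin count are illustrative assumptions.
import numpy as np

def semantic_ece(confidences, clap_similarities, threshold=0.5, n_bins=10):
    conf = np.asarray(confidences, dtype=float)
    correct = (np.asarray(clap_similarities, dtype=float) >= threshold).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        if i == 0:
            mask = (conf >= lo) & (conf <= hi)   # first bin includes its lower edge
        else:
            mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # bin weight times |average confidence - empirical accuracy|
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```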
The experiments conducted on the Clotho v2 dataset are comprehensive, demonstrating substantial improvements in both calibration and caption quality metrics. The results are compelling, with a dramatic reduction in Expected Calibration Error (ECE) from 0.488 to 0.071, showcasing the effectiveness of the proposed method. Additionally, the paper provides quantitative results across multiple evaluation metrics (BLEU, CIDEr, CLAP similarity), which strengthens the validity of the findings.
The implementation details are adequately described, including the model architecture, training parameters, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits reproducibility. Future work should consider making the code accessible to facilitate validation of results by the research community.
The paper acknowledges several limitations, including the somewhat arbitrary threshold for semantic correctness and the evaluation being limited to the Clotho dataset. The authors also note that the confidence head may not capture all sources of uncertainty, suggesting areas for future exploration. These limitations are important to consider for the generalization of the findings.
The proposed framework has significant implications for real-world applications of automated audio captioning, particularly in accessibility technologies and content indexing. By improving the reliability of predictions, this work could enhance user trust in automated systems, leading to broader adoption in various domains.
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline of frozen Whisper encoder features, time-frequency pooling, and label-free PCA yields strong zero-shot performance. However, VocSim also uncovers a consistent generalization gap. On blind, low-resource speech, local retrieval drops sharply. While performance remains statistically distinguishable from chance, the absolute geometric structure collapses, indicating a failure to generalize to unseen phonotactics. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We posit that the intrinsic geometric quality measured here proxies utility in unlisted downstream applications. We release data, code, and a public leaderboard to standardize the evaluation of intrinsic audio geometry.
Primary: Institute of Neuroinformatics, University of Zurich and ETH Zurich
All Institutions: Institute of Neuroinformatics, University of Zurich and ETH Zurich, Institute for the Interdisciplinary Study of Language Evolution, University of Zurich
The paper introduces VocSim, a training-free benchmark for evaluating zero-shot content identity in audio representations, significantly contributing to the field by providing a rigorous framework for assessing the intrinsic quality of audio embeddings. The comprehensive methodology and experimental validation enhance its relevance and potential impact on future research in audio processing and machine learning.
The authors present a novel benchmark, VocSim, designed to evaluate the intrinsic geometric alignment of audio embeddings in a zero-shot setting. The methodology is robust, employing a large dataset of 125k single-source audio clips aggregated from diverse corpora, which allows for rigorous testing of generalization capabilities. The use of training-free metrics, such as Precision@k and Global Separation Rate (GSR), is innovative and addresses the limitations of existing benchmarks that rely on supervised learning paradigms. The transductive PCA approach to mitigate anisotropy in embedding spaces is a thoughtful addition that enhances the evaluation process.
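The retrieval-style metrics can be illustrated with the short sketch below: Precision@k over cosine nearest neighbours and a lift computed against an empirical label-permutation baseline. The GSR metric is not reproduced, and the neighbour count and permutation count are assumptions.

```python
# Sketch of the retrieval-style evaluation: Precision@k over cosine neighbours
# and lift relative to an empirical label-permutation baseline. The GSR metric
# is not reproduced; this is an illustrative assumption of the protocol.
import numpy as np

def precision_at_k(embeddings, labels, k=5):
    X = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                  # exclude each clip itself
    nn = np.argsort(-sims, axis=1)[:, :k]            # top-k nearest neighbours
    labels = np.asarray(labels)
    return (labels[nn] == labels[:, None]).mean()

def lift_over_permutation(embeddings, labels, k=5, n_perm=20, seed=0):
    rng = np.random.default_rng(seed)
    observed = precision_at_k(embeddings, labels, k)
    baseline = np.mean([precision_at_k(embeddings, rng.permutation(labels), k)
                        for _ in range(n_perm)])
    return observed / max(baseline, 1e-12)
```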
The experiments are comprehensive, evaluating multiple foundation models and providing a detailed analysis of their performance across different audio domains. The results reveal significant insights into the generalization capabilities of these models, particularly highlighting a notable performance drop on blind, low-resource speech datasets. The external validation of the embeddings' utility in predicting avian perceptual similarity and achieving state-of-the-art results on the HEAR benchmark further underscores the practical implications of the findings.
The paper includes a clear description of the experimental setup, data preprocessing, and evaluation metrics, which facilitates reproducibility. The authors have made their code and dataset publicly available, enhancing the transparency and accessibility of their research. However, the reliance on transductive PCA may complicate strict adherence to zero-shot evaluation protocols, which could be a point of contention for some researchers.
The primary limitation noted is the generalization gap observed in low-resource speech, indicating that current models may not generalize well across different phonotactics. Additionally, the benchmark's focus on single-source audio excludes polyphonic scenarios, which may limit its applicability in real-world contexts. The ethical considerations regarding data sovereignty and the handling of indigenous language data are also acknowledged, which is commendable but reflects the complexities involved in such research.
The implications of this research are significant, particularly in addressing biases in low-resource languages and the potential perpetuation of digital divides in audio processing technologies. By highlighting the performance disparities of state-of-the-art models on underrepresented languages, the authors advocate for more equitable advancements in machine learning applications. The benchmark also serves as a tool for future research to improve audio representation models, fostering advancements in bioacoustics and environmental sound classification.
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China
The paper presents DMP-TTS, a controllable TTS framework that effectively disentangles speaker timbre and speaking style through innovative methodologies. The contributions are significant, addressing key challenges in the field and providing a solid foundation for future research and applications in controllable speech synthesis.
The paper introduces DMP-TTS, a novel framework for controllable TTS that utilizes a latent Diffusion Transformer architecture, which is innovative in its approach to disentangling speaker timbre and speaking style. The use of a CLAP-based style encoder (Style-CLAP) for aligning audio and text cues is a significant methodological advancement. Additionally, the introduction of chained classifier-free guidance (cCFG) allows for independent control of multiple attributes, which is a notable improvement over existing methods. The methodology is well-structured, with clear explanations of the components and their interactions, although some technical details could benefit from further elaboration.
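Since the exact cCFG formulation is not given in this summary, the following is only a plausible sketch of chained classifier-free guidance at inference time: three guidance scales act on successively richer condition sets (content, content+timbre, content+timbre+style), mirroring the hierarchical condition dropout used in training. The model signature and scale values are assumptions.

```python
# Plausible sketch of chained classifier-free guidance (cCFG) at inference:
# three guidance scales act on successive condition sets. The exact formulation
# in DMP-TTS may differ; the model signature and defaults are assumptions.
import torch

def chained_cfg(model, x_t, t, content, timbre, style, w_c=3.0, w_t=1.5, w_s=2.0):
    """model(x_t, t, content, timbre, style) -> predicted velocity/noise;
    passing None for a condition stands in for its dropout token."""
    v_null = model(x_t, t, None, None, None)
    v_c    = model(x_t, t, content, None, None)
    v_ct   = model(x_t, t, content, timbre, None)
    v_cts  = model(x_t, t, content, timbre, style)
    return (v_null
            + w_c * (v_c - v_null)      # strengthen content adherence
            + w_t * (v_ct - v_c)        # strengthen timbre adherence
            + w_s * (v_cts - v_ct))     # strengthen style adherence
```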
The experiments conducted are thorough, utilizing a high-quality dataset of approximately 300 hours of Chinese speech. The evaluation metrics are appropriate, including both objective measures (WER, speaker similarity) and subjective measures (NMOS, QMOS). The results demonstrate that DMP-TTS outperforms existing baselines in terms of style controllability while maintaining competitive intelligibility and naturalness. However, the paper could have benefited from a more detailed discussion of the statistical significance of the results.
The implementation details are provided, including the architecture, training procedure, and hyperparameters, which enhance reproducibility. The authors mention that code and demos will be available, which is essential for the community to validate and build upon their work. However, the absence of a direct link to a code repository limits immediate access to the implementation.
While the paper presents a strong framework, it does not address potential limitations in terms of scalability to larger datasets or multilingual capabilities. Additionally, the reliance on a specific pre-trained model (Whisper) for representation alignment may limit the generalizability of the approach. The paper also notes that speaker similarity metrics are lower than some baselines, suggesting room for improvement in that area.
The advancements in controllable TTS have significant implications for applications in virtual assistants, audiobooks, and any domain requiring personalized speech synthesis. The ability to independently manipulate style and timbre enhances user experience and could lead to more engaging human-computer interactions. The work may also inspire further research into disentangled representations in other domains of machine learning.
Real-time speech communication over wireless networks remains challenging, as conventional channel protection mechanisms cannot effectively counter packet loss under stringent bandwidth and latency constraints. Semantic communication has emerged as a promising paradigm for enhancing the robustness of speech transmission by means of joint source-channel coding (JSCC). However, its cross-layer design hinders practical deployment due to incompatibility with existing digital communication systems. As a result, the robustness of speech communication is evaluated primarily by its error resilience to packet loss over wireless networks. To address these challenges, we propose Glaris, a generative latent-prior-based resilient speech semantic communication framework that performs resilient speech coding in the generative latent space. Generative latent priors enable high-quality packet loss concealment (PLC) at the receiver side, balancing semantic consistency and reconstruction fidelity. Additionally, an integrated error resilience mechanism is designed to mitigate error propagation and improve the effectiveness of PLC. Compared with traditional packet-level forward error correction (FEC) strategies, our new method achieves enhanced robustness over dynamic wireless networks while significantly reducing redundancy overhead. Experimental results on the LibriSpeech dataset demonstrate that Glaris consistently outperforms existing error-resilient codecs, achieving JSCC-level robustness while maintaining seamless compatibility with existing systems, and it also strikes a favorable balance between transmission efficiency and speech reconstruction quality.
Primary: Guilin University of Electronic Technology
All Institutions: Guilin University of Electronic Technology, Beijing University of Posts and Telecommunications
The main contribution of this paper is the introduction of Glaris, a generative latent-prior-based framework for resilient speech communication that effectively balances semantic consistency and reconstruction fidelity in the presence of packet loss. The comprehensive methodology and experimental validation demonstrate its potential to advance the field of error-resilient communication systems.
The paper introduces Glaris, a novel framework for error-resilient semantic communication in speech transmission over packet-loss networks. It employs a two-stage coding architecture utilizing generative latent priors to enhance both semantic consistency and reconstruction fidelity. The integration of in-band forward error correction (FEC) with packet loss concealment (PLC) is a significant methodological advancement, as it allows for adaptive redundancy control and improved resilience against dynamic channel conditions. The use of a VQ-VAE for encoding high-dimensional speech into a compact latent representation is well-justified and effectively addresses the challenges of maintaining quality under packet loss.
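As a rough illustration of how in-band redundancy and latent-space concealment can be combined at the receiver, the sketch below assumes each packet carries the current frame's latent code plus a redundant copy of the previous frame's code, and falls back to a generative prior when both are lost; this is a minimal sketch under assumed packet and frame structures, not the actual Glaris scheme.

```python
from typing import Callable, Optional, Sequence

def conceal_latents(
    packets: Sequence[Optional[dict]],       # None marks a lost packet
    prior_predict: Callable[[list], list],   # stand-in for a generative latent prior
) -> list:
    """Recover one latent code per frame despite packet loss.

    Assumed (not from the paper): packet i holds {"code": frame i,
    "redundant": copy of frame i-1}, so the in-band FEC copy of a lost
    frame lives in the following packet.
    """
    recovered = []
    for i, pkt in enumerate(packets):
        if pkt is not None:                                    # payload arrived
            code = pkt["code"]
        elif i + 1 < len(packets) and packets[i + 1] is not None:
            code = packets[i + 1]["redundant"]                 # in-band FEC copy
        else:
            code = prior_predict(recovered)                    # PLC via the latent prior
        recovered.append(code)
    return recovered
```

The point of the sketch is only the fallback order (exact payload, then in-band copy, then prior-based prediction); note that reading the next packet's redundant copy implies one extra packet of receiver-side delay.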
The experimental evaluation is thorough, utilizing the LibriSpeech dataset under various packet-loss conditions, including both simulated and real-world scenarios. The results demonstrate that Glaris outperforms existing codecs in terms of robustness and efficiency. The use of multiple metrics (PESQ, STOI, WER, and MOS) provides a comprehensive assessment of performance, and the subjective listening tests further validate the framework's effectiveness. However, the paper could benefit from a more detailed comparison with additional state-of-the-art methods to strengthen the claims of superiority.
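Bursty loss of the kind evaluated here is often simulated with a two-state Gilbert-Elliott model; the sketch below is a generic trace generator with illustrative parameters, not the loss model used in the paper.

```python
import random

def gilbert_elliott_trace(n_packets: int, p_gb: float = 0.05, p_bg: float = 0.3,
                          loss_good: float = 0.0, loss_bad: float = 0.7,
                          seed: int = 0) -> list:
    """Return a list of booleans (True = packet lost) from a two-state Markov model.

    p_gb: probability of moving from the Good to the Bad state per packet.
    p_bg: probability of moving from the Bad state back to Good.
    loss_good / loss_bad: per-packet loss probability within each state.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    state_bad = False
    trace = []
    for _ in range(n_packets):
        loss_prob = loss_bad if state_bad else loss_good
        trace.append(rng.random() < loss_prob)
        if state_bad:
            state_bad = rng.random() >= p_bg   # stay Bad unless we flip back to Good
        else:
            state_bad = rng.random() < p_gb    # occasionally enter a bursty Bad state
    return trace
```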
The paper provides a detailed methodology and experimental setup, including training parameters and loss functions, which enhances reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for others to replicate the results directly. Including such resources would significantly improve the paper's reproducibility.
One limitation is the potential trade-off between redundancy and compression efficiency, particularly in low-bitrate scenarios where excessive redundancy may degrade quality. Additionally, while the framework shows promise, its performance in highly variable real-world conditions remains to be fully validated. The reliance on specific datasets may also limit the generalizability of the findings.
The proposed framework has significant implications for real-time speech communication applications, particularly in scenarios where packet loss is common, such as VoIP and online meetings. By improving the robustness and efficiency of speech transmission, Glaris could enhance user experience in various latency-sensitive applications. The integration of semantic communication principles may also pave the way for future advancements in other modalities, such as video and text transmission.
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage~1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage~2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5~Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
Primary: New York University
All Institutions: New York University, University of California, Berkeley, University of Southern California, Stanford University, Facebook AI Research
The paper presents a novel two-stage self-supervised framework that integrates JEPA with DAAM for efficient speech representation learning, showcasing significant advancements in methodology and practical applications. The technical contributions are well-articulated, and the results demonstrate the potential for impactful applications in the field of audio processing.
The proposed methodology introduces a two-stage self-supervised learning framework that effectively decouples representation learning from reconstruction, which is a significant advancement in the field of speech representation. By employing the Joint-Embedding Predictive Architecture (JEPA) in conjunction with a Density Adaptive Attention Mechanism (DAAM), the authors demonstrate a novel approach to learning semantic audio features through masked prediction. The integration of Gaussian mixture-based gating enhances the model's ability to adaptively select informative features, which is a notable contribution to the efficiency and effectiveness of speech representation learning.
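To make the tokenization stage concrete, the following sketch shows finite scalar quantization of a latent frame followed by mixed-radix packing of the per-dimension indices into a single integer token; the dimensionality and level counts are illustrative assumptions rather than the configuration reported in the paper.

```python
import numpy as np

def fsq_quantize(z: np.ndarray, levels: list) -> np.ndarray:
    """Finite scalar quantization: bound each latent dimension and round it
    onto a small per-dimension grid. levels[d] is the number of grid points
    for dimension d (values chosen here purely for illustration)."""
    idx = np.empty_like(z, dtype=np.int64)
    for d, L in enumerate(levels):
        bounded = np.tanh(z[..., d])                               # squash to (-1, 1)
        idx[..., d] = np.round((bounded + 1) / 2 * (L - 1)).astype(np.int64)
    return idx                                                      # per-dimension code indices

def mixed_radix_pack(idx: np.ndarray, levels: list) -> np.ndarray:
    """Pack per-dimension indices into one integer token, using the level
    counts as a mixed radix; the token vocabulary has prod(levels) entries."""
    token = np.zeros(idx.shape[:-1], dtype=np.int64)
    for d, L in enumerate(levels):
        token = token * L + idx[..., d]
    return token

# Usage: 6 latent dimensions quantized to 5 levels each -> 5**6 = 15625 tokens.
levels = [5, 5, 5, 5, 5, 5]
z = np.random.randn(4, 6)             # 4 frames, 6 latent dims (illustrative shapes)
tokens = mixed_radix_pack(fsq_quantize(z, levels), levels)
```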
The experimental setup is robust, utilizing a large-scale dataset (LibriLight) for training, which is crucial for self-supervised learning tasks. The paper outlines a clear evaluation strategy, comparing the proposed method against established baselines, including a JEPA baseline without DAAM and WavLM-Large. The results indicate that the proposed approach not only achieves competitive performance but also excels in efficiency, providing a strong case for its practical applicability in real-world scenarios.
The authors provide a comprehensive description of the implementation details, including hyperparameters, training procedures, and the architecture of the model. The availability of the complete implementation on GitHub enhances reproducibility, allowing other researchers to validate and build upon their work. However, the paper could benefit from additional details on the training environment and computational resources used.
The paper acknowledges several limitations, including a fixed masking strategy that may not adapt optimally to varying speech rates and a focus on English speech, which limits generalizability. Furthermore, the authors note that the scale of data used for pretraining is relatively modest compared to other self-supervised systems, which may affect the robustness of their findings.
The implications of this research are significant, particularly in the context of speech processing applications such as automatic speech recognition, text-to-speech systems, and voice conversion. The ability to learn efficient and robust speech representations has the potential to enhance the performance of various downstream tasks and could lead to advancements in human-computer interaction technologies.
Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. Based on this dataset, we define the API tracing task, enabling fine-grained attribution of spoofed audio to its generation source. We further propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Experiments show that Nes2Net-LA achieves state-of-the-art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code\footnote{https://github.com/XuepingZhang/MultiAPI-Spoof} and dataset\footnote{https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/} have been released.
Primary: Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems
All Institutions: Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Digital Innovation Research Center
The paper presents a comprehensive approach to enhancing speech anti-spoofing detection through the introduction of a diverse dataset and a novel model architecture. This work is a meaningful contribution to the field, addressing real-world challenges and setting a foundation for future research in audio security.
The paper introduces a novel dataset, MultiAPI Spoof, which significantly expands the scope of audio anti-spoofing research by incorporating 230 hours of synthetic speech from 30 different APIs. This approach addresses the limitations of existing datasets that primarily focus on a narrow set of models, thereby enhancing the realism and applicability of anti-spoofing detection methods. The proposed Nes2Net-LA model, which integrates local attention mechanisms into the existing Nes2Net architecture, is a meaningful advancement that improves local context modeling and feature extraction. The methodology is well-structured, with clear definitions of tasks and systematic enhancements to existing models.
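For readers unfamiliar with the local-attention idea, a minimal single-head sketch is given below; it conveys only the windowed-attention restriction and omits the projections, multi-head structure, and nested residual paths that an actual Nes2Net-LA block would involve.

```python
import numpy as np

def local_self_attention(x: np.ndarray, window: int = 8) -> np.ndarray:
    """Single-head scaled dot-product attention restricted to a local window.

    x: (T, D) sequence of frame features. Each frame attends only to frames
    within +/- `window` positions, which is the generic idea behind local
    attention; shapes and the window size are illustrative assumptions.
    """
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)                        # (T, T) similarity scores
    pos = np.arange(T)
    mask = np.abs(pos[:, None] - pos[None, :]) > window  # True = outside the window
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                                   # locally mixed features
```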
The experiments conducted are comprehensive, utilizing multiple datasets to evaluate the performance of the proposed methods. The results demonstrate significant improvements in anti-spoofing detection and API tracing tasks, particularly in unseen scenarios. The paper provides detailed metrics, including EER, minDCF, and actDCF, which are critical for evaluating the effectiveness of anti-spoofing systems. However, the reliance on a single language (English) and the absence of real-world data in the testing phase could limit the generalizability of the findings.
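For reference, EER can be approximated from detection scores as sketched below (the label convention and toy values are assumptions; minDCF and actDCF additionally weight false acceptances and false rejections by application-dependent priors and costs).

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """Approximate the EER of a spoofing detector on the ROC grid.

    labels: 1 for bona fide, 0 for spoofed (a labeling convention assumed here).
    scores: higher means the detector believes the utterance is bona fide.
    The EER is the operating point where false acceptance and false rejection
    rates coincide.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.nanargmin(np.abs(fnr - fpr)))   # closest crossing point
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage with made-up scores (illustrative only):
labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.5, 0.2, 0.1])
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```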
The authors have made their code and dataset publicly available, which is a positive aspect for reproducibility. However, the paper could benefit from more detailed descriptions of the experimental setup, including hyperparameter settings and model training specifics, to facilitate easier replication by other researchers.
One limitation is the focus on English audio, which may restrict the applicability of the findings to other languages and dialects. Additionally, while the local attention mechanism shows promise, the paper acknowledges challenges in generalizing to unseen APIs, indicating that further research is needed to improve model robustness in this area.
The implications of this research are significant, as it addresses a critical security concern in the growing field of synthetic speech generation. By improving anti-spoofing detection methods, the work contributes to the development of more secure voice authentication systems, which are increasingly important in various applications, including finance, telecommunications, and personal security.
Single-channel audio separation aims to separate individual sources from a single-channel mixture. Most existing methods rely on supervised learning with synthetically generated paired data. However, obtaining high-quality paired data in real-world scenarios is often difficult. This data scarcity can degrade model performance under unseen conditions and limit generalization ability. To this end, we approach the problem from an unsupervised perspective, framing it as a probabilistic inverse problem. Our method requires only diffusion priors trained on individual sources. Separation is then achieved by iteratively guiding an initial state toward the solution through reconstruction guidance. Importantly, we introduce an advanced inverse problem solver specifically designed for separation, which mitigates gradient conflicts caused by interference between the diffusion prior and reconstruction guidance during inverse denoising. This design ensures high-quality and balanced separation performance across individual sources. Additionally, we find that initializing the denoising process with an augmented mixture instead of pure Gaussian noise provides an informative starting point that significantly improves the final performance. To further enhance audio prior modeling, we design a novel time-frequency attention-based network architecture that demonstrates strong audio modeling capability. Collectively, these improvements lead to significant performance gains, as validated across speech-sound event, sound event, and speech separation tasks.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of an unsupervised framework for single-channel audio separation using diffusion models, which significantly improves separation quality while addressing the challenges posed by data scarcity. The comprehensive methodology and strong experimental validation position this work as a notable advancement in the field of audio processing.
The paper introduces an innovative unsupervised approach to single-channel audio separation, framing separation as a probabilistic inverse problem solved with pretrained diffusion priors. The methodology is well-structured, addressing the limitations of traditional supervised methods by utilizing unpaired data and introducing a hybrid gradient guidance schedule to mitigate gradient conflicts. The use of a noise-augmented mixture initialization is particularly noteworthy, as it provides a more informative starting point for the separation process. The design of the time-frequency attention-based network architecture is also a significant enhancement, allowing for better modeling of audio data.
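In generic notation, reconstruction guidance nudges each source's reverse-diffusion update toward mixture consistency; the symbols and the step-size schedule $\lambda_t$ below are assumptions, and the paper's hybrid schedule for resolving gradient conflicts is more involved:
$$
x_{t-1}^{(k)} = \mu_\theta\!\left(x_t^{(k)}, t\right) - \lambda_t\,\nabla_{x_t^{(k)}}\Bigl\|\, y - \sum_{j}\hat{x}_0\!\left(x_t^{(j)}, t\right)\Bigr\|_2^2 + \sigma_t\,\epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I),
$$
where $y$ is the observed mixture, $\mu_\theta$ the denoising mean of the pretrained prior, $\hat{x}_0(\cdot)$ its estimate of the clean source at the current step, and $\sigma_t$ the usual ancestral-sampling noise scale.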
The experiments are comprehensive, covering multiple audio separation tasks and utilizing relevant datasets such as VCTK and FSD-Kaggle2018. The results demonstrate significant performance improvements over existing methods, including both supervised and unsupervised approaches. The paper effectively compares its method against strong baselines, providing clear metrics (SI-SDR, PESQ) that validate the proposed approach's efficacy. The ablation studies further strengthen the findings by isolating the contributions of various components of the proposed methodology.
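Of the reported metrics, SI-SDR follows the standard scale-invariant definition, with $s$ the reference source and $\hat{s}$ the estimate, both mean-centered:
$$
\operatorname{SI\text{-}SDR}(\hat{s}, s) = 10\log_{10}\frac{\|\alpha s\|^2}{\|\hat{s} - \alpha s\|^2},
\qquad
\alpha = \frac{\hat{s}^{\top} s}{\|s\|^2}.
$$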
The paper provides sufficient implementation details, including model architecture, training configurations, and hyperparameters, which facilitate reproducibility. However, the lack of a publicly available code repository limits the ease of reproduction for other researchers. The detailed description of the experimental setup and datasets used enhances the reproducibility of the results.
While the proposed method shows promising results, it may still struggle with highly overlapping sources, particularly in the case of speech separation where speaker identity consistency is critical. The reliance on diffusion models may also introduce computational overhead, which could be a concern for real-time applications. Additionally, the paper does not address the scalability of the approach to larger datasets or more complex audio environments.
This research has significant implications for various applications in audio processing, including music separation, speech enhancement, and sound event detection. The unsupervised nature of the proposed method makes it particularly valuable in scenarios where labeled data is scarce or difficult to obtain. By improving the quality of audio separation, the work could enhance user experiences in consumer audio applications, assistive technologies, and content creation.