Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question ``who speaks what''. By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve a speech classification accuracy of up to $0.99$ with several participants in a meeting room. In addition, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
Primary: Florida State University
All Institutions: Florida State University, University of Tennessee at Chattanooga
The paper presents a novel attack that enables remote eavesdropping on in-person conversations via mmWave sensing. The comprehensive analysis of the technical contribution, methodology, and significance to the field underscores the potential for both innovative applications and ethical considerations in machine learning and privacy.
The paper presents an innovative unsupervised approach to eavesdropping on in-person conversations using mmWave sensing technology. The methodology is well-structured, addressing significant challenges such as low resolution of vibration signals and interference from static objects. The authors propose a multi-module system that includes a speech-aware calibration scheme, a noise-robust signal processing pipeline, and a deep learning framework for signal enhancement. The unsupervised clustering method for speaker distinction is particularly noteworthy as it allows for effective speaker attribution without prior knowledge of the number of speakers or their identities.
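To make the speaker-distinction step concrete, the following is a minimal, generic sketch of unsupervised clustering over frequency-domain vibration features, with the cluster count chosen by silhouette score. It is not the authors' pipeline; the sampling rate, feature choice, and cluster-selection heuristic are all assumptions for illustration.

```python
# Hypothetical sketch: cluster speech segments by their frequency-domain
# vibration signatures, estimating the number of speakers via silhouette score.
import numpy as np
from scipy.signal import stft
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def spectral_features(vibration_segments, fs=1000, nperseg=256):
    """Mean magnitude spectrum per segment (segments assumed pre-extracted, e.g. by VAD)."""
    feats = []
    for seg in vibration_segments:
        _, _, Z = stft(seg, fs=fs, nperseg=nperseg)
        feats.append(np.abs(Z).mean(axis=1))  # average over time frames
    return np.stack(feats)

def cluster_speakers(features, max_speakers=6):
    """Pick the cluster count (>= 2) with the best silhouette score."""
    best_k, best_score, best_labels = 2, -1.0, None
    for k in range(2, max_speakers + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        score = silhouette_score(features, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels  # estimated speaker count, segment-to-speaker assignments
```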
The experimental validation is extensive, demonstrating the effectiveness of the proposed attack in various real-world scenarios, including different object materials and speaker arrangements. The reported success rates for speaker distinction (up to 0.99) and consistent signal enhancement across setups provide strong evidence of the method's robustness. However, the reliance on synthetic datasets for training may raise questions about the generalizability of the results.
While the paper provides detailed descriptions of the experimental setup and methodology, there is a lack of publicly available code or datasets, which could hinder reproducibility. The authors mention using a specific radar system and a synthetic dataset, but without access to these resources, independent verification of results may be challenging.
The study does not address potential ethical concerns associated with the proposed eavesdropping technique. Additionally, while the method shows promise in controlled environments, its effectiveness in more complex, real-world scenarios with varying noise levels and object types remains uncertain. The performance may degrade significantly in less ideal conditions, which is not thoroughly explored in the paper.
The implications of this research are significant, as it highlights potential privacy risks associated with passive sensing technologies in shared environments. The ability to eavesdrop on sensitive conversations raises ethical concerns regarding surveillance and data privacy. This work could inform future regulations and security measures to protect against such vulnerabilities in various settings, including corporate and healthcare environments.
Digital twins today are almost entirely visual, overlooking acoustics, a core component of spatial realism and interaction. We introduce AV-Twin, the first practical system that constructs editable audio-visual digital twins using only commodity smartphones. AV-Twin combines mobile RIR capture and a visual-assisted acoustic field model to efficiently reconstruct room acoustics. It further recovers per-surface material properties through differentiable acoustic rendering, enabling users to modify materials, geometry, and layout while automatically updating both audio and visuals. Together, these capabilities establish a practical path toward fully modifiable audio-visual digital twins for real-world environments.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania
The main contribution of this paper is the introduction of AV-Twin, a system that allows for the creation of editable audio-visual digital twins using smartphones, combining innovative acoustic modeling with user-friendly interfaces. This work represents a significant step forward in the integration of audio and visual data for realistic digital environments, with implications for multiple industries.
The methodology presented in this paper is innovative, combining mobile room impulse response (RIR) capture with a visual-assisted acoustic field model. The use of commodity smartphones for constructing audio-visual digital twins is a significant advancement, as it democratizes access to advanced acoustic modeling techniques. The differentiable acoustic rendering for recovering surface material properties is a notable technical contribution, allowing for real-time modifications and updates to both audio and visual components. However, the paper could benefit from a more detailed explanation of the underlying algorithms and their computational efficiency.
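As a rough illustration of what differentiable acoustic rendering buys, the sketch below fits a single absorption-like decay rate so that a toy exponential energy-decay model matches a measured RIR's decay curve. AV-Twin's actual renderer models per-surface materials and geometry; every function and constant here is an assumption made for exposition only.

```python
# Illustrative-only sketch: gradient-fit one decay parameter to a measured RIR's
# energy-decay curve. Not AV-Twin's renderer.
import torch

def energy_decay(rir: torch.Tensor) -> torch.Tensor:
    """Schroeder-style backward integration of the squared RIR, normalized to start at 1."""
    e = torch.flip(torch.cumsum(torch.flip(rir ** 2, [0]), 0), [0])
    return e / e[0]

def fit_absorption(measured_rir: torch.Tensor, fs=16000, steps=500, lr=0.05):
    """Fit alpha so that exp(-alpha * t) matches the measured decay."""
    t = torch.arange(len(measured_rir)) / fs
    target = energy_decay(measured_rir)
    log_alpha = torch.zeros(1, requires_grad=True)   # optimize in log-space for positivity
    opt = torch.optim.Adam([log_alpha], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = torch.exp(-torch.exp(log_alpha) * t)  # toy differentiable "renderer"
        loss = torch.mean((pred - target) ** 2)
        loss.backward()
        opt.step()
    return torch.exp(log_alpha).item()
```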
The experimental evaluation is thorough, showcasing the effectiveness of the AV-Twin system in various scenarios. The authors provide quantitative metrics for the accuracy of acoustic reconstructions and the fidelity of the visual outputs. However, the datasets used for evaluation are not extensively described, which raises questions about the generalizability of the results. More diverse environments and material types could enhance the robustness of the findings.
The paper lacks sufficient implementation details that would allow for easy reproduction of the results. While the authors mention the use of smartphones, they do not provide specifics on the hardware or software configurations used in their experiments. Additionally, the absence of a public code repository or demo URL limits the ability of other researchers to validate the findings independently.
One limitation of the study is the reliance on commodity smartphones, which may introduce variability in the quality of the captured data. Furthermore, the system's performance may be constrained by the physical limitations of the devices used, such as microphone sensitivity and processing power. The paper also does not address potential challenges in real-world applications, such as varying environmental conditions and user expertise.
The potential applications of AV-Twin are vast, ranging from virtual reality environments to architectural design and acoustic engineering. By enabling users to create and modify audio-visual digital twins easily, this work could significantly enhance user interaction and experience in various fields. The approach could also inspire further research into integrating acoustics with other sensory modalities in digital twin technologies.
Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free settings where high-quality ground-truth captions are unavailable. While CLAPScore is currently the most widely used reference-free Audio Caption Evaluation Metric (ACEM), its robustness under diverse conditions has not been systematically validated. To address this gap, we introduce BRACE, a new benchmark designed to evaluate audio caption alignment quality in a reference-free setting. BRACE is primarily designed for assessing ACEMs, and can also be extended to measure the modality alignment abilities of Large Audio Language Models (LALMs). BRACE consists of two sub-benchmarks: BRACE-Main for fine-grained caption comparison and BRACE-Hallucination for detecting subtle hallucinated content. We construct these datasets through high-quality filtering, LLM-based corruption, and human annotation. Given the widespread adoption of CLAPScore as a reference-free ACEM and the increasing application of LALMs in audio-language tasks, we evaluate both approaches using the BRACE benchmark, testing CLAPScore across various CLAP model variants and assessing multiple LALMs. Notably, even the best-performing CLAP-based ACEM achieves only a 70.01 F1-score on the BRACE-Main benchmark, while the best LALM reaches just 63.19. By revealing the limitations of CLAP models and LALMs, our BRACE benchmark offers valuable insights into the direction of future research.
Primary: Peking University
All Institutions: Peking University, University of Chinese Academy of Sciences
The main contribution of this paper is the introduction of BRACE, a benchmark for evaluating audio caption quality in a reference-free setting, which addresses critical gaps in the assessment of audio captioning metrics. This work significantly advances the field by providing a structured approach to evaluate model performance and identify areas for improvement, thereby fostering future research in audio-language understanding.
The paper introduces BRACE, a benchmark specifically designed for evaluating audio captioning metrics in a reference-free setting. It comprises two sub-benchmarks, BRACE-Main and BRACE-Hallucination, which assess fine-grained caption comparisons and hallucination detection, respectively. The methodology is robust, utilizing a combination of high-quality filtering, LLM-based corruption, and human annotation to construct datasets. The dual focus on both the quality of audio-caption alignment and the detection of hallucinations presents a comprehensive approach to addressing existing gaps in audio caption evaluation metrics. The use of diverse models and evaluation strategies enhances the credibility of the findings.
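For readers unfamiliar with CLAPScore, the reference-free scoring that BRACE stress-tests can be sketched as a cosine similarity between CLAP audio and text embeddings, with fine-grained comparison reducing to picking the higher-scoring caption. The sketch assumes the Hugging Face ClapModel/ClapProcessor interface and the laion/clap-htsat-unfused checkpoint; BRACE's own evaluation harness may differ.

```python
# Minimal reference-free CLAPScore sketch (not the BRACE evaluation code).
import torch
from transformers import ClapModel, ClapProcessor

model = ClapModel.from_pretrained("laion/clap-htsat-unfused")
processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

def clap_score(audio_waveform, caption, sampling_rate=48000):
    """Cosine similarity between the CLAP audio embedding and the caption embedding."""
    audio_in = processor(audios=audio_waveform, sampling_rate=sampling_rate, return_tensors="pt")
    text_in = processor(text=[caption], return_tensors="pt", padding=True)
    with torch.no_grad():
        a = model.get_audio_features(**audio_in)
        t = model.get_text_features(**text_in)
    return torch.nn.functional.cosine_similarity(a, t).item()

def prefer(audio, caption_a, caption_b):
    """BRACE-Main-style fine-grained comparison: keep the higher-scoring caption."""
    return caption_a if clap_score(audio, caption_a) >= clap_score(audio, caption_b) else caption_b
```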
The experiments conducted on the BRACE benchmark reveal significant insights into the performance of CLAP-based ACEMs and LALMs. The results indicate that even the best-performing models struggle to achieve high scores, highlighting the challenges in audio caption evaluation. The evaluation metrics are well-defined, and the performance of various models is systematically compared, providing a clear understanding of their limitations. The rigorous testing across different model architectures adds depth to the experimental evaluation.
The authors have taken steps to ensure reproducibility by providing access to the evaluation code and benchmark datasets. Detailed descriptions of the experimental configurations, including model settings and evaluation strategies, are included. However, the paper could benefit from more explicit instructions on how to replicate the experiments, particularly regarding the specific prompts and configurations used in LALM evaluations.
The paper acknowledges certain limitations, particularly regarding the performance of existing models on the benchmark. However, it could further elaborate on potential biases in the dataset construction process and the implications of using LLMs for generating and corrupting captions. Additionally, the computational constraints faced during experiments limit the ability to conduct extensive evaluations, which could affect the generalizability of the results.
The development of BRACE has significant implications for the field of audio understanding, particularly in enhancing accessibility and content indexing. By providing a reliable benchmark for evaluating audio captioning metrics, it can drive improvements in model development and evaluation practices. However, the potential for misuse of audio captioning technologies, such as generating misleading or inaccurate captions, should be considered, and appropriate safeguards should be discussed.
Detecting partial deepfake speech is challenging because manipulations occur only in short regions while the surrounding audio remains authentic. However, existing detection methods are fundamentally limited by the quality of available datasets, many of which rely on outdated synthesis systems and generation procedures that introduce dataset-specific artifacts rather than realistic manipulation cues. To address this gap, we introduce HQ-MPSD, a high-quality multilingual partial deepfake speech dataset. HQ-MPSD is constructed using linguistically coherent splice points derived from fine-grained forced alignment, preserving prosodic and semantic continuity and minimizing audible and visual boundary artifacts. The dataset contains 350.8 hours of speech across eight languages and 550 speakers, with background effects added to better reflect real-world acoustic conditions. MOS evaluations and spectrogram analysis confirm the high perceptual naturalness of the samples. We benchmark state-of-the-art detection models through cross-language and cross-dataset evaluations, and all models experience performance drops exceeding 80% on HQ-MPSD. These results demonstrate that HQ-MPSD exposes significant generalization challenges once low-level artifacts are removed and multilingual and acoustic diversity are introduced, providing a more realistic and demanding benchmark for partial deepfake detection. The dataset can be found at: https://zenodo.org/records/17929533.
Primary: Tsinghua University
All Institutions: Shenzhen Key Laboratory of Ubiquitous Data Enabling, Tsinghua University; Department of Electrical, Computer & Biomedical Engineering, Toronto Metropolitan University
The paper introduces HQ-MPSD, a high-quality multilingual dataset for partial deepfake speech detection, addressing critical gaps in existing datasets and providing a rigorous benchmark for evaluating detection models. The comprehensive methodology and experimental evaluations demonstrate significant contributions to the field, paving the way for advancements in robust detection systems.
The methodology for constructing the HQ-MPSD dataset is robust and innovative. It employs a three-stage process for generating partial deepfake speech that emphasizes linguistic coherence and acoustic fidelity. The use of fine-grained forced alignment for splice points and the normalization of loudness and spectral characteristics are noteworthy techniques that enhance the quality of the dataset. Additionally, the incorporation of background effects to simulate real-world conditions is a significant improvement over existing datasets. The careful design choices made to minimize artifacts and ensure natural transitions contribute to the dataset's overall quality and applicability for training detection models.
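The splicing idea can be illustrated with a toy crossfade at forced-alignment boundaries: a word span in a bona fide utterance is replaced by synthesized audio, with short linear fades at both joins to avoid audible clicks. This is a hypothetical simplification; HQ-MPSD additionally normalizes loudness and spectral characteristics and adds background effects, and the fade length and sampling rate below are assumptions.

```python
# Toy boundary-aware splice, not the HQ-MPSD generation pipeline.
import numpy as np

def crossfade_splice(original, synthetic, start, end, fs=16000, fade_ms=10):
    """Replace original[start:end] with `synthetic`, crossfading at both joins.
    Assumes start/end come from forced alignment and all pieces exceed the fade length."""
    fade = int(fs * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, fade)
    left, right, syn = original[:start].copy(), original[end:].copy(), synthetic.copy()
    left[-fade:] = left[-fade:] * (1 - ramp) + syn[:fade] * ramp     # fade into the insert
    syn = syn[fade:]
    right[:fade] = syn[-fade:] * (1 - ramp) + right[:fade] * ramp    # fade back out of it
    syn = syn[:-fade]
    return np.concatenate([left, syn, right])
```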
The experiments conducted using HQ-MPSD are comprehensive and well-structured. The cross-language and cross-dataset evaluations provide valuable insights into the generalization capabilities of state-of-the-art detection models. The performance drop observed in existing models when tested on HQ-MPSD highlights the dataset's effectiveness in revealing the limitations of current methodologies. The use of metrics such as Equal Error Rate (EER) and Area Under the Curve (AUC) for evaluation is appropriate and provides a clear understanding of model performance.
The paper provides sufficient detail regarding the dataset generation process and experimental setup, which aids in reproducibility. However, the lack of a publicly available code repository limits the ability for others to fully replicate the experiments. The dataset itself is accessible, which is a positive aspect for researchers looking to build upon this work.
While the dataset is a significant advancement, it may still have limitations regarding the diversity of accents and dialects within the eight languages represented. Additionally, the reliance on forced alignment may introduce its own biases, particularly if the alignment tools are not perfectly accurate. The paper does not address potential ethical concerns related to the misuse of deepfake technology, which is an important consideration in this field.
The development of HQ-MPSD has the potential to significantly advance the field of deepfake detection by providing a high-quality, multilingual benchmark that can improve the robustness of detection models. The dataset's design encourages the exploration of genuine manipulation cues rather than superficial artifacts, which can lead to more effective solutions in real-world applications. This work is particularly relevant in the context of misinformation and security, where the ability to detect partial deepfake speech can have substantial societal implications.
Recent codec-based language models (LMs) have revolutionized text-to-speech (TTS). However, since standard codecs tightly couple timbre and prosody, continuation-based LMs inevitably replicate this entanglement, hindering independent control. Recent efforts attempt to break this entanglement via codec design, but insufficient decoupling remains a critical bottleneck. To tackle this challenge, we propose DisCo-Speech, a zero-shot controllable TTS framework that enables prosody control and voice cloning via a disentangled speech codec (DisCodec) and an LM-based generator. The core component, DisCodec, consists of two stages: 1) tri-factor disentanglement, which explicitly factorizes speech into content, prosody, and timbre subspaces via parallel encoders and hybrid losses; and 2) fusion and reconstruction, which fuses content and prosody into unified content-prosody tokens suitable for LM prediction, while jointly optimizing reconstruction quality to resolve the disentanglement-reconstruction trade-off. With this design, the LM performs prosodic continuation from a style prompt while the decoder handles target timbre injection, enabling flexible zero-shot control. Experiments show that DisCo-Speech matches state-of-the-art voice cloning performance while outperforming baselines in zero-shot prosody control. By resolving the core entanglement at the codec level, DisCo-Speech provides a robust foundation for controllable speech synthesis. Audio samples are available at https://github.com/disco-speech/DisCo-Speech, and the code and weights will be released at the same link.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of DisCo-Speech, a novel framework for zero-shot controllable speech generation that achieves independent control over speaker timbre and speaking prosody through a disentangled speech codec. This work represents a significant step forward in the field of text-to-speech synthesis, addressing critical challenges in disentanglement and control, and providing a robust foundation for future research and applications.
The proposed methodology of DisCo-Speech is innovative, focusing on disentangling speech attributes into content, prosody, and timbre through a two-stage training paradigm. The tri-factor disentanglement approach is a significant advancement over existing methods, allowing for independent control over speech generation. The use of hybrid losses and parallel encoders is well-justified, addressing the disentanglement-reconstruction trade-off effectively. The integration of a standard LM for prosodic continuation and a specialized decoder for waveform synthesis is a thoughtful design choice that enhances the flexibility of the system.
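A schematic of the tri-factor idea, three parallel encoders whose content and prosody streams are fused into LM-ready tokens while a pooled timbre vector conditions the decoder, might look like the sketch below. It is not the released DisCodec architecture; the layer choices, dimensions, and placeholder quantizer are all assumptions.

```python
# Schematic sketch of tri-factor disentanglement, not the DisCodec implementation.
import torch
import torch.nn as nn

class TriFactorSketch(nn.Module):
    """Parallel content / prosody / timbre encoders with a content-prosody fusion
    head and a placeholder nearest-neighbour quantizer."""
    def __init__(self, n_mels=80, d=256, n_tokens=1024):
        super().__init__()
        def enc():
            return nn.Sequential(nn.Conv1d(n_mels, d, 3, padding=1), nn.GELU(),
                                 nn.Conv1d(d, d, 3, padding=1))
        self.content_enc, self.prosody_enc, self.timbre_enc = enc(), enc(), enc()
        self.fuse = nn.Conv1d(2 * d, d, 1)
        self.codebook = nn.Embedding(n_tokens, d)

    def forward(self, mel):                               # mel: (B, n_mels, T)
        c = self.content_enc(mel)                         # (B, d, T)
        p = self.prosody_enc(mel)                         # (B, d, T)
        s = self.timbre_enc(mel).mean(dim=-1)             # (B, d) utterance-level timbre
        fused = self.fuse(torch.cat([c, p], dim=1))       # content-prosody stream
        flat = fused.transpose(1, 2)                      # (B, T, d)
        cb = self.codebook.weight.unsqueeze(0).expand(flat.size(0), -1, -1)
        tokens = torch.cdist(flat, cb).argmin(dim=-1)     # (B, T) token ids for the LM
        return tokens, s
```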
The experimental evaluation is thorough, utilizing a diverse dataset and comparing DisCo-Speech against state-of-the-art models. The results demonstrate competitive performance in voice cloning and prosody control, with clear metrics provided for reconstruction quality and controllability. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings. However, more extensive comparisons with a broader range of existing methods could provide deeper insights into its relative performance.
The paper provides sufficient detail regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. The authors also mention plans to release code and weights, which is essential for enabling other researchers to validate the findings and build upon the work. However, the absence of specific details about the training data and preprocessing steps could hinder full reproducibility.
The paper acknowledges limitations, including lower speaker similarity compared to multi-stage systems and potential instability in generating exaggerated prosody. The delicate balance between disentanglement and reconstruction fidelity is also highlighted as an ongoing challenge. These limitations suggest areas for future improvement, particularly in enhancing the expressive range and fidelity of the generated speech.
The advancements presented in DisCo-Speech have significant implications for applications in human-computer interaction, entertainment, and accessibility technologies. The ability to generate speech with controlled prosody and timbre could enhance user experience in virtual assistants, audiobooks, and language learning tools. Furthermore, the framework's potential for zero-shot learning could democratize access to high-quality speech synthesis across diverse languages and dialects.
This paper proposes a unified framework, All-in-One ASR, that allows a single model to support multiple automatic speech recognition (ASR) paradigms, including connectionist temporal classification (CTC), attention-based encoder-decoder (AED), and Transducer, in both offline and streaming modes. While each ASR architecture offers distinct advantages and trade-offs depending on the application, maintaining separate models for each scenario incurs substantial development and deployment costs. To address this issue, we introduce a multi-mode joiner that enables seamless integration of various ASR modes within a single unified model. Experiments show that All-in-One ASR significantly reduces the total model footprint while matching or even surpassing the recognition performance of individually optimized ASR models. Furthermore, joint decoding leverages the complementary strengths of different ASR modes, yielding additional improvements in recognition accuracy.
Primary: NTT, Inc.
All Institutions: NTT, Inc.
The paper presents a novel framework that unifies multiple ASR paradigms into a single model, significantly reducing complexity and enhancing performance. The comprehensive methodology and rigorous experimental validation highlight its potential to advance the state of the art in automatic speech recognition.
The proposed All-in-One ASR framework introduces a multi-mode joiner that effectively integrates CTC, AED, and Transducer models into a single architecture. This unification is significant as it reduces the model footprint and computational overhead while maintaining or improving recognition performance. The methodology is well-structured, leveraging joint training and decoding strategies to exploit the strengths of different ASR paradigms without the need for separate decoder branches. The use of a shared encoder and the innovative joiner mechanism are noteworthy contributions that address the challenges of model complexity and resource efficiency in ASR systems.
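One speculative way to picture a multi-mode joiner is a single projection head that emits CTC logits when given only encoder states and Transducer/AED-style lattices when a predictor or decoder state is also supplied. The sketch below follows that reading; the paper's actual joiner may be structured differently, and all dimensions are assumptions.

```python
# Speculative "multi-mode joiner" sketch, not the paper's implementation.
import torch
import torch.nn as nn

class MultiModeJoiner(nn.Module):
    """One head for several ASR modes: CTC uses encoder states alone; Transducer/AED
    additionally combine a prediction-network or decoder state."""
    def __init__(self, d_enc=512, d_pred=512, d_join=512, vocab=5000):
        super().__init__()
        self.enc_proj = nn.Linear(d_enc, d_join)
        self.pred_proj = nn.Linear(d_pred, d_join)
        self.out = nn.Linear(d_join, vocab + 1)          # +1 for the blank symbol

    def forward(self, enc_state, pred_state=None):
        z = self.enc_proj(enc_state)                     # (B, T, d_join)
        if pred_state is not None:
            z = z.unsqueeze(2) + self.pred_proj(pred_state).unsqueeze(1)  # (B, T, U, d_join)
        return self.out(torch.tanh(z))

# CTC mode:        logits = joiner(enc_out)            -> (B, T, V+1)
# Transducer mode: logits = joiner(enc_out, pred_out)  -> (B, T, U, V+1)
```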
The experimental evaluation is robust, utilizing well-established datasets such as TED-LIUM and LibriSpeech to demonstrate the effectiveness of the All-in-One ASR framework. The results indicate that the proposed model not only matches but often surpasses the performance of individually optimized models across various ASR tasks. The paper provides detailed comparisons and ablation studies that substantiate the claims of improved performance and reduced model size, showcasing the framework's versatility in both offline and streaming modes.
While the paper outlines the architecture and training procedures in detail, it lacks specific URLs or repositories for code and datasets, which could hinder reproducibility. The absence of a public demo or project page further limits the ability of other researchers to replicate the results. However, the comprehensive description of the methodologies and experimental setups provides a solid foundation for future implementations.
One limitation is the potential complexity introduced by the multi-mode joiner, which may require careful tuning of hyperparameters to achieve optimal performance across different ASR tasks. Additionally, the paper does not address the implications of scaling this framework to more complex or diverse ASR tasks beyond those tested. The reliance on specific datasets may also limit the generalizability of the findings.
The All-in-One ASR framework has significant implications for the deployment of ASR systems in resource-constrained environments, such as mobile devices or embedded systems, where model size and computational efficiency are critical. By unifying multiple ASR paradigms, this approach could streamline the development process and reduce costs, making advanced speech recognition technology more accessible across various applications.
This technical report presents a new paradigm for full-song symbolic music generation. Existing symbolic models operate on note-attribute tokens and suffer from extremely long sequences, limited context length, and weak support for long-range structure. We address these issues by introducing PhraseVAE and PhraseLDM, the first latent diffusion framework designed for full-song multitrack symbolic music. PhraseVAE compresses variable-length polyphonic note sequences into compact 64-dimensional phrase-level representations with high reconstruction fidelity, allowing efficient training and a well-structured latent space. Built on this latent space, PhraseLDM generates an entire multi-track song in a single pass without any autoregressive components. The system eliminates bar-wise sequential modeling, supports up to 128 bars of music (8 minutes at 64 bpm), and produces complete songs with coherent local texture, idiomatic instrument patterns, and clear global structure. With only 45M parameters, our framework generates a full song within seconds while maintaining competitive musical quality and generation diversity. Together, these results show that phrase-level latent diffusion provides an effective and scalable solution to long-sequence modeling in symbolic music generation. We hope this work encourages future symbolic music research to move beyond note-attribute tokens and to consider phrase-level units as a more effective and musically meaningful modeling target.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of a novel latent diffusion framework for full-song multitrack symbolic music generation, which addresses significant limitations in existing models. The methodology and results indicate a promising direction for future research in symbolic music generation, although improvements in reproducibility and evaluation metrics are necessary for broader adoption and validation in the field.
The paper introduces PhraseVAE and PhraseLDM, which leverage latent diffusion for symbolic music generation. The methodology is innovative as it compresses polyphonic note sequences into a structured latent space, allowing for efficient training and generation. The use of phrase-level representations instead of note-attribute tokens is a significant shift that addresses limitations in existing models. However, the details on the training process and the specific architecture of the latent diffusion model could be elaborated further to enhance understanding.
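The phrase-level compression can be illustrated with a standard VAE bottleneck that maps a pooled phrase encoding to a 64-dimensional latent via the reparameterization trick. This shows only the bottleneck, not PhraseVAE's full encoder-decoder, and the input dimensionality is an assumption.

```python
# Minimal phrase-level VAE bottleneck sketch (not the PhraseVAE architecture).
import torch
import torch.nn as nn

class PhraseBottleneck(nn.Module):
    """Map a pooled phrase encoding to a 64-d latent and back (reparameterization trick)."""
    def __init__(self, d_in=512, d_latent=64):
        super().__init__()
        self.to_mu = nn.Linear(d_in, d_latent)
        self.to_logvar = nn.Linear(d_in, d_latent)
        self.to_out = nn.Linear(d_latent, d_in)

    def forward(self, phrase_feat):                      # (B, d_in) pooled phrase encoding
        mu, logvar = self.to_mu(phrase_feat), self.to_logvar(phrase_feat)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return self.to_out(z), z, kl                     # reconstruction input, latent, KL term
```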
The experiments demonstrate the framework's ability to generate full songs with coherent structure and idiomatic instrument patterns. The evaluation metrics used to assess musical quality and generation diversity are not explicitly detailed, which could limit the assessment of the model's performance. The ability to generate 128 bars of music in a single pass is a notable achievement, indicating a strong technical contribution.
The paper does not provide sufficient details on the implementation or datasets used for training and evaluation, which raises concerns about reproducibility. Including a code repository or supplementary materials would greatly enhance the reproducibility of the results.
One limitation is the lack of detailed evaluation metrics and comparisons with existing state-of-the-art models. Additionally, while the model can generate music quickly, the paper does not discuss potential challenges in ensuring the musicality and creativity of the generated pieces over longer sequences.
The proposed framework has the potential to significantly advance the field of symbolic music generation, encouraging researchers to explore phrase-level modeling. This could lead to more sophisticated music generation systems that better capture the nuances of musical composition. The approach may also inspire applications in interactive music systems and automated composition tools.
A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models, a major class of modern music generative models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) by constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo; samples are provided on our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MR-FlowDPO, a novel framework that enhances flow-matching-based music generation through Direct Preference Optimization with multiple musical rewards, significantly improving alignment with human preferences. This work represents a meaningful advancement in music generation, combining innovative methodologies with practical applications, although it could benefit from clearer experimental details and a deeper exploration of limitations.
The methodology presented in MR-FlowDPO is innovative, leveraging Direct Preference Optimization (DPO) to align music generation with human preferences. The approach of using multiple musical rewards to evaluate text alignment, audio production quality, and semantic consistency is well-structured. The integration of scalable off-the-shelf models for reward prediction is a practical choice that enhances the model's applicability. However, the paper could benefit from a more detailed explanation of the scoring mechanism and how it specifically improves rhythmic stability.
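A plausible sketch of the preference-data construction is to z-score each reward (text alignment, production quality, semantic consistency) across the candidates for one prompt, average them, and pair the best and worst candidates as (chosen, rejected) for DPO. The reward functions, weighting, and aggregation below are assumptions rather than the paper's exact recipe.

```python
# Hypothetical preference-pair construction from multiple rewards.
import numpy as np

def build_preference_pair(candidates, reward_fns, weights=None):
    """candidates: generated clips for one prompt; reward_fns: callables returning scalar rewards."""
    scores = np.array([[fn(c) for fn in reward_fns] for c in candidates])    # (N, R)
    z = (scores - scores.mean(axis=0)) / (scores.std(axis=0) + 1e-8)         # per-reward z-score
    w = np.ones(z.shape[1]) if weights is None else np.asarray(weights)
    total = z @ w
    return candidates[int(total.argmax())], candidates[int(total.argmin())]  # (chosen, rejected)
```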
The experiments conducted are extensive, utilizing both objective metrics and human evaluations to assess the effectiveness of the proposed model. The results indicate a significant improvement over competitive baselines, which strengthens the claims made in the paper. However, the paper lacks a detailed description of the datasets used, which is crucial for understanding the generalizability of the findings.
The authors provide links to their code and demo page, which is a positive aspect for reproducibility. However, the paper does not sufficiently detail the experimental setup, including hyperparameters and training procedures, which may hinder full reproducibility by other researchers.
One limitation is the potential subjectivity in human evaluations, which can vary widely among individuals. Additionally, the reliance on off-the-shelf models for reward prediction may introduce biases based on the limitations of those models. The paper could also explore the scalability of the approach in real-world applications beyond the experimental settings.
The implications of this research are significant for the field of music generation, as it addresses the subjective nature of music evaluation and aims to create models that better align with human preferences. This could lead to more personalized music generation applications, enhancing user experience in various domains such as entertainment and therapy.
Automated audio captioning models frequently produce overconfident predictions regardless of semantic accuracy, limiting their reliability in deployment. This deficiency stems from two factors: evaluation metrics based on n-gram overlap that fail to capture semantic correctness, and the absence of calibrated confidence estimation. We present a framework that addresses both limitations by integrating confidence prediction into audio captioning and redefining correctness through semantic similarity. Our approach augments a Whisper-based audio captioning model with a learned confidence prediction head that estimates uncertainty from decoder hidden states. We employ CLAP audio-text embeddings and sentence transformer similarities (FENSE) to define semantic correctness, enabling Expected Calibration Error (ECE) computation that reflects true caption quality rather than surface-level text overlap. Experiments on Clotho v2 demonstrate that confidence-guided beam search with semantic evaluation achieves dramatically improved calibration (CLAP-based ECE of 0.071) compared to greedy decoding baselines (ECE of 0.488), while simultaneously improving caption quality across standard metrics. Our results establish that semantic similarity provides a more meaningful foundation for confidence calibration in audio captioning than traditional n-gram metrics.
Primary: Northeastern University
All Institutions: Northeastern University
The paper presents a framework for confidence-calibrated audio captioning that redefines correctness through semantic similarity. The contributions are significant, as they advance the state of the art in audio captioning by addressing overconfidence and improving the reliability of model predictions through innovative methodologies.
The paper introduces a novel framework for confidence calibration in automated audio captioning that integrates a learned confidence prediction head with a Whisper-based model. This approach is innovative as it shifts the focus from traditional n-gram overlap metrics to semantic similarity for evaluating correctness, which is a significant advancement in the field. The architecture is well-defined, with clear descriptions of the confidence prediction head, temperature scaling, and confidence-guided beam search. The methodology is robust and addresses existing limitations in audio captioning systems effectively.
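The calibration measurement can be sketched as a standard binned ECE in which "correct" means the caption's semantic similarity to the audio (e.g., CLAP or FENSE) exceeds a threshold. The threshold and bin count below are illustrative choices, not the paper's settings.

```python
# Sketch of Expected Calibration Error with semantic correctness.
import numpy as np

def semantic_ece(confidences, similarities, threshold=0.5, n_bins=10):
    """ECE where 'correct' = semantic similarity above a threshold."""
    conf = np.asarray(confidences, dtype=float)
    correct = (np.asarray(similarities, dtype=float) >= threshold).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf <= hi) if hi == 1.0 else (conf >= lo) & (conf < hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```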
The experiments conducted on the Clotho v2 dataset are comprehensive, demonstrating substantial improvements in both calibration and caption quality metrics. The results are compelling, with a dramatic reduction in Expected Calibration Error (ECE) from 0.488 to 0.071, showcasing the effectiveness of the proposed method. Additionally, the paper provides quantitative results across multiple evaluation metrics (BLEU, CIDEr, CLAP similarity), which strengthens the validity of the findings.
The implementation details are adequately described, including the model architecture, training parameters, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits reproducibility. Future work should consider making the code accessible to facilitate validation of results by the research community.
The paper acknowledges several limitations, including the somewhat arbitrary threshold for semantic correctness and the evaluation being limited to the Clotho dataset. The authors also note that the confidence head may not capture all sources of uncertainty, suggesting areas for future exploration. These limitations are important to consider for the generalization of the findings.
The proposed framework has significant implications for real-world applications of automated audio captioning, particularly in accessibility technologies and content indexing. By improving the reliability of predictions, this work could enhance user trust in automated systems, leading to broader adoption in various domains.
General-purpose audio representations aim to map acoustically variable instances of the same event to nearby points, resolving content identity in a zero-shot setting. Unlike supervised classification benchmarks that measure adaptability via parameter updates, we introduce VocSim, a training-free benchmark probing the intrinsic geometric alignment of frozen embeddings. VocSim aggregates 125k single-source clips from 19 corpora spanning human speech, animal vocalizations, and environmental sounds. By restricting to single-source audio, we isolate content representation from the confound of source separation. We evaluate embeddings using Precision@k for local purity and the Global Separation Rate (GSR) for point-wise class separation. To calibrate GSR, we report lift over an empirical permutation baseline. Across diverse foundation models, a simple pipeline (frozen Whisper encoder features, time-frequency pooling, and label-free PCA) yields strong zero-shot performance. However, VocSim also uncovers a consistent generalization gap. On blind, low-resource speech, local retrieval drops sharply. While performance remains statistically distinguishable from chance, the absolute geometric structure collapses, indicating a failure to generalize to unseen phonotactics. As external validation, our top embeddings predict avian perceptual similarity, improve bioacoustic classification, and achieve state-of-the-art results on the HEAR benchmark. We posit that the intrinsic geometric quality measured here proxies utility in unlisted downstream applications. We release data, code, and a public leaderboard to standardize the evaluation of intrinsic audio geometry.
Primary: Institute of Neuroinformatics, University of Zurich and ETH Zurich
All Institutions: Institute of Neuroinformatics, University of Zurich and ETH Zurich; Institute for the Interdisciplinary Study of Language Evolution, University of Zurich
The paper introduces VocSim, a training-free benchmark for evaluating zero-shot content identity in audio representations, significantly contributing to the field by providing a rigorous framework for assessing the intrinsic quality of audio embeddings. The comprehensive methodology and experimental validation enhance its relevance and potential impact on future research in audio processing and machine learning.
The authors present a novel benchmark, VocSim, designed to evaluate the intrinsic geometric alignment of audio embeddings in a zero-shot setting. The methodology is robust, employing a large dataset of 125k single-source audio clips aggregated from diverse corpora, which allows for rigorous testing of generalization capabilities. The use of training-free metrics, such as Precision@k and Global Separation Rate (GSR), is innovative and addresses the limitations of existing benchmarks that rely on supervised learning paradigms. The transductive PCA approach to mitigate anisotropy in embedding spaces is a thoughtful addition that enhances the evaluation process.
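The local-purity probe is straightforward to sketch: Precision@k is the average fraction of each clip's k nearest neighbours that share its class label. The snippet below shows only that computation; the GSR and the permutation-baseline lift are not reproduced, and the distance metric is an assumption.

```python
# Precision@k sketch for frozen embeddings (not the full VocSim evaluation suite).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def precision_at_k(embeddings, labels, k=5, metric="cosine"):
    """Fraction of each clip's k nearest neighbours (excluding itself) sharing its label."""
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)                   # column 0 is the query itself
    return float((labels[idx[:, 1:]] == labels[:, None]).mean())
```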
The experiments are comprehensive, evaluating multiple foundation models and providing a detailed analysis of their performance across different audio domains. The results reveal significant insights into the generalization capabilities of these models, particularly highlighting a notable performance drop on blind, low-resource speech datasets. The external validation of the embeddings' utility in predicting avian perceptual similarity and achieving state-of-the-art results on the HEAR benchmark further underscores the practical implications of the findings.
The paper includes a clear description of the experimental setup, data preprocessing, and evaluation metrics, which facilitates reproducibility. The authors have made their code and dataset publicly available, enhancing the transparency and accessibility of their research. However, the reliance on transductive PCA may complicate strict adherence to zero-shot evaluation protocols, which could be a point of contention for some researchers.
The primary limitation noted is the generalization gap observed in low-resource speech, indicating that current models may not generalize well across different phonotactics. Additionally, the benchmark's focus on single-source audio excludes polyphonic scenarios, which may limit its applicability in real-world contexts. The ethical considerations regarding data sovereignty and the handling of indigenous language data are also acknowledged, which is commendable but reflects the complexities involved in such research.
The implications of this research are significant, particularly in addressing biases in low-resource languages and the potential perpetuation of digital divides in audio processing technologies. By highlighting the performance disparities of state-of-the-art models on underrepresented languages, the authors advocate for more equitable advancements in machine learning applications. The benchmark also serves as a tool for future research to improve audio representation models, fostering advancements in bioacoustics and environmental sound classification.
Multi-participant meetings occur across various domains, such as business negotiations and medical consultations, during which sensitive information like trade secrets, business strategies, and patient conditions is often discussed. Previous research has demonstrated that attackers with mmWave radars outside the room can overhear meeting content by detecting minute speech-induced vibrations on objects. However, these eavesdropping attacks cannot differentiate which speech content comes from which person in a multi-participant meeting, leading to potential misunderstandings and poor decision-making. In this paper, we answer the question ``who speaks what''. By leveraging the spatial diversity introduced by ubiquitous objects, we propose an attack system that enables attackers to remotely eavesdrop on in-person conversations without requiring prior knowledge, such as identities, the number of participants, or seating arrangements. Since participants in in-person meetings are typically seated at different locations, their speech induces distinct vibration patterns on nearby objects. To exploit this, we design a noise-robust unsupervised approach for distinguishing participants by detecting speech-induced vibration differences in the frequency domain. Meanwhile, a deep learning-based framework is explored to combine signals from objects for speech quality enhancement. We validate the proof-of-concept attack on speech classification and signal enhancement through extensive experiments. The experimental results show that our attack can achieve the speech classification accuracy of up to $0.99$ with several participants in a meeting room. Meanwhile, our attack demonstrates consistent speech quality enhancement across all real-world scenarios, including different distances between the radar and the objects.
Primary: Florida State University
All Institutions: Florida State University, University of Tennessee at Chattanooga
The paper presents a novel attack that enables remote eavesdropping on in-person conversations via mmWave sensing. The comprehensive analysis of the technical contribution, methodology, and significance to the field underscores the potential for both innovative applications and ethical considerations in machine learning and privacy.
The paper presents an innovative unsupervised approach to eavesdropping on in-person conversations using mmWave sensing technology. The methodology is well-structured, addressing significant challenges such as low resolution of vibration signals and interference from static objects. The authors propose a multi-module system that includes a speech-aware calibration scheme, a noise-robust signal processing pipeline, and a deep learning framework for signal enhancement. The unsupervised clustering method for speaker distinction is particularly noteworthy as it allows for effective speaker attribution without prior knowledge of the number of speakers or their identities.
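Since the authors' pipeline is not public, the following is only a rough sketch of the general idea: group speech segments by speaker using coarse frequency-domain energy profiles, choosing the number of clusters automatically rather than assuming it. The function names, band count, and cluster-selection heuristic are illustrative assumptions, not the paper's implementation.

```python
# Rough sketch of unsupervised speaker distinction from vibration segments.
# Everything here (features, clustering choice, parameters) is an assumption
# made for illustration; it is not the authors' pipeline.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import normalize

def band_energies(segment, n_bands=16):
    """Coarse spectral energy profile of one vibration segment."""
    spectrum = np.abs(np.fft.rfft(segment)) ** 2
    return np.array([band.sum() for band in np.array_split(spectrum, n_bands)])

def cluster_speakers(segments, max_speakers=6):
    """Select the cluster count by silhouette score instead of assuming it."""
    feats = normalize(np.stack([band_energies(s) for s in segments]))
    best_labels, best_score = None, -1.0
    for k in range(2, max_speakers + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(feats)
        score = silhouette_score(feats, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels

# Toy usage: 40 random 1-second segments standing in for radar-derived signals.
rng = np.random.default_rng(1)
labels = cluster_speakers([rng.normal(size=1000) for _ in range(40)])
```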
The experimental validation is extensive, demonstrating the effectiveness of the proposed attack in various real-world scenarios, including different object materials and speaker arrangements. The reported success rates for speaker distinction (up to 0.99) and consistent signal enhancement across setups provide strong evidence of the method's robustness. However, the reliance on synthetic datasets for training may raise questions about the generalizability of the results.
While the paper provides detailed descriptions of the experimental setup and methodology, there is a lack of publicly available code or datasets, which could hinder reproducibility. The authors mention using a specific radar system and a synthetic dataset, but without access to these resources, independent verification of results may be challenging.
The study does not address the ethical concerns raised by the proposed eavesdropping technique. Additionally, while the method shows promise in controlled environments, its effectiveness in more complex real-world scenarios with varying noise levels and object types remains uncertain. Performance may degrade significantly under less controlled conditions, and the paper does not explore this thoroughly.
The implications of this research are significant, as it highlights potential privacy risks associated with passive sensing technologies in shared environments. The ability to eavesdrop on sensitive conversations raises ethical concerns regarding surveillance and data privacy. This work could inform future regulations and security measures to protect against such vulnerabilities in various settings, including corporate and healthcare environments.
Controllable text-to-speech (TTS) systems face significant challenges in achieving independent manipulation of speaker timbre and speaking style, often suffering from entanglement between these attributes. We present DMP-TTS, a latent Diffusion Transformer (DiT) framework with explicit disentanglement and multi-modal prompting. A CLAP-based style encoder (Style-CLAP) aligns cues from reference audio and descriptive text in a shared space and is trained with contrastive learning plus multi-task supervision on style attributes. For fine-grained control during inference, we introduce chained classifier-free guidance (cCFG) trained with hierarchical condition dropout, enabling independent adjustment of content, timbre, and style guidance strengths. Additionally, we employ Representation Alignment (REPA) to distill acoustic-semantic features from a pretrained Whisper model into intermediate DiT representations, stabilizing training and accelerating convergence. Experiments show that DMP-TTS delivers stronger style controllability than open-source baselines while maintaining competitive intelligibility and naturalness. Code and demos will be available at https://y61329697.github.io/DMP-TTS/.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China
The paper presents DMP-TTS, a controllable TTS framework that effectively disentangles speaker timbre and speaking style through innovative methodologies. The contributions are significant, addressing key challenges in the field and providing a solid foundation for future research and applications in controllable speech synthesis.
The paper introduces DMP-TTS, a controllable TTS framework built on a latent Diffusion Transformer, with an explicit focus on disentangling speaker timbre from speaking style. The CLAP-based style encoder (Style-CLAP), which aligns cues from reference audio and descriptive text in a shared space, is a significant methodological advance. The chained classifier-free guidance (cCFG) scheme is likewise notable, allowing the content, timbre, and style guidance strengths to be adjusted independently of one another. The methodology is well structured, with clear explanations of the components and their interactions, although some technical details would benefit from further elaboration.
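The paper's exact guidance rule is not reproduced here, but a plausible reading of chained classifier-free guidance trained with hierarchical condition dropout is sketched below: each attribute adds its own guidance increment on top of the previous condition set, so its strength can be tuned independently. The weights and tensor shapes are placeholders, not values from the paper.

```python
# Hedged sketch of one plausible chained-CFG combination rule; DMP-TTS's exact
# formulation may differ. Each eps_* is the model's noise prediction under a
# progressively richer condition set (unconditional -> +content -> +timbre -> +style).
import torch

def chained_cfg(eps_uncond, eps_content, eps_content_timbre, eps_full,
                w_content=2.0, w_timbre=1.5, w_style=3.0):
    """Give each attribute its own guidance strength via chained increments."""
    return (eps_uncond
            + w_content * (eps_content - eps_uncond)
            + w_timbre * (eps_content_timbre - eps_content)
            + w_style * (eps_full - eps_content_timbre))

# Toy usage with placeholder tensors shaped like a latent noise prediction.
preds = [torch.randn(1, 8, 128) for _ in range(4)]
guided = chained_cfg(*preds)
```

Setting one of the weights to zero removes that attribute's guidance term entirely, which is the kind of independent adjustment the abstract describes.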
The experiments conducted are thorough, utilizing a high-quality dataset of approximately 300 hours of Chinese speech. The evaluation metrics are appropriate, including both objective measures (WER, speaker similarity) and subjective measures (NMOS, QMOS). The results demonstrate that DMP-TTS outperforms existing baselines in terms of style controllability while maintaining competitive intelligibility and naturalness. However, the paper could have benefited from a more detailed discussion of the statistical significance of the results.
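For readers unfamiliar with the objective measures, the snippet below shows one conventional way to compute them; the paper does not specify its tooling, and `embed_speaker` is a stand-in for whichever pretrained speaker encoder the authors used.

```python
# Conventional objective metrics for TTS, shown for illustration only.
import numpy as np
from jiwer import wer  # common WER implementation; not necessarily the authors' choice

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_speaker(wav):
    """Placeholder: swap in a real pretrained speaker encoder here."""
    return np.random.default_rng(0).normal(size=192)

# Word error rate between the ASR transcript of the synthesized speech and the input text.
print("WER:", wer("ni hao shi jie", "ni hao shi jie"))

# Speaker similarity: cosine similarity between speaker embeddings of the
# reference recording and the synthesized audio.
ref_wav, syn_wav = np.zeros(16000), np.zeros(16000)
print("SIM:", cosine_similarity(embed_speaker(ref_wav), embed_speaker(syn_wav)))
```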
The implementation details are provided, including the architecture, training procedure, and hyperparameters, which enhance reproducibility. The authors mention that code and demos will be available, which is essential for the community to validate and build upon their work. However, the absence of a direct link to a code repository limits immediate access to the implementation.
While the paper presents a strong framework, it does not address scalability to larger datasets or extension beyond Chinese to other languages. Additionally, the reliance on a specific pretrained model (Whisper) for representation alignment may limit the generalizability of the approach. The paper also notes that speaker-similarity scores fall below some baselines, suggesting room for improvement in that area.
The advancements in controllable TTS have significant implications for applications in virtual assistants, audiobooks, and any domain requiring personalized speech synthesis. The ability to independently manipulate style and timbre enhances user experience and could lead to more engaging human-computer interactions. The work may also inspire further research into disentangled representations in other domains of machine learning.